
A Variational Parametric Model for Audio Synthesis

Krishna Subramani, supervised by Prof. Preeti Rao

Electrical Engineering, Indian Institute of Technology Bombay, India

Audio Synthesis

• What comes to your mind when you hear 'Audio Synthesis'?

Figure: One of the early Moog Modular Synthesizers.

• More generally, it involves specifying controlling parameters to a synthesizer to obtain an audio output.

• What are the main parameters that govern the audio generation (at a high level)?

1. Timbre
2. Pitch
3. Loudness

• Early analog synthesizers used voltage-controlled oscillators, filters and amplifiers to generate the waveform, and 'envelope generators' to shape it.

• Data-driven statistical modeling + computing power ⇒ Deep Learning for audio synthesis.

Generative Models for Audio Synthesis

• Rely on the ability of algorithms to extract musically relevant information from vast amounts of data.

• Autoregressive modeling, Generative Adversarial Networks and Variational Autoencoders are some of the proposed generative modeling methods.

• Most methods in the literature try to model audio signals directly in the time or frequency domain.

Our Nearest Neighbours

• [Sarroff and Casey, 2014] were the first to use autoencoders to perform frame-wise reconstruction of short-time magnitude spectra.

  ✓ Learn perceptually relevant lower-dimensional representations ('latent space') of audio to help in synthesis.
  ✗ 'Graininess' in the reconstructed sound.

• [Roche et al., 2018] extended this analysis to try out different autoencoder architectures.

  ✓ Helped by the release of NSynth [Engel et al., 2017].
  ✓ Usability of the latent space in audio interpolation.

• [Esling et al., 2018] regularized the VAE latent space in order to effect control over the perceptual timbre of synthesized instruments.

  ✓ Enabled them to 'generate' and 'interpolate' audio in this space.

Our Nearest Neighbours

• A major issue with the previous methods was frame-wise analysis-synthesis based reconstruction, which fails to model temporal evolution.

• [Engel et al., 2017], inspired by WaveNet's [Oord et al., 2016] autoregressive modeling capabilities for speech, extended it to musical instrument synthesis.

  ✓ Generation of realistic and creative sounds.
  ✗ Complex network which was difficult to train and sample from.

• [Wyse, 2018] also modelled the audio autoregressively, albeit by conditioning the waveform samples on additional parameters like pitch, velocity (loudness) and instrument class.

  ✓ Achieved better control over generation by conditioning.
  ✗ Unable to generalize to untrained pitches.

Why Parametric?

• Rather than generating new timbres ('interpolating' across sounds), we consider the problem of synthesis of a given instrument sound with flexible control over the pitch and loudness dynamics.

• Pitch shifting without timbre modification uses a source-filter model, with the filter (spectral envelope) being kept constant [Roebel and Rodet, 2005].

• A powerful parametric representation, as opposed to the raw waveform or spectrogram, has the potential to achieve high quality with less training data.

  • [Blaauw and Bonada, 2016] recognized this in the context of speech synthesis and used a vocoder representation to train a generative model, achieving promising results along the way.

Dataset

• Good-sounds dataset [Romani Picas et al., 2015], consisting of individual note and scale recordings for 12 different instruments.

• We work with the 'Violin', played in mezzo-forte loudness, and choose the 4th octave (MIDI 60-71).

  Why we chose the Violin: popular in Indian music, human voice-like timbre, and the ability to produce continuous pitch.

Figure: Instances per note in the overall dataset (between 26 and 34 instances per MIDI note).

• The average note duration for the chosen octave is about 4.5 s per note.

Dataset

• We split the data into train (80%) and test (20%) instances across MIDI note labels.

• Our model is trained with frames (duration 21.3 ms) from the train instances, and system performance is evaluated with frames from the test instances.

Non-Parametric Reconstruction

• The frame-wise magnitude spectra reconstruction procedure in [Roche et al., 2018] cannot generalize to untrained pitches.

• We consider two cases - training including and excluding MIDI 63, along with its 3 neighbouring pitches on either side.

Figure: Block diagram - sustain portion → FFT → Encoder → latent Z → Decoder (AE).

• We repeat the above across all input spectral frames and invert the obtained spectrogram using Griffin-Lim [Griffin and Lim, 1984].
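For reference, a minimal sketch of the inversion step, assuming the librosa library is available; the frame and hop sizes are illustrative placeholders rather than the values used in this work:

```python
# Minimal sketch: invert a reconstructed magnitude spectrogram to audio with Griffin-Lim.
import numpy as np
import librosa

def invert_magnitude_spectrogram(mag_frames, n_fft=1024, hop_length=256, n_iter=60):
    """mag_frames: magnitude spectrogram (freq_bins x frames) assembled from the
    decoder's frame-wise outputs. Returns a time-domain signal via phase estimation."""
    return librosa.griffinlim(mag_frames, n_iter=n_iter,
                              hop_length=hop_length, win_length=n_fft)

# Usage: stack the reconstructed frame spectra column-wise, then
# audio = invert_magnitude_spectrogram(reconstructed_mag)
```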

Non-Parametric Reconstruction

Figure: Input MIDI 63 (audio example 1).
Figure: Reconstruction with MIDI 63 included in training (audio example 2); reconstruction with MIDI 63 excluded (audio example 3).

Parametric Model

1. Frame-wise magnitude spectrum → harmonic representation using the Harmonic plus Residual (HpR) model [Serra et al., 1997] (currently we neglect the residual).

Figure: Block diagram - sustain portion → FFT → HpR → f0 + harmonic magnitudes.

• Output of the HpR block ⇒ log-dB magnitudes + harmonics.
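To make this output concrete, below is a minimal sketch of harmonic-magnitude sampling for one frame with a known f0; it is a simplification (nearest-bin lookup, residual ignored), not the full HpR analysis of [Serra et al., 1997]:

```python
# Minimal sketch: sample the magnitude spectrum at integer multiples of a known f0.
import numpy as np

def harmonic_magnitudes(frame, fs, f0, n_harmonics=40, n_fft=2048):
    """Return harmonic frequencies and their log-dB magnitudes for one analysis frame."""
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame)), n_fft))
    harm_freqs = f0 * np.arange(1, n_harmonics + 1)
    harm_freqs = harm_freqs[harm_freqs < fs / 2]           # keep harmonics below Nyquist
    bins = np.round(harm_freqs * n_fft / fs).astype(int)   # nearest FFT bin per harmonic
    mags_db = 20 * np.log10(spectrum[bins] + 1e-10)        # log-dB harmonic magnitudes
    return harm_freqs, mags_db
```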

Parametric Model

2. log-dB magnitudes + harmonics → TAE algorithm [Roebel and Rodet, 2005; Imai, 1979].

• TAE ⇒ iterative cepstral liftering:

  A_0(k) = log|X(k)|,  V_0 = -∞
  while (A_i - A_0 < Δ):
      A_i(k) = max(A_{i-1}(k), V_{i-1}(k))
      V_i = FFT(lifter(A_i))

Figure: Block diagram - f0 + magnitudes → TAE → cepstral coefficients (CCs) + f0.
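A minimal sketch of this iteration in Python follows; the cepstral order, threshold and iteration cap are illustrative, and this is not the authors' released implementation:

```python
# Minimal sketch of the True Amplitude Envelope (TAE) iteration described above:
# A_i = max(A_{i-1}, V_{i-1}); V_i = cepstrally-liftered (smoothed) version of A_i.
import numpy as np

def true_amplitude_envelope(log_mag, n_ceps=91, threshold_db=0.1, max_iter=100):
    """log_mag: log-magnitude half-spectrum of one frame (length n_fft//2 + 1).
    Returns the smooth envelope V and its low-quefrency cepstral coefficients."""
    A0 = log_mag.copy()
    A = log_mag.copy()
    V = np.full_like(A, -np.inf)
    for _ in range(max_iter):
        A = np.maximum(A, V)                         # push envelope above spectral peaks
        ceps = np.fft.irfft(A)                       # real cepstrum (log spectrum is real/even)
        ceps[n_ceps:len(ceps) - n_ceps + 1] = 0.0    # lifter: keep low-quefrency part only
        V = np.fft.rfft(ceps).real                   # back to a smooth spectral envelope
        if np.max(A0 - V) < threshold_db:            # stop once envelope covers all peaks
            break
    return V, ceps[:n_ceps]
```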

Parametric Model

• No open-source implementation was available for the TAE, so we implemented it following the procedure highlighted in [Roebel and Rodet, 2005; Caetano and Rodet, 2012].

• The figure shows a TAE snapshot from [Caetano and Rodet, 2012], similar to the results we get (audio examples 4 and 5).

Parametric Model

Figure: TAE spectral envelopes (magnitude in dB vs. frequency in kHz) for MIDI 60, 65 and 71.

• The spectral envelope shape varies across pitch:

  1. Dependence of the envelope on pitch [Slawson, 1981; Caetano and Rodet, 2012].
  2. Variation due to the TAE algorithm.

• TAE → a smooth function to estimate harmonic amplitudes.

Parametric Model

• The number of CCs (cepstral coefficients), K_cc, depends on the sampling rate (Fs) and the pitch (f0):

  K_cc ≤ Fs / (2 f0)

• For our network, we choose the maximum K_cc (lowest pitch) = 91 and zero-pad the higher-pitch CC vectors to 91-dimensional vectors.
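For instance, assuming a 48 kHz sampling rate and the lowest pitch MIDI 60 (f0 ≈ 261.6 Hz), the bound gives K_cc ≤ 48000 / (2 × 261.6) ≈ 91.7, which would be consistent with the choice of 91.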

Generative Models

• Autoencoders [Hinton and Salakhutdinov, 2006] - minimize the MSE (mean squared error) between the input and the network reconstruction.

  • Simple to train, and perform good reconstruction (since they minimize MSE).
  • Not truly a generative model, as you cannot generate new data.

• Variational Autoencoders [Kingma and Welling, 2013] - inspired by variational inference, enforce a prior on the latent space.

  • Truly generative, as we can 'generate' new data by sampling from the prior.

• Conditional Variational Autoencoders [Doersch, 2016; Sohn et al., 2015] - same principle as a VAE, but the distribution is learnt conditioned on an additional conditioning variable.

Generative Models

Why VAE over AE?
A continuous latent space from which we can sample points (and synthesize the corresponding audio).

Why CVAE over VAE?
Conditioning on pitch ⇒ the network captures dependencies between the timbre and the pitch ⇒ more accurate envelope generation + pitch control.

Network Architecture

Figure: Block diagram - CCs + f0 → Encoder → latent Z → Decoder (CVAE).

• The network input is the CCs → the MSE represents a perceptually relevant distance, in terms of squared error between the input and reconstructed log-magnitude spectral envelopes.

• We train the network on frames from the train instances.

• For evaluation, the MSE values reported ahead are the average reconstruction error across all the test-instance frames.

Network Architecture

• Main hyperparameters:

1. β - controls the relative weighting between reconstruction and prior enforcement:

   L ∝ MSE + β·KLD

Figure: CVAE reconstruction MSE vs. latent space dimensionality for β = 0.01, 0.1 and 1.

• There is a tradeoff between both terms; we choose β = 0.1.
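A minimal PyTorch-style sketch of this weighted objective, assuming a diagonal-Gaussian posterior and a standard-normal prior (variable names are illustrative):

```python
# Minimal sketch of the weighted CVAE objective L ∝ MSE + β·KLD, with β = 0.1.
import torch
import torch.nn.functional as F

def cvae_loss(x_hat, x, mu, log_var, beta=0.1):
    """x_hat: decoder output, x: input CC vector,
    mu/log_var: encoder outputs parameterizing q(z | x, f0)."""
    mse = F.mse_loss(x_hat, x, reduction='mean')
    # KL divergence between N(mu, sigma^2) and N(0, I), averaged over the batch
    kld = -0.5 * torch.mean(1 + log_var - mu.pow(2) - log_var.exp())
    return mse + beta * kld
```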

Network Architecture

• Main hyperparameters:

2. Dimensionality of the latent space - governs the network's reconstruction ability.

Figure: Reconstruction MSE vs. latent space dimensionality, CVAE (β = 0.1) vs. AE.

• Steep fall initially, flatter later; we choose dimensionality = 32.

Network Architecture

• Network size: [91, 91, 32, 91, 91].
  • 91 is the dimension of the input CCs; 32 is the latent space dimensionality.

• Linear fully connected layers + Leaky ReLU activations.

• ADAM [Kingma and Ba, 2014] with initial learning rate 10^-3.

• Training for 2000 epochs with batch size 512.
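Below is a minimal PyTorch sketch consistent with these sizes; concatenating the (normalized) f0 to both the encoder and decoder inputs is an assumption about how the conditioning is injected, not the exact released model:

```python
# Minimal sketch of a CVAE matching the layer sizes [91, 91, 32, 91, 91].
import torch
import torch.nn as nn

class CVAE(nn.Module):
    def __init__(self, cc_dim=91, latent_dim=32, cond_dim=1):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(cc_dim + cond_dim, 91), nn.LeakyReLU())
        self.mu = nn.Linear(91, latent_dim)
        self.log_var = nn.Linear(91, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim + cond_dim, 91), nn.LeakyReLU(),
                                 nn.Linear(91, cc_dim))

    def forward(self, x, f0):
        h = self.enc(torch.cat([x, f0], dim=-1))
        mu, log_var = self.mu(h), self.log_var(h)
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)  # reparameterization trick
        x_hat = self.dec(torch.cat([z, f0], dim=-1))
        return x_hat, mu, log_var

# Training sketch: optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```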

Experiments

• Two kinds of experiments to demonstrate the network's capabilities:

1. Reconstruction - omit pitch instances during training and see how well the model reconstructs notes of the omitted target pitch.

2. Generation - how well the model 'synthesizes' note instances with new, unseen pitches.

Reconstruction

• Two training contexts:

1. T (✗) is the target, ✓ are training instances:

   MIDI   T-3   T-2   T-1   T    T+1   T+2   T+3
   Kept    ✓     ✓     ✓    ✗     ✓     ✓     ✓

2. Octave endpoints:

   MIDI   60   61   62   63   64   65   66   67   68   69   70   71
   Kept    ✓    ✗    ✗    ✗    ✗    ✗    ✗    ✗    ✗    ✗    ✗    ✓

• In each of the above cases, we compute the MSE as the frame-wise spectral envelope match across all frames of all the target instances.

Reconstruction

Figure: Reconstruction MSE vs. target pitch (MIDI 60-72) for the AE and the CVAE.

• Both the AE and the CVAE reasonably reconstruct the target pitch (audio examples 6, 7 and 8).

• The CVAE produces better reconstruction, especially when the target pitch is far from the pitches available in the training data, as in the plot above (audio examples 9, 10 and 11).

• Conditioning helps to capture the pitch dependency of the spectral envelope more accurately.

Reconstruction

• To emulate the effect of pitch conditioning with an AE, we train the AE by appending the pitch to the input CCs and reconstructing this appended input, as shown below.

Figure: 'Conditional' AE (cAE) - [CCs, f0] → AE → [CCs', f0'].

• [Wyse, 2018] followed a similar approach of appending the conditional variables to the input of his model.

• The cAE is comparable to our proposed CVAE, in that the network might potentially learn something from f0.

Reconstruction

Figure: Reconstruction MSE vs. target pitch (MIDI 60-72) for the AE, CVAE and cAE.

• We train the cAE on the octave endpoints and evaluate performance as shown above.

• Appending f0 does not improve performance; it is in fact worse than the AE.

Generation

• We are interested in 'synthesizing' audio.

• We see how well the network can generate an instance of a desired pitch (which the network has not been trained on).

Figure: Sampling from the network - latent Z + f0 → Decoder.

• We follow the procedure in [Blaauw and Bonada, 2016] - a random walk to sample points coherently from the latent space.
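A minimal sketch of this sampling procedure, reusing the hypothetical CVAE decoder sketched earlier; the step size and number of frames are illustrative:

```python
# Minimal sketch: generate frame-wise CCs by a random walk in the latent space,
# conditioning every frame on the desired pitch.
import torch

@torch.no_grad()
def generate_note(model, f0, n_frames=200, latent_dim=32, step=0.05):
    """f0: 1-element tensor with the (normalized) target pitch.
    Returns a (n_frames, 91) tensor of cepstral coefficients."""
    z = torch.randn(latent_dim)                    # start from the prior
    frames = []
    for _ in range(n_frames):
        z = z + step * torch.randn(latent_dim)     # small random step -> smooth evolution
        x_hat = model.dec(torch.cat([z, f0]))      # decode CCs for this frame
        frames.append(x_hat)
    return torch.stack(frames)
```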

Generation

• We train on instances across the entire octave sans MIDI 65, and then generate MIDI 65.

• On listening, we find that the generated audio still lacks the soft noisy sound of the violin bowing (audio examples 12, 13 and 14).

• For a more realistic synthesis, we generate a violin note with vibrato (audio example 15).

Putting it all together

• We explored autoencoder frameworks in generative models for audio synthesis of instrumental tones.

• Our parametric representation decouples 'timbre' and 'pitch', relying on the network to model the inter-dependencies.

• Pitch conditioning allows us to generate the learnt spectral envelope for that pitch, thus enabling us to vary the pitch contour continuously.

Putting it all together

• Our model is still far from perfect, however, and we have to improve it further:

1. An immediate next step will be to model the residual of the input audio. This will help in making the generated audio sound more natural.

2. We have not taken dynamics into account. A more complete system will involve capturing spectral envelope (timbral) dependencies on both pitch and loudness dynamics.

3. Conducting more formal listening tests, involving synthesis of larger pitch movements or melodic elements from Indian Classical music.

• Our contributions:

1. To the best of our knowledge, we have not come across any work using a parametric model for musical tones in the neural synthesis framework, especially one exploiting the conditioning function of the CVAE.

2. The TAE method has no known Python implementation, so we plan to make our code open-source to the MIR community for research.

References I

[Blaauw and Bonada, 2016] Blaauw, M. and Bonada, J. (2016). Modeling and transforming speech using variational autoencoders. In Interspeech, pages 1770-1774.

[Caetano and Rodet, 2012] Caetano, M. and Rodet, X. (2012). A source-filter model for musical instrument sound transformation. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 137-140. IEEE.

[Doersch, 2016] Doersch, C. (2016). Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.

[Engel et al., 2017] Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D., and Simonyan, K. (2017). Neural audio synthesis of musical notes with WaveNet autoencoders. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1068-1077. JMLR.org.

[Esling et al., 2018] Esling, P., Bitton, A., et al. (2018). Generative timbre spaces: regularizing variational auto-encoders with perceptual metrics. arXiv preprint arXiv:1805.08501.

References II

[Griffin and Lim, 1984] Griffin, D. and Lim, J. (1984). Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236-243.

[Hinton and Salakhutdinov, 2006] Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504-507.

[Imai, 1979] Imai, S. (1979). Spectral envelope extraction by improved cepstrum. IEICE, 62:217-228.

[Kingma and Ba, 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

[Kingma and Welling, 2013] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

[Oord et al., 2016] Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.

References III

[Roche et al., 2018] Roche, F., Hueber, T., Limier, S., and Girin, L. (2018). Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models. arXiv preprint arXiv:1806.04096.

[Roebel and Rodet, 2005] Roebel, A. and Rodet, X. (2005). Efficient spectral envelope estimation and its application to pitch shifting and envelope preservation. In International Conference on Digital Audio Effects, pages 30-35, Madrid, Spain.

[Romani Picas et al., 2015] Romani Picas, O., Parra Rodriguez, H., Dabiri, D., Tokuda, H., Hariya, W., Oishi, K., and Serra, X. (2015). A real-time system for measuring sound goodness in instrumental sounds. In Audio Engineering Society Convention 138. Audio Engineering Society.

[Sarroff and Casey, 2014] Sarroff, A. M. and Casey, M. A. (2014). Musical audio synthesis using autoencoding neural nets. In ICMC.

[Serra et al., 1997] Serra, X. et al. (1997). Musical sound modeling with sinusoids plus noise. Musical Signal Processing, pages 91-122.

References IV

[Slawson, 1981] Slawson, W. (1981). The color of sound: a theoretical study in musical timbre. Music Theory Spectrum, 3:132-141.

[Sohn et al., 2015] Sohn, K., Lee, H., and Yan, X. (2015). Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483-3491.

[Wyse, 2018] Wyse, L. (2018). Real-valued parametric conditioning of an RNN for interactive sound synthesis. arXiv preprint arXiv:1805.10808.

Audio examples description

1. Input MIDI 63 to the Spectral Model
2. Spectral Model reconstruction (trained on MIDI 63)
3. Spectral Model reconstruction (not trained on MIDI 63)
4. Input MIDI 60 note to the Parametric Model
5. Parametric reconstruction of the input note
6. Input MIDI 63 note
7. Parametric AE reconstruction of the input
8. Parametric CVAE reconstruction of the input
9. Input MIDI 65 note (endpoint-trained model)
10. Parametric AE reconstruction of the input (endpoint-trained model)
11. Parametric CVAE reconstruction of the input (endpoint-trained model)
12. CVAE-generated MIDI 65 violin note
13. Similar MIDI 65 violin note from the dataset
14. CVAE reconstruction of the MIDI 65 violin note
15. CVAE-generated MIDI 65 violin note with vibrato

Page 2: A Variational Parametric Model for Audio Synthesiskrishnasubramani/data/ddp_stage… · analysis-synthesis based reconstruction which fails to model temporal evolution I[Engel et

236

Audio Synthesis

I What comes to your mind when you hear lsquoAudio Synthesisrsquo

236

Audio Synthesis

I What comes to your mind when you hear lsquoAudio Synthesisrsquo

Figure One of the early Moog Modular Synthesizers

236

Audio Synthesis

I What comes to your mind when you hear lsquoAudio Synthesisrsquo

I More generally it involves us specifying controlling parametersto a synthesizer to obtain an audio output

I What are the main parameters which govern the audiogeneration(at a high level)

236

Audio Synthesis

I What comes to your mind when you hear lsquoAudio Synthesisrsquo

1 Timbre

2 Pitch

3 Loudness

236

Audio Synthesis

I What comes to your mind when you hear lsquoAudio Synthesisrsquo

I Early analog synthesizers used voltage controlled oscillatorsfilters amplifiers to generate the waveform and lsquoenvelopegeneratorsrsquo to shape it

I Data-driven statistical modeling + computing power=rArr Deep Learning for audio synthesis

236

Audio Synthesis

I What comes to your mind when you hear lsquoAudio Synthesisrsquo

I Early analog synthesizers used voltage controlled oscillatorsfilters amplifiers to generate the waveform and lsquoenvelopegeneratorsrsquo to shape it

I Data-driven statistical modeling + computing power=rArr Deep Learning for audio synthesis

336

Generative Models for Audio Synthesis

I Rely on ability of algorithms to extract musically relevantinformation from vast amounts of data

I Autoregressive modeling Generative Adversarial Networks andVariational Autoencoders are some of the proposed generativemodeling methods

I Most methods in literature try to model audio signals directlyin the time or frequency domain

436

Our Nearest Neighbours

I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra

X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis

times lsquoGraininessrsquo in the reconstructed sound

I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures

X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation

I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments

X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space

436

Our Nearest Neighbours

I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra

X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis

times lsquoGraininessrsquo in the reconstructed sound

I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures

X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation

I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments

X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space

436

Our Nearest Neighbours

I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra

X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis

times lsquoGraininessrsquo in the reconstructed sound

I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures

X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation

I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments

X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space

436

Our Nearest Neighbours

I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra

X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis

times lsquoGraininessrsquo in the reconstructed sound

I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures

X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation

I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments

X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space

436

Our Nearest Neighbours

I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra

X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis

times lsquoGraininessrsquo in the reconstructed sound

I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures

X Helped by the release of NSynth [Engel et al 2017]

X Usability of the latent space in audio interpolation

I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments

X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space

436

Our Nearest Neighbours

I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra

X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis

times lsquoGraininessrsquo in the reconstructed sound

I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures

X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation

I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments

X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space

436

Our Nearest Neighbours

I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra

X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis

times lsquoGraininessrsquo in the reconstructed sound

I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures

X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation

I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments

X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space

436

Our Nearest Neighbours

I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra

X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis

times lsquoGraininessrsquo in the reconstructed sound

I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures

X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation

I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments

X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space

536

Our Nearest Neighbours

I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution

I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis

X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from

I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class

X Was able to achieve better control over generation byconditioning

times Unable to generalize to untrained pitches


6/36

Why Parametric?

▶ Rather than generating new timbres ('interpolating' across sounds), we consider the problem of synthesis of a given instrument sound with flexible control over the pitch and loudness dynamics

▶ Pitch shifting without timbre modification uses a source-filter model, with the filter (spectral envelope) being kept constant [Roebel and Rodet 2005]

▶ A powerful parametric representation over raw waveform or spectrogram has the potential to achieve high quality with less training data

1. [Blaauw and Bonada 2016] recognized this in the context of speech synthesis and used a vocoder representation to train a generative model, achieving promising results along the way


7/36

Dataset

▶ Good-sounds dataset [Romani Picas et al 2015], consisting of individual note and scale recordings for 12 different instruments

▶ We work with the 'Violin', played in mezzo-forte loudness, and choose the 4th octave (MIDI 60-71)

Why we chose the Violin: popular in Indian music, human voice-like timbre, ability to produce continuous pitch

[Figure: Instances per note in the overall dataset (26 to 34 instances per MIDI note, 60-71)]

▶ Average note duration for the chosen octave is about 4.5 s per note


8/36

Dataset

▶ We split the data into train (80%) and test (20%) instances across MIDI note labels

▶ Our model is trained with frames (duration 21.3 ms) from the train instances, and system performance is evaluated with frames from the test instances

9/36

Non-Parametric Reconstruction

▶ The framewise magnitude spectra reconstruction procedure in [Roche et al 2018] cannot generalize to untrained pitches

▶ We consider two cases: train including and excluding MIDI 63, along with its 3 neighbouring pitches on either side

[Block diagram: Sustain segment → FFT → Encoder → Z → Decoder (AE)]

▶ Repeat the above across all input spectral frames and invert the obtained spectrogram using Griffin-Lim [Griffin and Lim 1984]
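A minimal sketch of this resynthesis step, assuming librosa for the STFT and Griffin-Lim; the audio file, frame and hop sizes are illustrative placeholders, not the exact values used here.

# Sketch: invert a frame-wise magnitude spectrogram back to audio with Griffin-Lim.
import numpy as np
import librosa

def reconstruct_from_magnitudes(mag_frames, n_fft=1024, hop_length=256, n_iter=60):
    """mag_frames: (freq_bins x frames) magnitude spectrogram, e.g. stacked decoder outputs."""
    return librosa.griffinlim(mag_frames, n_iter=n_iter,
                              hop_length=hop_length, win_length=n_fft)

# Example round trip: |STFT| of a sustain segment -> Griffin-Lim phase estimation
y, sr = librosa.load(librosa.ex('trumpet'), sr=None)      # placeholder audio
mag = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))
y_hat = reconstruct_from_magnitudes(mag)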


10/36

Non-Parametric Reconstruction

Figure: Input MIDI 63 [audio example 1]
Figure: Including MIDI 63 [audio example 2]    Figure: Excluding MIDI 63 [audio example 3]


11/36

Parametric Model

1. Frame-wise magnitude spectrum → harmonic representation using the Harmonic plus Residual (HpR) model [Serra et al 1997] (currently we neglect the residual)

[Block diagram: Sustain segment → FFT → HpR → f0, harmonic magnitudes]

▶ Output of the HpR block ⇒ log-dB magnitudes + harmonics
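A simplified stand-in for the harmonic part of this analysis (the real HpR model does proper sinusoidal peak picking and residual subtraction); the function name and band width are assumptions for illustration only.

# Sketch: given a frame and its f0, read off log-dB magnitudes near each harmonic.
import numpy as np

def harmonic_log_magnitudes(frame, f0, sr, n_fft=4096):
    """Return (harmonic_freqs, log_dB_mags) for harmonics of f0 up to Nyquist."""
    window = np.hanning(len(frame))
    spectrum = np.abs(np.fft.rfft(frame * window, n=n_fft)) + 1e-12
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    n_harmonics = int((sr / 2) // f0)
    harm_freqs, harm_mags = [], []
    for k in range(1, n_harmonics + 1):
        # take the strongest bin in a band of +/- f0/2 around the k-th harmonic
        band = (freqs > (k - 0.5) * f0) & (freqs < (k + 0.5) * f0)
        idx = np.argmax(spectrum * band)
        harm_freqs.append(freqs[idx])
        harm_mags.append(20.0 * np.log10(spectrum[idx]))    # log-dB magnitude
    return np.array(harm_freqs), np.array(harm_mags)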


12/36

Parametric Model

2. log-dB magnitudes + harmonics → TAE algorithm [Roebel and Rodet 2005, IMAI 1979]

▶ TAE ⇒ Iterative Cepstral Liftering:

A_0(k) = log(|X(k)|),  V_0 = −∞
while (A_i − A_0 < ∆):
    A_i(k) = max(A_{i−1}(k), V_{i−1}(k))
    V_i = FFT(lifter(A_i))

[Block diagram: f0, harmonic magnitudes → TAE → CCs, f0]
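A sketch of this iteration in Python, using the usual true-envelope stopping rule (iterate until the log spectrum no longer exceeds the smoothed envelope by more than ∆); the cepstral order and threshold values are illustrative assumptions.

# Sketch: true-amplitude-envelope style iterative cepstral liftering.
import numpy as np

def true_amplitude_envelope(log_mag, cepstral_order, delta_db=0.1, max_iter=200):
    """log_mag: log-magnitude spectrum on the non-negative-frequency (rfft) grid."""
    A = log_mag.copy()              # A_0(k) = log|X(k)|
    V = np.full_like(A, -np.inf)    # V_0 = -inf
    for _ in range(max_iter):
        A = np.maximum(A, V)                        # A_i(k) = max(A_{i-1}(k), V_{i-1}(k))
        ceps = np.fft.irfft(A)                      # cepstrum of the updated spectrum
        # lifter: keep only the low-quefrency coefficients (and their mirrored tail)
        ceps[cepstral_order + 1: len(ceps) - cepstral_order] = 0.0
        V = np.fft.rfft(ceps).real                  # V_i = FFT(lifter(A_i))
        if np.max(log_mag - V) < delta_db:          # envelope now covers the spectrum
            break
    return V    # cepstrally smoothed spectral envelope on the same frequency grid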


13/36

Parametric Model

▶ No open source implementation was available for the TAE, thus we implemented it following the procedure highlighted in [Roebel and Rodet 2005, Caetano and Rodet 2012]

▶ Figure shows a TAE snapshot from [Caetano and Rodet 2012]

▶ Similar to the results we get [audio examples 4, 5]


14/36

Parametric Model

[Figure: TAE spectral envelopes, Magnitude (dB, −80 to −20) vs Frequency (0-5 kHz), for MIDI 60, 65 and 71]

▶ Spectral envelope shape varies across pitch:

1. Dependence of the envelope on pitch [Slawson 1981, Caetano and Rodet 2012]
2. Variation due to the TAE algorithm

▶ TAE → smooth function to estimate harmonic amplitudes

15/36

Parametric Model

▶ The number of CCs (Cepstral Coefficients, K_cc) depends on the sampling rate (F_s) and pitch (f_0):

K_cc ≤ F_s / (2 f_0)

▶ For our network, we choose the maximum K_cc (lowest pitch) = 91 and zero pad the higher-pitch CC vectors to 91-dimensional vectors
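A small sketch of this bookkeeping; the 48 kHz sampling rate is an assumption (it is consistent with K_cc = 91 at MIDI 60, about 261.6 Hz), and the helper names are illustrative.

# Sketch: cepstral order per pitch and zero padding to the maximum length.
import numpy as np

FS = 48000      # assumed sampling rate
K_MAX = 91      # maximum K_cc (lowest pitch in the octave)

def midi_to_hz(m):
    return 440.0 * 2 ** ((m - 69) / 12)

def kcc_for_pitch(f0, fs=FS):
    return int(fs / (2 * f0))            # K_cc <= Fs / (2 * f0)

def pad_ccs(ccs, k_max=K_MAX):
    out = np.zeros(k_max)
    out[: len(ccs)] = ccs[:k_max]
    return out

for m in (60, 65, 71):
    print(m, kcc_for_pitch(midi_to_hz(m)))   # roughly 91, 68, 48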

16/36

Generative Models

▶ Autoencoders [Hinton and Salakhutdinov 2006] - minimize the MSE (Mean Squared Error) between the input and the network reconstruction

• Simple to train and performs good reconstruction (since they minimize MSE)
• Not truly a generative model, as you cannot generate new data

▶ Variational Autoencoders [Kingma and Welling 2013] - inspired by Variational Inference, enforce a prior on the latent space

• Truly generative, as we can 'generate' new data by sampling from the prior

▶ Conditional Variational Autoencoders [Doersch 2016, Sohn et al 2015] - same principle as a VAE, however learns the conditional distribution over an additional conditioning variable


17/36

Generative Models

Why VAE over AE?
Continuous latent space from which we can sample points (and synthesize the corresponding audio)

Why CVAE over VAE?
Conditioning on pitch ⇒ the network captures dependencies between the timbre and the pitch ⇒ more accurate envelope generation + pitch control


18/36

Network Architecture

[Block diagram: CCs, f0 → Encoder → Z → Decoder (CVAE)]

▶ The network input is CCs → the MSE represents a perceptually relevant distance, in terms of squared error between the input and reconstructed log magnitude spectral envelopes

▶ We train the network on frames from the train instances

▶ For evaluation, the MSE values reported ahead are the average reconstruction error across all the test instance frames

19/36

Network Architecture
▶ Main hyperparameters:

1. β - controls the relative weighting between reconstruction and prior enforcement

L ∝ MSE + β · KLD

[Figure: CVAE, varying β - MSE (up to 8·10⁻⁵) vs latent space dimensionality (0-70), for β = 0.01, 0.1, 1]

▶ Tradeoff between both terms; we choose β = 0.1
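A sketch of this β-weighted objective in PyTorch, assuming the usual diagonal-Gaussian posterior q(z|x) parameterized by (mu, logvar); the function name is illustrative.

# Sketch: L = MSE + beta * KLD, with beta = 0.1 as chosen above.
import torch
import torch.nn.functional as F

def cvae_loss(x_hat, x, mu, logvar, beta=0.1):
    mse = F.mse_loss(x_hat, x, reduction='mean')
    # KL( N(mu, sigma^2) || N(0, I) ), averaged over the batch
    kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return mse + beta * kld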


20/36

Network Architecture

▶ Main hyperparameters:

2. Dimensionality of the latent space - the network's reconstruction ability

[Figure: CVAE (β = 0.1) vs AE - MSE (up to 8·10⁻⁵) vs latent space dimensionality (0-70)]

▶ Steep fall initially, flatter later; we choose dimensionality = 32


21/36

Network Architecture

▶ Network size: [91, 91, 32, 91, 91]
▶ 91 is the dimension of the input CCs, 32 is the latent space dimensionality
▶ Linear fully connected layers + Leaky ReLU activations
▶ ADAM [Kingma and Ba 2014] with initial lr = 10⁻³
▶ Training for 2000 epochs with batch size 512
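A sketch of such a CVAE with the layer sizes above; how f0 is encoded and where it is concatenated (here, a normalized scalar appended to both the encoder input and the latent code) are assumptions, not the exact recipe from these slides.

# Sketch: pitch-conditioned VAE over 91-dimensional CC frames.
import torch
import torch.nn as nn

class CVAE(nn.Module):
    def __init__(self, n_ccs=91, hidden=91, latent=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_ccs + 1, hidden), nn.LeakyReLU())
        self.mu = nn.Linear(hidden, latent)
        self.logvar = nn.Linear(hidden, latent)
        self.dec = nn.Sequential(
            nn.Linear(latent + 1, hidden), nn.LeakyReLU(),
            nn.Linear(hidden, n_ccs),
        )

    def forward(self, ccs, f0):
        # ccs: (batch, 91) cepstral coefficients, f0: (batch,) normalized pitch
        cond = f0.unsqueeze(-1)
        h = self.enc(torch.cat([ccs, cond], dim=-1))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization trick
        x_hat = self.dec(torch.cat([z, cond], dim=-1))
        return x_hat, mu, logvar

model = CVAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)    # ADAM, initial lr = 10^-3

A training step would pair this module with the β-weighted loss sketched earlier, run for 2000 epochs with batch size 512.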

22/36

Experiments

▶ Two kinds of experiments to demonstrate the network's capabilities:

1. Reconstruction - omit pitch instances during training and see how well the model reconstructs notes of the omitted target pitch
2. Generation - how well the model 'synthesizes' note instances with new, unseen pitches


23/36

Reconstruction

▶ Two training contexts:

1. T (×) is the target, ✓ marks training instances:
MIDI:  T-3  T-2  T-1   T   T+1  T+2  T+3
Kept:   ✓    ✓    ✓    ×    ✓    ✓    ✓

2. Octave endpoints:
MIDI:  60  61  62  63  64  65  66  67  68  69  70  71
Kept:   ✓   ×   ×   ×   ×   ×   ×   ×   ×   ×   ×   ✓

▶ In each of the above cases, we compute the MSE as the frame-wise spectral envelope match across all frames of all the target instances


24/36

Reconstruction

[Figure: MSE (up to 5·10⁻²) vs target pitch (MIDI 60-72) for AE and CVAE]

▶ Both AE and CVAE reasonably reconstruct the target pitch [audio examples 6, 7, 8]

▶ CVAE produces better reconstruction, especially when the target pitch is far from the pitches available in the training data (plot above) [audio examples 9, 10, 11]

▶ Conditioning helps to capture the pitch dependency of the spectral envelope more accurately


25/36

Reconstruction

▶ To emulate the effect of pitch conditioning with an AE, we train the AE by appending the pitch to the input CCs and reconstructing this appended input, as shown below

[Block diagram: (f0, CCs) → cAE → (f0′, CCs′)]
Figure: 'Conditional' AE (cAE)

▶ [Wyse 2018] followed a similar approach of appending the conditional variables to the input of his model

▶ The 'cAE' is comparable to our proposed CVAE in that the network might potentially learn something from f0

26/36

Reconstruction

[Figure: MSE (up to 5·10⁻²) vs pitch (MIDI 60-72) for AE, CVAE and cAE]

▶ We train the cAE on the octave endpoints and evaluate performance as shown above

▶ Appending f0 does not improve performance; it is in fact worse than the AE

27/36

Generation

▶ We are interested in 'synthesizing' audio

▶ We see how well the network can generate an instance of a desired pitch (which the network has not been trained on)

[Block diagram: Z, f0 → Decoder]
Figure: Sampling from the Network

▶ We follow the procedure in [Blaauw and Bonada 2016] - a random walk to sample points coherently from the latent space
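A sketch of this sampling idea: small Gaussian steps in the 32-dimensional latent space, decoded frame by frame at the target (untrained) pitch. The step size, frame count and the trained `model` (the CVAE sketched earlier) are assumptions for illustration.

# Sketch: latent random walk, decoded into a trajectory of CC frames.
import torch

@torch.no_grad()
def random_walk_ccs(model, f0_norm, n_frames=200, latent_dim=32, step=0.05):
    z = torch.randn(latent_dim)                       # start from the prior
    cond = torch.tensor([f0_norm])
    frames = []
    for _ in range(n_frames):
        z = z + step * torch.randn(latent_dim)        # small coherent step
        x_hat = model.dec(torch.cat([z, cond]).unsqueeze(0))
        frames.append(x_hat.squeeze(0))               # one frame of cepstral coefficients
    return torch.stack(frames)                        # (n_frames, 91) CC trajectory

Each decoded CC frame would then be converted back to a spectral envelope, sampled at the harmonics of the desired f0 contour (e.g. with vibrato) and additively synthesized.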

28/36

Generation

▶ We train on instances across the entire octave sans MIDI 65, and then generate MIDI 65

▶ On listening, we can see that the generated audio still lacks the soft, noisy sound of the violin bowing [audio examples 12, 13, 14]

▶ For a more realistic synthesis, we generate a violin note with vibrato [audio example 15]


29/36

Putting it all together

▶ We explored autoencoder frameworks in generative models for audio synthesis of instrumental tones

▶ Our parametric representation decouples 'timbre' and 'pitch', thus relying on the network to model the inter-dependencies

▶ Pitch conditioning allows us to generate the learnt spectral envelope for that pitch, thus enabling us to vary the pitch contour continuously


30/36

Putting it all together
▶ Our model is still far from perfect, however, and we have to improve it further:

1. An immediate thing to do will be to model the residual of the input audio. This will help in making the generated audio sound more natural
2. We have not taken dynamics into account. A more complete system will involve capturing spectral envelope (timbral) dependencies on both pitch and loudness dynamics
3. Conducting more formal listening tests, involving synthesis of larger pitch movements or melodic elements from Indian Classical music

▶ Our contributions:

1. To the best of our knowledge, we have not come across any work using a parametric model for musical tones in the neural synthesis framework, especially exploiting the conditioning function of the CVAE
2. The TAE method has no known Python implementation, so we plan to make our code open-source to the MIR community for research


31/36

References I

[Blaauw and Bonada 2016] Blaauw, M. and Bonada, J. (2016). Modeling and transforming speech using variational autoencoders. In Interspeech, pages 1770-1774.

[Caetano and Rodet 2012] Caetano, M. and Rodet, X. (2012). A source-filter model for musical instrument sound transformation. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 137-140. IEEE.

[Doersch 2016] Doersch, C. (2016). Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.

[Engel et al 2017] Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D., and Simonyan, K. (2017). Neural audio synthesis of musical notes with WaveNet autoencoders. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1068-1077. JMLR.org.

[Esling et al 2018] Esling, P., Bitton, A., et al. (2018). Generative timbre spaces: regularizing variational auto-encoders with perceptual metrics. arXiv preprint arXiv:1805.08501.

32/36

References II

[Griffin and Lim 1984] Griffin, D. and Lim, J. (1984). Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236-243.

[Hinton and Salakhutdinov 2006] Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504-507.

[IMAI 1979] Imai, S. (1979). Spectral envelope extraction by improved cepstrum. IEICE, 62:217-228.

[Kingma and Ba 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

[Kingma and Welling 2013] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

[Oord et al 2016] Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.

33/36

References III

[Roche et al 2018] Roche, F., Hueber, T., Limier, S., and Girin, L. (2018). Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models. arXiv preprint arXiv:1806.04096.

[Roebel and Rodet 2005] Roebel, A. and Rodet, X. (2005). Efficient spectral envelope estimation and its application to pitch shifting and envelope preservation. In International Conference on Digital Audio Effects, pages 30-35, Madrid, Spain.

[Romani Picas et al 2015] Romani Picas, O., Parra Rodriguez, H., Dabiri, D., Tokuda, H., Hariya, W., Oishi, K., and Serra, X. (2015). A real-time system for measuring sound goodness in instrumental sounds. In Audio Engineering Society Convention 138. Audio Engineering Society.

[Sarroff and Casey 2014] Sarroff, A. M. and Casey, M. A. (2014). Musical audio synthesis using autoencoding neural nets. In ICMC.

[Serra et al 1997] Serra, X. et al. (1997). Musical sound modeling with sinusoids plus noise. Musical Signal Processing, pages 91-122.

34/36

References IV

[Slawson 1981] Slawson, W. (1981). The color of sound: a theoretical study in musical timbre. Music Theory Spectrum, 3:132-141.

[Sohn et al 2015] Sohn, K., Lee, H., and Yan, X. (2015). Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483-3491.

[Wyse 2018] Wyse, L. (2018). Real-valued parametric conditioning of an RNN for interactive sound synthesis. arXiv preprint arXiv:1805.10808.

35/36

Audio examples description I

1. Input MIDI 63 to Spectral Model
2. Spectral Model Reconstruction (trained on MIDI 63)
3. Spectral Model Reconstruction (not trained on MIDI 63)
4. Input MIDI 60 note to Parametric Model
5. Parametric Reconstruction of input note
6. Input MIDI 63 note
7. Parametric AE reconstruction of input
8. Parametric CVAE reconstruction of input
9. Input MIDI 65 note (endpoint trained model)
10. Parametric AE reconstruction of input (endpoint trained model)
11. Parametric CVAE reconstruction of input (endpoint trained model)
12. CVAE generated MIDI 65 violin note
13. Similar MIDI 65 violin note from dataset

36/36

Audio examples description II

14. CVAE reconstruction of the MIDI 65 violin note
15. CVAE generated MIDI 65 violin note with vibrato

  • Notes
      1. fdrm17
      2. fdrm16
      3. fdrm15
      4. fdrm14
      5. fdrm13
      6. fdrm12
      7. fdrm11
      8. fdrm10
      9. fdrm9
      10. fdrm8
      11. fdrm7
      12. fdrm6
      13. fdrm5
      14. fdrm3
      15. fdrm1
Page 3: A Variational Parametric Model for Audio Synthesiskrishnasubramani/data/ddp_stage… · analysis-synthesis based reconstruction which fails to model temporal evolution I[Engel et

236

Audio Synthesis

I What comes to your mind when you hear lsquoAudio Synthesisrsquo

Figure One of the early Moog Modular Synthesizers

236

Audio Synthesis

I What comes to your mind when you hear lsquoAudio Synthesisrsquo

I More generally it involves us specifying controlling parametersto a synthesizer to obtain an audio output

I What are the main parameters which govern the audiogeneration(at a high level)

236

Audio Synthesis

I What comes to your mind when you hear lsquoAudio Synthesisrsquo

1 Timbre

2 Pitch

3 Loudness

236

Audio Synthesis

I What comes to your mind when you hear lsquoAudio Synthesisrsquo

I Early analog synthesizers used voltage controlled oscillatorsfilters amplifiers to generate the waveform and lsquoenvelopegeneratorsrsquo to shape it

I Data-driven statistical modeling + computing power=rArr Deep Learning for audio synthesis

236

Audio Synthesis

I What comes to your mind when you hear lsquoAudio Synthesisrsquo

I Early analog synthesizers used voltage controlled oscillatorsfilters amplifiers to generate the waveform and lsquoenvelopegeneratorsrsquo to shape it

I Data-driven statistical modeling + computing power=rArr Deep Learning for audio synthesis

336

Generative Models for Audio Synthesis

I Rely on ability of algorithms to extract musically relevantinformation from vast amounts of data

I Autoregressive modeling Generative Adversarial Networks andVariational Autoencoders are some of the proposed generativemodeling methods

I Most methods in literature try to model audio signals directlyin the time or frequency domain

436

Our Nearest Neighbours

I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra

X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis

times lsquoGraininessrsquo in the reconstructed sound

I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures

X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation

I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments

X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space

436

Our Nearest Neighbours

I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra

X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis

times lsquoGraininessrsquo in the reconstructed sound

I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures

X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation

I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments

X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space

436

Our Nearest Neighbours

I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra

X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis

times lsquoGraininessrsquo in the reconstructed sound

I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures

X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation

I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments

X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space

436

Our Nearest Neighbours

I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra

X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis

times lsquoGraininessrsquo in the reconstructed sound

I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures

X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation

I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments

X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space

436

Our Nearest Neighbours

I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra

X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis

times lsquoGraininessrsquo in the reconstructed sound

I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures

X Helped by the release of NSynth [Engel et al 2017]

X Usability of the latent space in audio interpolation

I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments

X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space

436

Our Nearest Neighbours

I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra

X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis

times lsquoGraininessrsquo in the reconstructed sound

I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures

X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation

I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments

X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space

436

Our Nearest Neighbours

I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra

X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis

times lsquoGraininessrsquo in the reconstructed sound

I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures

X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation

I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments

X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space

436

Our Nearest Neighbours

I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra

X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis

times lsquoGraininessrsquo in the reconstructed sound

I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures

X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation

I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments

X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space

536

Our Nearest Neighbours

I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution

I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis

X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from

I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class

X Was able to achieve better control over generation byconditioning

times Unable to generalize to untrained pitches

536

Our Nearest Neighbours

I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution

I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis

X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from

I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class

X Was able to achieve better control over generation byconditioning

times Unable to generalize to untrained pitches

536

Our Nearest Neighbours

I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution

I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis

X Generation of realistic and creative sounds

times Complex network which was difficult to train and sample from

I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class

X Was able to achieve better control over generation byconditioning

times Unable to generalize to untrained pitches

536

Our Nearest Neighbours

I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution

I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis

X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from

I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class

X Was able to achieve better control over generation byconditioning

times Unable to generalize to untrained pitches

536

Our Nearest Neighbours

I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution

I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis

X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from

I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class

X Was able to achieve better control over generation byconditioning

times Unable to generalize to untrained pitches

536

Our Nearest Neighbours

I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution

I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis

X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from

I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class

X Was able to achieve better control over generation byconditioning

times Unable to generalize to untrained pitches

536

Our Nearest Neighbours

I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution

I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis

X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from

I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class

X Was able to achieve better control over generation byconditioning

times Unable to generalize to untrained pitches

636

Why Parametric

I Rather than generating new timbres(lsquointerpolatingrsquo acrosssounds) we consider the problem of synthesis of a giveninstrument sound with flexible control over the pitch andloudness dynamics

I Pitch shifting without timbre modification uses a source-filtermodel with the filter(spectral envelope) being kept constant[Roebel and Rodet 2005]

I A powerful parametric representation over raw waveform orspectrogram has the potential to achieve high quality with lesstraining data

1 [Blaauw and Bonada 2016] recognized this in context ofspeech synthesis and used a vocoder representation to train agenerative model achieving promising results along the way

636

Why Parametric

I Rather than generating new timbres(lsquointerpolatingrsquo acrosssounds) we consider the problem of synthesis of a giveninstrument sound with flexible control over the pitch andloudness dynamics

I Pitch shifting without timbre modification uses a source-filtermodel with the filter(spectral envelope) being kept constant[Roebel and Rodet 2005]

I A powerful parametric representation over raw waveform orspectrogram has the potential to achieve high quality with lesstraining data

1 [Blaauw and Bonada 2016] recognized this in context ofspeech synthesis and used a vocoder representation to train agenerative model achieving promising results along the way

736

Dataset

I Good-sounds dataset [Romani Picas et al 2015] consisting ofindividual note and scale recordings for 12 different instruments

I We work with the lsquoViolinrsquo played in mezzo-forte loudnessand choose the 4th octave(MIDI 60-71)

0 5 10 15 20 25 30 35

606162636465666768697071

282626262626

343030

262626

Instances

MID

I

Figure Instances per note in the overall dataset

I Average note duration for the chosen octave is about 45s pernote

736

Dataset

I Good-sounds dataset [Romani Picas et al 2015] consisting ofindividual note and scale recordings for 12 different instruments

I We work with the lsquoViolinrsquo played in mezzo-forte loudnessand choose the 4th octave(MIDI 60-71)

Why we chose ViolinPopular in Indian Music Human voice-like timbre

Ability to produce continuous pitch

0 5 10 15 20 25 30 35

606162636465666768697071

282626262626

343030

262626

Instances

MID

I

Figure Instances per note in the overall dataset

I Average note duration for the chosen octave is about 45s pernote

836

Dataset

I We split the data to train(80) and test(20) instances acrossMIDI note labels

I Our model is trained with frames(duration 213ms) from thetrain instances and system performance is evaluated withframes from the test instances

936

Non-Parametric Reconstruction

I Framewise magnitude spectra reconstruction procedure in[Roche et al 2018] cannot generalize to untrained pitches

I We consider two cases - Train including and excluding MIDI 63along with its 3 neighbouring pitches on either side

Sustain

FFT

En

cod

er

Z

Dec

od

er

AE

I Repeat above across all input spectral frames and invertobtained spectrogram using Griffin-Lim [Griffin and Lim 1984]

936

Non-Parametric Reconstruction

I Framewise magnitude spectra reconstruction procedure in[Roche et al 2018] cannot generalize to untrained pitches

I We consider two cases - Train including and excluding MIDI 63along with its 3 neighbouring pitches on either side

Sustain

FFT

En

cod

er

Z

Dec

od

er

AE

I Repeat above across all input spectral frames and invertobtained spectrogram using Griffin-Lim [Griffin and Lim 1984]

936

Non-Parametric Reconstruction

I Framewise magnitude spectra reconstruction procedure in[Roche et al 2018] cannot generalize to untrained pitches

I We consider two cases - Train including and excluding MIDI 63along with its 3 neighbouring pitches on either side

Sustain

FFT

En

cod

er

Z

Dec

od

er

AE

I Repeat above across all input spectral frames and invertobtained spectrogram using Griffin-Lim [Griffin and Lim 1984]

1036

Non-Parametric Reconstruction

Figure Input MIDI 63 1 1

Figure Including MIDI 63 2 2 Figure Excluding MIDI 63 3 3

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton1)ocgs[i]state=false

1728

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton3)ocgs[i]state=false

1728

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton5)ocgs[i]state=false

1728

1136

Parametric Model

1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)

Sustain

FFTHpR

f0Magnitudes

I Output of HpR block =rArr log-dB magnitudes + harmonics

1136

Parametric Model

1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)

Sustain

FFTHpR

f0Magnitudes

I Output of HpR block =rArr log-dB magnitudes + harmonics

1136

Parametric Model

1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)

Sustain

FFTHpR

f0Magnitudes

I Output of HpR block =rArr log-dB magnitudes + harmonics

1236

Parametric Model

2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]

I TAE =rArr Iterative Cepstral Liftering

A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))

Vi = FFT (lifter(Ai ))

f0Magnitudes

TAE

CCs

f0

1236

Parametric Model

2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]

I TAE =rArr Iterative Cepstral Liftering

A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))

Vi = FFT (lifter(Ai ))

f0Magnitudes

TAE

CCs

f0

1236

Parametric Model

2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]

I TAE =rArr Iterative Cepstral Liftering

A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))

Vi = FFT (lifter(Ai ))

f0Magnitudes

TAE

CCs

f0

1336

Parametric Model

I No open source implementation available for the TAE thus weimplemented it following procedure highlighted in[Roebel and Rodet 2005 Caetano and Rodet 2012]

I Figure shows a TAE snap from [Caetano and Rodet 2012]

I Similar to results we get 1 4 2 5

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton6)ocgs[i]state=false

3216

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton7)ocgs[i]state=false

2256

1436

Parametric Model

0 1 2 3 4 5minus80

minus60

minus40

minus20

Frequency(kHz)Magnitude(dB)

606571

I Spectral envelope shape varies across pitch

1 Dependence of envelope on pitch[Slawson 1981 Caetano and Rodet 2012]

2 Variation due the TAE algorithm

I TAE rarr smooth function to estimate harmonic amplitudes

1536

Parametric Model

I No of CCs(Cepstral CoefficientsKcc) depends on Samplingrate(Fs)pitch(f0)

Kcc leFs2f0

I For our network we choose the maximum Kcc(lowest pitch) =91 and zero pad the high pitch Kcc to 91 dimension vectors

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable


17/36

Generative Models

Why VAE over AE?

Continuous latent space from which we can sample points (and synthesize the corresponding audio)

Why CVAE over VAE?

Conditioning on pitch ⇒ Network captures dependencies between the timbre and the pitch ⇒ More accurate envelope generation + Pitch control


18/36

Network Architecture

[Block diagram of the CVAE: CCs + f0 → Encoder → Z → Decoder]

I Network input is CCs → MSE represents a perceptually relevant distance, in terms of squared error between the input and reconstructed log-magnitude spectral envelopes

I We train the network on frames from the training instances

I For evaluation, the MSE values reported ahead are the average reconstruction error across all the test instance frames

19/36

Network Architecture

I Main hyperparameters -

1. β - Controls relative weighting between reconstruction and prior enforcement

L ∝ MSE + β·KLD

[Figure: CVAE, varying β - MSE (·10^−5) vs latent space dimensionality for β = 0.01, 0.1, 1]

I Tradeoff between both terms; choose β = 0.1
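A minimal PyTorch sketch of this loss, assuming the standard closed-form Gaussian KL term and a batch-mean reduction (both assumptions; the slide only states L ∝ MSE + β·KLD):

```python
import torch
import torch.nn.functional as F

def cvae_loss(x_hat, x, mu, logvar, beta=0.1):
    """L ∝ MSE + β·KLD for a Gaussian latent with a standard-normal prior."""
    mse = F.mse_loss(x_hat, x, reduction='mean')
    # KL divergence between q(z|x, f0) = N(mu, sigma^2) and the prior N(0, I)
    kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return mse + beta * kld
```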


20/36

Network Architecture

I Main hyperparameters -

2. Dimensionality of latent space - the network's reconstruction ability

[Figure: CVAE (β = 0.1) vs AE - MSE (·10^−5) vs latent space dimensionality]

I Steep fall initially, flatter later. Choose dimensionality = 32


21/36

Network Architecture

I Network size: [91, 91, 32, 91, 91]

I 91 is the dimension of the input CCs, 32 is the latent space dimensionality

I Linear Fully Connected Layers + Leaky ReLU activations

I ADAM [Kingma and Ba, 2014] with initial lr = 10^−3

I Training for 2000 epochs with batch size 512
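A sketch of this architecture in PyTorch is shown below. The layer sizes, Leaky ReLU activations and ADAM settings follow the slide; how f0 is fed in (here as a single normalized scalar concatenated to the encoder input and to the latent code) is an assumption rather than a detail given on the slides.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CVAE(nn.Module):
    """CVAE sketch with layer sizes [91, 91, 32, 91, 91]: linear fully connected
    layers + Leaky ReLU, with the pitch appended to both encoder and decoder inputs."""

    def __init__(self, cc_dim=91, hidden_dim=91, latent_dim=32, cond_dim=1):
        super().__init__()
        # Encoder: (CCs, f0) -> hidden -> (mu, logvar)
        self.enc_hidden = nn.Linear(cc_dim + cond_dim, hidden_dim)
        self.enc_mu = nn.Linear(hidden_dim, latent_dim)
        self.enc_logvar = nn.Linear(hidden_dim, latent_dim)
        # Decoder: (z, f0) -> hidden -> reconstructed CCs
        self.dec_hidden = nn.Linear(latent_dim + cond_dim, hidden_dim)
        self.dec_out = nn.Linear(hidden_dim, cc_dim)

    def encode(self, x, f0):
        h = F.leaky_relu(self.enc_hidden(torch.cat([x, f0], dim=-1)))
        return self.enc_mu(h), self.enc_logvar(h)

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)
        return mu + std * torch.randn_like(std)

    def decode(self, z, f0):
        h = F.leaky_relu(self.dec_hidden(torch.cat([z, f0], dim=-1)))
        return self.dec_out(h)

    def forward(self, x, f0):
        mu, logvar = self.encode(x, f0)
        z = self.reparameterize(mu, logvar)
        return self.decode(z, f0), mu, logvar


# ADAM with initial lr = 1e-3; training would loop for 2000 epochs with batch size 512
model = CVAE()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```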

22/36

Experiments

I Two kinds of experiments to demonstrate the network's capabilities

1. Reconstruction - Omit pitch instances during training and see how well the model reconstructs notes of the omitted target pitch

2. Generation - How well the model 'synthesizes' note instances with new, unseen pitches


23/36

Reconstruction

I Two training contexts -

1. T (✗) is the target, ✓ marks training instances:
   MIDI: T−3  T−2  T−1  T   T+1  T+2  T+3
   Kept: ✓    ✓    ✓    ✗   ✓    ✓    ✓

2. Octave endpoints:
   MIDI: 60  61  62  63  64  65  66  67  68  69  70  71
   Kept: ✓   ✗   ✗   ✗   ✗   ✗   ✗   ✗   ✗   ✗   ✗   ✓

I In each of the above cases, we compute the MSE as the frame-wise spectral envelope match across all frames of all the target instances
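A small illustration of how the two splits can be constructed (the helper names and list representation are assumptions, not the authors' code):

```python
OCTAVE = list(range(60, 72))        # MIDI 60-71, the 4th octave used here

def neighbour_context(target):
    """Context 1: omit the target note T, keep its three neighbours on either side."""
    return [m for m in range(target - 3, target + 4) if m != target and m in OCTAVE]

def endpoint_context():
    """Context 2: keep only the octave endpoints; MIDI 61-70 are the targets."""
    return [60, 71]

# Example: target MIDI 63 -> train on [60, 61, 62, 64, 65, 66]
train_notes = neighbour_context(63)
```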


24/36

Reconstruction

[Figure: average frame-wise MSE (·10^−2) vs target pitch (MIDI 60-72) for AE and CVAE]

I Both AE and CVAE reasonably reconstruct the target pitch (audio examples 6, 7, 8)

I CVAE produces better reconstruction, especially when the target pitch is far from the pitches available in the training data (plot above) (audio examples 9, 10, 11)

I Conditioning helps to capture the pitch dependency of the spectral envelope more accurately


25/36

Reconstruction

I To emulate the effect of pitch conditioning with an AE, we train the AE by appending the pitch to the input CCs and reconstructing this appended input, as shown below

[Block diagram: (CCs, f0) → cAE → (CCs′, f0′)]

Figure: 'Conditional' AE (cAE)

I [Wyse, 2018] followed a similar approach of appending the conditional variables to the input of his model

I The 'cAE' is comparable to our proposed CVAE in that the network might potentially learn something from f0

26/36

Reconstruction

[Figure: average frame-wise MSE (·10^−2) vs target pitch (MIDI 60-72) for AE, CVAE and cAE]

I We train the cAE on the octave endpoints and evaluate performance as shown above

I Appending f0 does not improve performance; it is in fact worse than the AE

27/36

Generation

I We are interested in 'synthesizing' audio

I We see how well the network can generate an instance of a desired pitch (which the network has not been trained on)

[Block diagram: Z + f0 → Decoder]

Figure: Sampling from the Network

I We follow the procedure in [Blaauw and Bonada, 2016] - a random walk to sample points coherently from the latent space

28/36

Generation

I We train on instances across the entire octave sans MIDI 65, and then generate MIDI 65

I On listening, we can hear that the generated audio still lacks the soft noisy sound of the violin bowing (audio examples 12, 13, 14)

I For a more realistic synthesis, we generate a violin note with vibrato (audio example 15)


29/36

Putting it all together

I We explored autoencoder frameworks in generative models for audio synthesis of instrumental tones

I Our parametric representation decouples 'timbre' and 'pitch', thus relying on the network to model the inter-dependencies

I Pitch conditioning allows us to generate the learnt spectral envelope for that pitch, thus enabling us to vary the pitch contour continuously


30/36

Putting it all together

I Our model is still far from perfect, however, and we have to improve it further

1. An immediate thing to do will be to model the residual of the input audio. This will help in making the generated audio sound more natural

2. We have not taken into account dynamics. A more complete system will involve capturing spectral envelope (timbral) dependencies on both pitch and loudness dynamics

3. Conducting more formal listening tests, involving synthesis of larger pitch movements or melodic elements from Indian Classical music

I Our contributions

1. To the best of our knowledge, we have not come across any work using a parametric model for musical tones in the neural synthesis framework, especially exploiting the conditioning function of the CVAE

2. The TAE method has no known Python implementation, so we plan to make our code open-source to the MIR community for research


31/36

References I

[Blaauw and Bonada, 2016] Blaauw, M. and Bonada, J. (2016). Modeling and transforming speech using variational autoencoders. In Interspeech, pages 1770-1774.

[Caetano and Rodet, 2012] Caetano, M. and Rodet, X. (2012). A source-filter model for musical instrument sound transformation. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 137-140. IEEE.

[Doersch, 2016] Doersch, C. (2016). Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.

[Engel et al., 2017] Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D., and Simonyan, K. (2017). Neural audio synthesis of musical notes with wavenet autoencoders. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1068-1077. JMLR.org.

[Esling et al., 2018] Esling, P., Bitton, A., et al. (2018). Generative timbre spaces: regularizing variational auto-encoders with perceptual metrics. arXiv preprint arXiv:1805.08501.

32/36

References II

[Griffin and Lim, 1984] Griffin, D. and Lim, J. (1984). Signal estimation from modified short-time fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236-243.

[Hinton and Salakhutdinov, 2006] Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504-507.

[IMAI, 1979] IMAI, S. (1979). Spectral envelope extraction by improved cepstrum. IEICE, 62:217-228.

[Kingma and Ba, 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

[Kingma and Welling, 2013] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.

[Oord et al., 2016] Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. (2016). Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.

33/36

References III

[Roche et al., 2018] Roche, F., Hueber, T., Limier, S., and Girin, L. (2018). Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models. arXiv preprint arXiv:1806.04096.

[Roebel and Rodet, 2005] Roebel, A. and Rodet, X. (2005). Efficient Spectral Envelope Estimation and its application to pitch shifting and envelope preservation. In International Conference on Digital Audio Effects, pages 30-35, Madrid, Spain.

[Romani Picas et al., 2015] Romani Picas, O., Parra Rodriguez, H., Dabiri, D., Tokuda, H., Hariya, W., Oishi, K., and Serra, X. (2015). A real-time system for measuring sound goodness in instrumental sounds. In Audio Engineering Society Convention 138. Audio Engineering Society.

[Sarroff and Casey, 2014] Sarroff, A. M. and Casey, M. A. (2014). Musical audio synthesis using autoencoding neural nets. In ICMC.

[Serra et al., 1997] Serra, X. et al. (1997). Musical sound modeling with sinusoids plus noise. Musical signal processing, pages 91-122.

34/36

References IV

[Slawson, 1981] Slawson, W. (1981). The color of sound: a theoretical study in musical timbre. Music Theory Spectrum, 3:132-141.

[Sohn et al., 2015] Sohn, K., Lee, H., and Yan, X. (2015). Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483-3491.

[Wyse, 2018] Wyse, L. (2018). Real-valued parametric conditioning of an rnn for interactive sound synthesis. arXiv preprint arXiv:1805.10808.

35/36

Audio examples description I

1. Input MIDI 63 to Spectral Model

2. Spectral Model Reconstruction (trained on MIDI 63)

3. Spectral Model Reconstruction (not trained on MIDI 63)

4. Input MIDI 60 note to Parametric Model

5. Parametric Reconstruction of input note

6. Input MIDI 63 Note

7. Parametric AE reconstruction of input

8. Parametric CVAE reconstruction of input

9. Input MIDI 65 note (endpoint trained model)

10. Parametric AE reconstruction of input (endpoint trained model)

11. Parametric CVAE reconstruction of input (endpoint trained model)

12. CVAE Generated MIDI 65 Violin note

13. Similar MIDI 65 Violin note from dataset

36/36

Audio examples description II

14. CVAE Reconstruction of the MIDI 65 violin note

15. CVAE Generated MIDI 65 Violin note with vibrato

Page 4: A Variational Parametric Model for Audio Synthesiskrishnasubramani/data/ddp_stage… · analysis-synthesis based reconstruction which fails to model temporal evolution I[Engel et

236

Audio Synthesis

I What comes to your mind when you hear lsquoAudio Synthesisrsquo

I More generally it involves us specifying controlling parametersto a synthesizer to obtain an audio output

I What are the main parameters which govern the audiogeneration(at a high level)

236

Audio Synthesis

I What comes to your mind when you hear lsquoAudio Synthesisrsquo

1 Timbre

2 Pitch

3 Loudness

236

Audio Synthesis

I What comes to your mind when you hear lsquoAudio Synthesisrsquo

I Early analog synthesizers used voltage controlled oscillatorsfilters amplifiers to generate the waveform and lsquoenvelopegeneratorsrsquo to shape it

I Data-driven statistical modeling + computing power=rArr Deep Learning for audio synthesis

236

Audio Synthesis

I What comes to your mind when you hear lsquoAudio Synthesisrsquo

I Early analog synthesizers used voltage controlled oscillatorsfilters amplifiers to generate the waveform and lsquoenvelopegeneratorsrsquo to shape it

I Data-driven statistical modeling + computing power=rArr Deep Learning for audio synthesis

336

Generative Models for Audio Synthesis

I Rely on ability of algorithms to extract musically relevantinformation from vast amounts of data

I Autoregressive modeling Generative Adversarial Networks andVariational Autoencoders are some of the proposed generativemodeling methods

I Most methods in literature try to model audio signals directlyin the time or frequency domain

436

Our Nearest Neighbours

I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra

X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis

times lsquoGraininessrsquo in the reconstructed sound

I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures

X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation

I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments

X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space

436

Our Nearest Neighbours

I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra

X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis

times lsquoGraininessrsquo in the reconstructed sound

I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures

X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation

I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments

X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space

436

Our Nearest Neighbours

I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra

X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis

times lsquoGraininessrsquo in the reconstructed sound

I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures

X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation

I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments

X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space

436

Our Nearest Neighbours

I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra

X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis

times lsquoGraininessrsquo in the reconstructed sound

I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures

X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation

I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments

X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space

436

Our Nearest Neighbours

I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra

X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis

times lsquoGraininessrsquo in the reconstructed sound

I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures

X Helped by the release of NSynth [Engel et al 2017]

X Usability of the latent space in audio interpolation

I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments

X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space

436

Our Nearest Neighbours

I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra

X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis

times lsquoGraininessrsquo in the reconstructed sound

I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures

X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation

I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments

X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space

436

Our Nearest Neighbours

I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra

X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis

times lsquoGraininessrsquo in the reconstructed sound

I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures

X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation

I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments

X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space

436

Our Nearest Neighbours

I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra

X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis

times lsquoGraininessrsquo in the reconstructed sound

I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures

X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation

I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments

X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space

536

Our Nearest Neighbours

I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution

I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis

X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from

I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class

X Was able to achieve better control over generation byconditioning

times Unable to generalize to untrained pitches

536

Our Nearest Neighbours

I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution

I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis

X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from

I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class

X Was able to achieve better control over generation byconditioning

times Unable to generalize to untrained pitches

536

Our Nearest Neighbours

I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution

I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis

X Generation of realistic and creative sounds

times Complex network which was difficult to train and sample from

I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class

X Was able to achieve better control over generation byconditioning

times Unable to generalize to untrained pitches

536

Our Nearest Neighbours

I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution

I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis

X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from

I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class

X Was able to achieve better control over generation byconditioning

times Unable to generalize to untrained pitches

536

Our Nearest Neighbours

I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution

I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis

X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from

I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class

X Was able to achieve better control over generation byconditioning

times Unable to generalize to untrained pitches

536

Our Nearest Neighbours

I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution

I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis

X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from

I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class

X Was able to achieve better control over generation byconditioning

times Unable to generalize to untrained pitches

536

Our Nearest Neighbours

I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution

I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis

X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from

I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class

X Was able to achieve better control over generation byconditioning

times Unable to generalize to untrained pitches

636

Why Parametric

I Rather than generating new timbres(lsquointerpolatingrsquo acrosssounds) we consider the problem of synthesis of a giveninstrument sound with flexible control over the pitch andloudness dynamics

I Pitch shifting without timbre modification uses a source-filtermodel with the filter(spectral envelope) being kept constant[Roebel and Rodet 2005]

I A powerful parametric representation over raw waveform orspectrogram has the potential to achieve high quality with lesstraining data

1 [Blaauw and Bonada 2016] recognized this in context ofspeech synthesis and used a vocoder representation to train agenerative model achieving promising results along the way

636

Why Parametric

I Rather than generating new timbres(lsquointerpolatingrsquo acrosssounds) we consider the problem of synthesis of a giveninstrument sound with flexible control over the pitch andloudness dynamics

I Pitch shifting without timbre modification uses a source-filtermodel with the filter(spectral envelope) being kept constant[Roebel and Rodet 2005]

I A powerful parametric representation over raw waveform orspectrogram has the potential to achieve high quality with lesstraining data

1 [Blaauw and Bonada 2016] recognized this in context ofspeech synthesis and used a vocoder representation to train agenerative model achieving promising results along the way

736

Dataset

I Good-sounds dataset [Romani Picas et al 2015] consisting ofindividual note and scale recordings for 12 different instruments

I We work with the lsquoViolinrsquo played in mezzo-forte loudnessand choose the 4th octave(MIDI 60-71)

0 5 10 15 20 25 30 35

606162636465666768697071

282626262626

343030

262626

Instances

MID

I

Figure Instances per note in the overall dataset

I Average note duration for the chosen octave is about 45s pernote

736

Dataset

I Good-sounds dataset [Romani Picas et al 2015] consisting ofindividual note and scale recordings for 12 different instruments

I We work with the lsquoViolinrsquo played in mezzo-forte loudnessand choose the 4th octave(MIDI 60-71)

Why we chose ViolinPopular in Indian Music Human voice-like timbre

Ability to produce continuous pitch

0 5 10 15 20 25 30 35

606162636465666768697071

282626262626

343030

262626

Instances

MID

I

Figure Instances per note in the overall dataset

I Average note duration for the chosen octave is about 45s pernote

836

Dataset

I We split the data to train(80) and test(20) instances acrossMIDI note labels

I Our model is trained with frames(duration 213ms) from thetrain instances and system performance is evaluated withframes from the test instances

936

Non-Parametric Reconstruction

I Framewise magnitude spectra reconstruction procedure in[Roche et al 2018] cannot generalize to untrained pitches

I We consider two cases - Train including and excluding MIDI 63along with its 3 neighbouring pitches on either side

Sustain

FFT

En

cod

er

Z

Dec

od

er

AE

I Repeat above across all input spectral frames and invertobtained spectrogram using Griffin-Lim [Griffin and Lim 1984]

936

Non-Parametric Reconstruction

I Framewise magnitude spectra reconstruction procedure in[Roche et al 2018] cannot generalize to untrained pitches

I We consider two cases - Train including and excluding MIDI 63along with its 3 neighbouring pitches on either side

Sustain

FFT

En

cod

er

Z

Dec

od

er

AE

I Repeat above across all input spectral frames and invertobtained spectrogram using Griffin-Lim [Griffin and Lim 1984]

936

Non-Parametric Reconstruction

I Framewise magnitude spectra reconstruction procedure in[Roche et al 2018] cannot generalize to untrained pitches

I We consider two cases - Train including and excluding MIDI 63along with its 3 neighbouring pitches on either side

Sustain

FFT

En

cod

er

Z

Dec

od

er

AE

I Repeat above across all input spectral frames and invertobtained spectrogram using Griffin-Lim [Griffin and Lim 1984]

1036

Non-Parametric Reconstruction

Figure Input MIDI 63 1 1

Figure Including MIDI 63 2 2 Figure Excluding MIDI 63 3 3

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton1)ocgs[i]state=false

1728

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton3)ocgs[i]state=false

1728

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton5)ocgs[i]state=false

1728

1136

Parametric Model

1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)

Sustain

FFTHpR

f0Magnitudes

I Output of HpR block =rArr log-dB magnitudes + harmonics

1136

Parametric Model

1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)

Sustain

FFTHpR

f0Magnitudes

I Output of HpR block =rArr log-dB magnitudes + harmonics

1136

Parametric Model

1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)

Sustain

FFTHpR

f0Magnitudes

I Output of HpR block =rArr log-dB magnitudes + harmonics

1236

Parametric Model

2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]

I TAE =rArr Iterative Cepstral Liftering

A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))

Vi = FFT (lifter(Ai ))

f0Magnitudes

TAE

CCs

f0

1236

Parametric Model

2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]

I TAE =rArr Iterative Cepstral Liftering

A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))

Vi = FFT (lifter(Ai ))

f0Magnitudes

TAE

CCs

f0

1236

Parametric Model

2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]

I TAE =rArr Iterative Cepstral Liftering

A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))

Vi = FFT (lifter(Ai ))

f0Magnitudes

TAE

CCs

f0

1336

Parametric Model

I No open source implementation available for the TAE thus weimplemented it following procedure highlighted in[Roebel and Rodet 2005 Caetano and Rodet 2012]

I Figure shows a TAE snap from [Caetano and Rodet 2012]

I Similar to results we get 1 4 2 5

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton6)ocgs[i]state=false

3216

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton7)ocgs[i]state=false

2256

1436

Parametric Model

0 1 2 3 4 5minus80

minus60

minus40

minus20

Frequency(kHz)Magnitude(dB)

606571

I Spectral envelope shape varies across pitch

1 Dependence of envelope on pitch[Slawson 1981 Caetano and Rodet 2012]

2 Variation due the TAE algorithm

I TAE rarr smooth function to estimate harmonic amplitudes

1536

Parametric Model

I No of CCs(Cepstral CoefficientsKcc) depends on Samplingrate(Fs)pitch(f0)

Kcc leFs2f0

I For our network we choose the maximum Kcc(lowest pitch) =91 and zero pad the high pitch Kcc to 91 dimension vectors

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1736

Generative Models

Why VAE over AE

Continuous latent space from which we can sample points(andsynthesize the corresponding audio)

Why CVAE over VAE

Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control

1736

Generative Models

Why VAE over AE

Continuous latent space from which we can sample points(andsynthesize the corresponding audio)

Why CVAE over VAE

Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control

1736

Generative Models

Why VAE over AE

Continuous latent space from which we can sample points(andsynthesize the corresponding audio)

Why CVAE over VAE

Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control

1736

Generative Models

Why VAE over AE

Continuous latent space from which we can sample points(andsynthesize the corresponding audio)

Why CVAE over VAE

Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control

1836

Network Architecture

CCs

f0

En

cod

er

Z

Dec

od

er

CVAE

I Network input is CCs rarr MSE represents perceptually relevantdistance in terms of squared error between the input andreconstructed log magnitude spectral envelopes

I We train the network on frames from the train instances

I For evaluation MSE values calculated ahead is the averagereconstruction error across all the test instance frames

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

L prop MSE + βKLD

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

2036

Network Architecture

I Main hyperparameters -

2 Dimensionality of latent space - networks reconstruction ability

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

CVAEAE

Figure CVAE(β = 01) vs AE

I Steep fall initially flatter later Choose dimensionality = 32

2036

Network Architecture

I Main hyperparameters -

2 Dimensionality of latent space - networks reconstruction ability

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

CVAEAE

Figure CVAE(β = 01) vs AE

I Steep fall initially flatter later Choose dimensionality = 32

2036

Network Architecture

I Main hyperparameters -

2 Dimensionality of latent space - networks reconstruction ability

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

CVAEAE

Figure CVAE(β = 01) vs AE

I Steep fall initially flatter later Choose dimensionality = 32

2136

Network Architecture

I Network size [91 91 32 91 91]I 91 is the dimension of input CCs 32 is latent space

dimensionality

I Linear Fully Connected Layers + Leaky ReLU activations

I ADAM [Kingma and Ba 2014] with initial lr = 10minus3

I Training for 2000 epochs with batch size 512

2236

Experiments

I Two kinds of experiments to demonstrate networks capabilities

1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch

2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches

2236

Experiments

I Two kinds of experiments to demonstrate networks capabilities

1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch

2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches

2236

Experiments

I Two kinds of experiments to demonstrate networks capabilities

1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch

2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2436

Reconstruction

60 62 64 66 68 70 72

1

2

3

4

5middot10minus2

Pitch

MSE

AECVAE

I Both AE and CVAE reasonably reconstruct the target pitch1 6 2 7 3 8

I CVAE produces better reconstruction especially when thetarget pitch is far from the pitches available in the trainingdata(plot above) 4 9 5 10 6 11

I Conditioning helps to capture the pitch dependency of thespectral envelope more accurately

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton8)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton9)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton10)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton11)ocgs[i]state=false

1536

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton12)ocgs[i]state=false

1536

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton13)ocgs[i]state=false

1536

2536

Reconstruction

I To emulate the effect of pitch conditioning with an AE wetrain the AE by appending the pitch to the input CCs andreconstructing this appended input as shown below

f0

CCscAE

fprime

0

CCsprime

Figure lsquoConditionalrsquo AE(cAE)

I [Wyse 2018] followed a similar approach of appending theconditional variables to the input of his model

I the lsquocAErsquo is comparable to our proposed CVAE in that thenetwork might potentially learn something from f0

2636

Reconstruction

60 62 64 66 68 70 72

1

2

3

4

5middot10minus2

Pitch

MSE

AECVAEcAE

I We train the cAE on the octave endpoints and evaluateperformance as shown above

I Appending f0 does not improve performance It is infact worsethan AE

2736

Generation

I We are interested in lsquosynthesizingrsquo audio

I We see how well the network can generate an instance of adesired pitch(which the network has not been trained on)

Z

Dec

od

er

f0

Figure Sampling from the Network

I We follow the procedure in [Blaauw and Bonada 2016] - arandom walk to sample points coherently from the latent space

2836

Generation

I We train on instances across the entire octave sans MIDI 65and then generate MIDI 65

I On listening we can see that the generated audio still lacks thesoft noisy sound of the violin bowing

1 12 2 13 3 14

I For a more realistic synthesis we generate a violin note withvibrato 4 15


Putting it all together

I We explored autoencoder frameworks as generative models for audio synthesis of instrumental tones.

I Our parametric representation decouples 'timbre' and 'pitch', relying on the network to model the inter-dependencies.

I Pitch conditioning allows us to generate the learnt spectral envelope for that pitch, thus enabling us to vary the pitch contour continuously.



Putting it all together

I Our model is still far from perfect, however, and we have to improve it further:

1. An immediate step is to model the residual of the input audio. This will help make the generated audio sound more natural.

2. We have not taken dynamics into account. A more complete system will involve capturing spectral envelope (timbral) dependencies on both pitch and loudness dynamics.

3. We should conduct more formal listening tests, involving synthesis of larger pitch movements or melodic elements from Indian Classical music.

I Our contributions:

1. To the best of our knowledge, we have not come across any work using a parametric model for musical tones in the neural synthesis framework, especially one exploiting the conditioning function of the CVAE.

2. The TAE method has no known Python implementation, so we plan to make our code open-source to the MIR community for research.


References I

[Blaauw and Bonada 2016] Blaauw, M. and Bonada, J. (2016). Modeling and transforming speech using variational autoencoders. In Interspeech, pages 1770-1774.

[Caetano and Rodet 2012] Caetano, M. and Rodet, X. (2012). A source-filter model for musical instrument sound transformation. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 137-140. IEEE.

[Doersch 2016] Doersch, C. (2016). Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.

[Engel et al. 2017] Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D., and Simonyan, K. (2017). Neural audio synthesis of musical notes with WaveNet autoencoders. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1068-1077. JMLR.org.

[Esling et al. 2018] Esling, P., Bitton, A., et al. (2018). Generative timbre spaces: regularizing variational auto-encoders with perceptual metrics. arXiv preprint arXiv:1805.08501.


References II

[Griffin and Lim 1984] Griffin, D. and Lim, J. (1984). Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236-243.

[Hinton and Salakhutdinov 2006] Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504-507.

[IMAI 1979] Imai, S. (1979). Spectral envelope extraction by improved cepstrum. IEICE, 62:217-228.

[Kingma and Ba 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

[Kingma and Welling 2013] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

[Oord et al. 2016] Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.


References III

[Roche et al. 2018] Roche, F., Hueber, T., Limier, S., and Girin, L. (2018). Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models. arXiv preprint arXiv:1806.04096.

[Roebel and Rodet 2005] Roebel, A. and Rodet, X. (2005). Efficient spectral envelope estimation and its application to pitch shifting and envelope preservation. In International Conference on Digital Audio Effects, pages 30-35, Madrid, Spain.

[Romani Picas et al. 2015] Romani Picas, O., Parra Rodriguez, H., Dabiri, D., Tokuda, H., Hariya, W., Oishi, K., and Serra, X. (2015). A real-time system for measuring sound goodness in instrumental sounds. In Audio Engineering Society Convention 138. Audio Engineering Society.

[Sarroff and Casey 2014] Sarroff, A. M. and Casey, M. A. (2014). Musical audio synthesis using autoencoding neural nets. In ICMC.

[Serra et al. 1997] Serra, X. et al. (1997). Musical sound modeling with sinusoids plus noise. Musical Signal Processing, pages 91-122.


References IV

[Slawson 1981] Slawson, W. (1981). The color of sound: a theoretical study in musical timbre. Music Theory Spectrum, 3:132-141.

[Sohn et al. 2015] Sohn, K., Lee, H., and Yan, X. (2015). Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483-3491.

[Wyse 2018] Wyse, L. (2018). Real-valued parametric conditioning of an RNN for interactive sound synthesis. arXiv preprint arXiv:1805.10808.


Audio examples description I

1. Input MIDI 63 to Spectral Model
2. Spectral Model Reconstruction (trained on MIDI 63)
3. Spectral Model Reconstruction (not trained on MIDI 63)
4. Input MIDI 60 note to Parametric Model
5. Parametric Reconstruction of input note
6. Input MIDI 63 Note
7. Parametric AE reconstruction of input
8. Parametric CVAE reconstruction of input
9. Input MIDI 65 note (endpoint-trained model)
10. Parametric AE reconstruction of input (endpoint-trained model)
11. Parametric CVAE reconstruction of input (endpoint-trained model)
12. CVAE Generated MIDI 65 Violin note
13. Similar MIDI 65 Violin note from dataset


Audio examples description II

14. CVAE Reconstruction of the MIDI 65 violin note
15. CVAE Generated MIDI 65 Violin note with vibrato


Our Nearest Neighbours

▶ A major issue with the previous methods was frame-wise analysis-synthesis based reconstruction, which fails to model temporal evolution.

▶ [Engel et al 2017], inspired by WaveNet's [Oord et al 2016] autoregressive modeling capabilities for speech, extended it to musical instrument synthesis.

  ✓ Generation of realistic and creative sounds
  ✗ Complex network which was difficult to train and sample from

▶ [Wyse 2018] also modelled the audio autoregressively, albeit conditioning the waveform samples on additional parameters like pitch, velocity (loudness) and instrument class.

  ✓ Better control over generation through conditioning
  ✗ Unable to generalize to untrained pitches


Why Parametric?

▶ Rather than generating new timbres ('interpolating' across sounds), we consider the problem of synthesizing a given instrument's sound with flexible control over the pitch and loudness dynamics.

▶ Pitch shifting without timbre modification uses a source-filter model, with the filter (spectral envelope) kept constant [Roebel and Rodet 2005].

▶ A powerful parametric representation, over a raw waveform or spectrogram, has the potential to achieve high quality with less training data.

  1. [Blaauw and Bonada 2016] recognized this in the context of speech synthesis and used a vocoder representation to train a generative model, achieving promising results along the way.


Dataset

▶ Good-sounds dataset [Romani Picas et al 2015], consisting of individual note and scale recordings for 12 different instruments.

▶ We work with the 'Violin', played in mezzo-forte loudness, and choose the 4th octave (MIDI 60-71).

▶ Why we chose the Violin: popular in Indian music, human voice-like timbre, and the ability to produce continuous pitch.

Figure: Instances per note in the overall dataset (between 26 and 34 instances per MIDI note).

▶ Average note duration for the chosen octave is about 4.5 s per note.


Dataset

▶ We split the data into train (80%) and test (20%) instances across MIDI note labels.

▶ Our model is trained with frames (duration 21.3 ms) from the train instances, and system performance is evaluated with frames from the test instances (a small splitting/framing sketch follows below).
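The split and framing step can be sketched as below; the Good-sounds loading details, the sampling rate and the hop size here are our assumptions, not the authors' code.

    import numpy as np

    def split_instances(note_files, train_frac=0.8, seed=0):
        """Split the recordings of one MIDI note into train/test instances (80/20)."""
        rng = np.random.default_rng(seed)
        files = rng.permutation(list(note_files))
        n_train = int(train_frac * len(files))
        return files[:n_train], files[n_train:]

    def frame_signal(y, sr=48000, frame_dur=0.0213, hop_frac=0.5):
        """Cut a note recording into short (~21.3 ms) analysis frames."""
        frame_len = int(frame_dur * sr)
        hop = int(frame_len * hop_frac)
        n_frames = 1 + max(0, (len(y) - frame_len) // hop)
        return np.stack([y[i * hop:i * hop + frame_len] for i in range(n_frames)])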

Non-Parametric Reconstruction

▶ The frame-wise magnitude spectra reconstruction procedure in [Roche et al 2018] cannot generalize to untrained pitches.

▶ We consider two cases - training including and excluding MIDI 63, along with its 3 neighbouring pitches on either side.

[Block diagram: sustain portion → FFT → Encoder → Z → Decoder (AE)]

▶ Repeat the above across all input spectral frames and invert the obtained spectrogram using Griffin-Lim [Griffin and Lim 1984] (see the sketch below).
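A minimal sketch of this frame-wise pipeline, where `ae` is a placeholder for a trained frame-wise autoencoder and the STFT parameters are assumptions rather than the authors' settings:

    import numpy as np
    import librosa

    def reconstruct_nonparametric(y, ae, n_fft=1024, hop=256):
        """Frame-wise magnitude reconstruction followed by Griffin-Lim phase recovery."""
        mag = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))    # (freq bins, frames)
        rec = np.stack([ae(frame) for frame in mag.T], axis=1)        # reconstruct each frame
        return librosa.griffinlim(rec, hop_length=hop, n_iter=60)     # invert the magnitude spectrogram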


Non-Parametric Reconstruction

Figure: Input MIDI 63 (audio example 1); reconstruction with MIDI 63 included in training (audio example 2); reconstruction with MIDI 63 excluded from training (audio example 3).

Parametric Model

1. Frame-wise magnitude spectrum → harmonic representation using the Harmonic plus Residual (HpR) model [Serra et al 1997] (currently we neglect the residual).

[Block diagram: sustain portion → FFT → HpR → (f0, harmonic magnitudes)]

▶ Output of the HpR block ⇒ log-dB magnitudes + harmonics (a rough extraction sketch follows below)
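A rough sketch of extracting f0 and log-dB harmonic magnitudes from one analysis frame; the nearest-bin peak lookup is a simplification of a full HpR analysis, and the pitch range, harmonic count and sampling rate are assumptions:

    import numpy as np
    import librosa

    def harmonic_frame(frame, sr=48000, fmin=196.0, fmax=1000.0, n_harm=40):
        """Estimate f0 and log-dB harmonic magnitudes for one windowed frame."""
        f0 = float(librosa.yin(frame, fmin=fmin, fmax=fmax, sr=sr,
                               frame_length=len(frame)).mean())
        spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
        freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)
        harm_freqs = f0 * np.arange(1, n_harm + 1)
        harm_freqs = harm_freqs[harm_freqs < sr / 2]
        bins = np.array([np.argmin(np.abs(freqs - f)) for f in harm_freqs])
        mags_db = 20.0 * np.log10(spec[bins] + 1e-12)    # log-dB harmonic magnitudes
        return f0, mags_db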


Parametric Model

2. log-dB magnitudes + harmonics → TAE algorithm [Roebel and Rodet 2005, IMAI 1979]

▶ TAE ⇒ iterative cepstral liftering (a NumPy sketch follows below):

    A_0(k) = log(|X(k)|),  V_0 = -∞
    while (A_i - A_0 < Δ):
        A_i(k) = max(A_{i-1}(k), V_{i-1}(k))
        V_i = FFT(lifter(A_i))

[Block diagram: (f0, magnitudes) → TAE → (CCs, f0)]
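A minimal NumPy sketch of this iteration; the convergence test and iteration cap are our assumptions, and the efficiency refinements of [Roebel and Rodet 2005] are not shown:

    import numpy as np

    def true_amplitude_envelope(log_mag, n_cc, delta=0.1, max_iter=200):
        """Iterative cepstral liftering (TAE) on one log-magnitude frame.

        log_mag : log-magnitude spectrum A_0 (length n_fft//2 + 1)
        n_cc    : number of cepstral coefficients kept by the lifter (K_cc)
        """
        A0 = np.asarray(log_mag, dtype=float).copy()
        A = A0.copy()
        V = np.full_like(A, -np.inf)
        c = None
        for _ in range(max_iter):
            A = np.maximum(A, V)              # clip the spectrum to lie above the current envelope
            c = np.fft.irfft(A)               # real cepstrum of the clipped spectrum
            n = len(c)
            c[n_cc:n - n_cc + 1] = 0.0        # lifter: keep only the low-quefrency coefficients
            V = np.fft.rfft(c).real           # smoothed envelope estimate V_i
            if np.max(A0 - V) < delta:        # assumed stop rule: envelope within delta of every peak
                break
        return V, c[:n_cc]                    # envelope and the cepstral coefficients (CCs)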


Parametric Model

▶ No open-source implementation is available for the TAE, so we implemented it following the procedure highlighted in [Roebel and Rodet 2005, Caetano and Rodet 2012].

▶ The figure shows a TAE example from [Caetano and Rodet 2012].

▶ It is similar to the results we get (audio examples 4 and 5).

Parametric Model

Figure: TAE spectral envelopes (magnitude in dB, -80 to -20, vs. frequency, 0-5 kHz) for MIDI 60, 65 and 71.

▶ The spectral envelope shape varies across pitch:

  1. Dependence of the envelope on pitch [Slawson 1981, Caetano and Rodet 2012]
  2. Variation due to the TAE algorithm

▶ TAE → a smooth function to estimate harmonic amplitudes

Parametric Model

▶ The number of CCs (cepstral coefficients, K_cc) depends on the sampling rate (F_s) and the pitch (f_0):

    K_cc ≤ F_s / (2 f_0)

▶ For our network, we choose the maximum K_cc (lowest pitch) = 91 and zero-pad the higher-pitch CCs to 91-dimensional vectors (see the helpers below).
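Two small illustrative helpers; the 48 kHz sampling rate in the example is an assumption, chosen only because it is consistent with K_cc = 91 at the lowest pitch (MIDI 60):

    import numpy as np

    def num_cepstral_coeffs(fs, f0):
        """K_cc <= F_s / (2 * f_0)."""
        return int(fs // (2.0 * f0))

    def pad_ccs(ccs, target_dim=91):
        """Zero-pad a CC vector to the fixed 91-dimensional network input."""
        ccs = np.asarray(ccs)
        return np.pad(ccs, (0, target_dim - len(ccs)))

    # num_cepstral_coeffs(48000, 261.63) -> 91 for MIDI 60 (assumed Fs = 48 kHz)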

Generative Models

▶ Autoencoders [Hinton and Salakhutdinov 2006] - minimize the MSE (mean squared error) between the input and the network reconstruction.

  • Simple to train, and perform good reconstruction (since they minimize the MSE)
  • Not truly a generative model, as you cannot generate new data

▶ Variational Autoencoders [Kingma and Welling 2013] - inspired by variational inference, enforce a prior on the latent space.

  • Truly generative, as we can 'generate' new data by sampling from the prior

▶ Conditional Variational Autoencoders [Doersch 2016, Sohn et al 2015] - same principle as a VAE, but learn the conditional distribution over an additional conditioning variable (the standard objectives are written out below).
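For reference, the standard training objectives from the cited papers, in notation assumed here (x = input frame, z = latent variable, c = conditioning variable, later the pitch f0):

    L_VAE(x)    = E_{q(z|x)}  [ log p(x|z) ]   - KL( q(z|x)   || p(z) )
    L_CVAE(x,c) = E_{q(z|x,c)}[ log p(x|z,c) ] - KL( q(z|x,c) || p(z|c) )

with the prior p(z), or p(z|c) in the conditional case, usually taken as a standard Gaussian.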


Generative Models

Why VAE over AE?
A continuous latent space from which we can sample points (and synthesize the corresponding audio).

Why CVAE over VAE?
Conditioning on pitch ⇒ the network captures dependencies between the timbre and the pitch ⇒ more accurate envelope generation + pitch control.


Network Architecture

[Block diagram: (CCs, f0) → Encoder → Z → Decoder (CVAE)]

▶ The network input is the CCs → the MSE represents a perceptually relevant distance, in terms of the squared error between the input and reconstructed log-magnitude spectral envelopes.

▶ We train the network on frames from the train instances.

▶ For evaluation, the MSE values reported ahead are the average reconstruction error across all the test instance frames.

Network Architecture

▶ Main hyperparameters:

  1. β - controls the relative weighting between reconstruction and prior enforcement (a PyTorch sketch of this loss follows below):

     L ∝ MSE + β · KLD

Figure: CVAE reconstruction MSE vs. latent space dimensionality for β = 0.01, 0.1 and 1.

▶ Tradeoff between both terms; we choose β = 0.1.
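A sketch of this loss in PyTorch; the closed-form Gaussian KL divergence is standard, and the variable names are ours:

    import torch
    import torch.nn.functional as F

    def cvae_loss(x_hat, x, mu, logvar, beta=0.1):
        """L ∝ MSE + beta * KLD, with the closed-form Gaussian KL term."""
        mse = F.mse_loss(x_hat, x, reduction="mean")
        kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return mse + beta * kld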


Network Architecture

▶ Main hyperparameters:

  2. Dimensionality of the latent space - the network's reconstruction ability.

Figure: Reconstruction MSE vs. latent space dimensionality, CVAE (β = 0.1) vs. AE.

▶ Steep fall initially, flatter later. We choose a dimensionality of 32.


Network Architecture

▶ Network size: [91, 91, 32, 91, 91]
  • 91 is the dimension of the input CCs, 32 is the latent space dimensionality.

▶ Linear fully connected layers + Leaky ReLU activations

▶ ADAM [Kingma and Ba 2014] with initial learning rate 10^-3

▶ Training for 2000 epochs with batch size 512 (a model sketch follows below)
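Putting those choices together, a minimal PyTorch sketch of such a CVAE; the exact conditioning scheme (here, concatenating the pitch to the encoder and decoder inputs) is our assumption:

    import torch
    import torch.nn as nn

    class CVAE(nn.Module):
        """[91, 91, 32, 91, 91] fully connected CVAE with pitch conditioning."""
        def __init__(self, in_dim=91, hid_dim=91, z_dim=32, cond_dim=1):
            super().__init__()
            self.enc = nn.Sequential(nn.Linear(in_dim + cond_dim, hid_dim), nn.LeakyReLU())
            self.mu = nn.Linear(hid_dim, z_dim)
            self.logvar = nn.Linear(hid_dim, z_dim)
            self.dec = nn.Sequential(nn.Linear(z_dim + cond_dim, hid_dim), nn.LeakyReLU(),
                                     nn.Linear(hid_dim, in_dim))

        def forward(self, ccs, f0):
            h = self.enc(torch.cat([ccs, f0], dim=-1))
            mu, logvar = self.mu(h), self.logvar(h)
            z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization trick
            return self.dec(torch.cat([z, f0], dim=-1)), mu, logvar

    model = CVAE()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)            # ADAM, initial lr = 1e-3

Training would then loop over batches of 512 CC frames for 2000 epochs, using the β-weighted loss sketched earlier.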

Experiments

▶ Two kinds of experiments to demonstrate the network's capabilities:

  1. Reconstruction - omit pitch instances during training and see how well the model reconstructs notes of the omitted target pitch.

  2. Generation - how well the model 'synthesizes' note instances with new, unseen pitches.


Reconstruction

▶ Two training contexts:

  1. T (✗) is the target pitch, ✓ marks training instances:

     MIDI   T-3   T-2   T-1    T    T+1   T+2   T+3
     Kept    ✓     ✓     ✓     ✗     ✓     ✓     ✓

  2. Octave endpoints:

     MIDI   60   61   62   63   64   65   66   67   68   69   70   71
     Kept    ✓    ✗    ✗    ✗    ✗    ✗    ✗    ✗    ✗    ✗    ✗    ✓

▶ In each of the above cases, we compute the MSE as the frame-wise spectral envelope match across all frames of all the target instances.


Reconstruction

Figure: Reconstruction MSE vs. target pitch (MIDI 60-71) for the AE and CVAE.

▶ Both the AE and CVAE reasonably reconstruct the target pitch (audio examples 6-8).

▶ The CVAE produces better reconstruction, especially when the target pitch is far from the pitches available in the training data (plot above; audio examples 9-11).

▶ Conditioning helps to capture the pitch dependency of the spectral envelope more accurately.

Reconstruction

▶ To emulate the effect of pitch conditioning with an AE, we train the AE by appending the pitch to the input CCs and reconstructing this appended input, as shown below (a two-line sketch follows after the figure).

[Block diagram: (f0, CCs) → cAE → (f0', CCs')]
Figure: 'Conditional' AE (cAE)

▶ [Wyse 2018] followed a similar approach of appending the conditional variables to the input of his model.

▶ The 'cAE' is comparable to our proposed CVAE, in that the network might potentially learn something from f0.
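A two-line sketch of that input construction (an illustration, not the authors' code):

    import torch

    def cae_io(ccs, f0):
        """cAE input/target: the pitch is appended to the CCs and reconstructed along with them."""
        x = torch.cat([f0, ccs], dim=-1)   # input  = (f0, CCs)
        return x, x                        # target = the same appended vector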

Reconstruction

Figure: Reconstruction MSE vs. target pitch (MIDI 60-71) for the AE, CVAE and cAE (octave-endpoint training).

▶ We train the cAE on the octave endpoints and evaluate performance as shown above.

▶ Appending f0 does not improve performance; it is in fact worse than the AE.

Generation

▶ We are interested in 'synthesizing' audio.

▶ We see how well the network can generate an instance of a desired pitch (which the network has not been trained on).

[Block diagram: Z, f0 → Decoder]
Figure: Sampling from the network

▶ We follow the procedure in [Blaauw and Bonada 2016] - a random walk to sample points coherently from the latent space (see the sketch below).
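A sketch of that sampling step, assuming the CVAE `model` sketched earlier; the step size and walk length are assumptions:

    import torch

    @torch.no_grad()
    def generate_frames(model, f0_track, z_dim=32, step=0.05):
        """Decode a slow random walk in the latent space, conditioned on a desired f0 contour."""
        z = torch.zeros(z_dim)
        frames = []
        for f0 in f0_track:                              # one f0 value per synthesis frame
            z = z + step * torch.randn(z_dim)            # coherent random walk in the latent space
            x = model.dec(torch.cat([z, f0.view(1)]))    # decoder conditioned on the desired pitch
            frames.append(x)
        return torch.stack(frames)                       # CC frames, to be rendered back to audio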

Generation

▶ We train on instances across the entire octave sans MIDI 65, and then generate MIDI 65.

▶ On listening, we can see that the generated audio still lacks the soft, noisy sound of the violin bowing (audio examples 12-14).

▶ For a more realistic synthesis, we generate a violin note with vibrato (audio example 15).

Putting it all together

▶ We explored autoencoder frameworks in generative models for audio synthesis of instrumental tones.

▶ Our parametric representation decouples 'timbre' and 'pitch', thus relying on the network to model the inter-dependencies.

▶ Pitch conditioning allows us to generate the learnt spectral envelope for that pitch, thus enabling us to vary the pitch contour continuously.


Putting it all together

▶ Our model is still far from perfect, however, and we have to improve it further:

  1. An immediate thing to do will be to model the residual of the input audio. This will help in making the generated audio sound more natural.

  2. We have not taken dynamics into account. A more complete system will involve capturing spectral envelope (timbral) dependencies on both pitch and loudness dynamics.

  3. Conducting more formal listening tests, involving synthesis of larger pitch movements or melodic elements from Indian Classical music.

▶ Our contributions:

  1. To the best of our knowledge, we have not come across any work using a parametric model for musical tones in the neural synthesis framework, especially one exploiting the conditioning function of the CVAE.

  2. The TAE method has no known Python implementation, so we plan to make our code open-source to the MIR community for research.


References I

[Blaauw and Bonada 2016] Blaauw, M. and Bonada, J. (2016). Modeling and transforming speech using variational autoencoders. In Interspeech, pages 1770-1774.

[Caetano and Rodet 2012] Caetano, M. and Rodet, X. (2012). A source-filter model for musical instrument sound transformation. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 137-140. IEEE.

[Doersch 2016] Doersch, C. (2016). Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.

[Engel et al 2017] Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D., and Simonyan, K. (2017). Neural audio synthesis of musical notes with WaveNet autoencoders. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1068-1077. JMLR.org.

[Esling et al 2018] Esling, P., Bitton, A., et al. (2018). Generative timbre spaces: regularizing variational auto-encoders with perceptual metrics. arXiv preprint arXiv:1805.08501.

References II

[Griffin and Lim 1984] Griffin, D. and Lim, J. (1984). Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236-243.

[Hinton and Salakhutdinov 2006] Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504-507.

[IMAI 1979] Imai, S. (1979). Spectral envelope extraction by improved cepstrum. IEICE, 62:217-228.

[Kingma and Ba 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

[Kingma and Welling 2013] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

[Oord et al 2016] Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.

References III

[Roche et al 2018] Roche, F., Hueber, T., Limier, S., and Girin, L. (2018). Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models. arXiv preprint arXiv:1806.04096.

[Roebel and Rodet 2005] Roebel, A. and Rodet, X. (2005). Efficient spectral envelope estimation and its application to pitch shifting and envelope preservation. In International Conference on Digital Audio Effects, pages 30-35, Madrid, Spain.

[Romani Picas et al 2015] Romani Picas, O., Parra Rodriguez, H., Dabiri, D., Tokuda, H., Hariya, W., Oishi, K., and Serra, X. (2015). A real-time system for measuring sound goodness in instrumental sounds. In Audio Engineering Society Convention 138. Audio Engineering Society.

[Sarroff and Casey 2014] Sarroff, A. M. and Casey, M. A. (2014). Musical audio synthesis using autoencoding neural nets. In ICMC.

[Serra et al 1997] Serra, X. et al. (1997). Musical sound modeling with sinusoids plus noise. Musical Signal Processing, pages 91-122.

References IV

[Slawson 1981] Slawson, W. (1981). The color of sound: a theoretical study in musical timbre. Music Theory Spectrum, 3:132-141.

[Sohn et al 2015] Sohn, K., Lee, H., and Yan, X. (2015). Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483-3491.

[Wyse 2018] Wyse, L. (2018). Real-valued parametric conditioning of an RNN for interactive sound synthesis. arXiv preprint arXiv:1805.10808.

Audio examples description I

1. Input MIDI 63 to Spectral Model
2. Spectral Model reconstruction (trained on MIDI 63)
3. Spectral Model reconstruction (not trained on MIDI 63)
4. Input MIDI 60 note to Parametric Model
5. Parametric reconstruction of input note
6. Input MIDI 63 note
7. Parametric AE reconstruction of input
8. Parametric CVAE reconstruction of input
9. Input MIDI 65 note (endpoint-trained model)
10. Parametric AE reconstruction of input (endpoint-trained model)
11. Parametric CVAE reconstruction of input (endpoint-trained model)
12. CVAE-generated MIDI 65 Violin note
13. Similar MIDI 65 Violin note from dataset

Audio examples description II

14. CVAE reconstruction of the MIDI 65 Violin note
15. CVAE-generated MIDI 65 Violin note with vibrato

Page 7: A Variational Parametric Model for Audio Synthesiskrishnasubramani/data/ddp_stage… · analysis-synthesis based reconstruction which fails to model temporal evolution I[Engel et

236

Audio Synthesis

I What comes to your mind when you hear lsquoAudio Synthesisrsquo

I Early analog synthesizers used voltage controlled oscillatorsfilters amplifiers to generate the waveform and lsquoenvelopegeneratorsrsquo to shape it

I Data-driven statistical modeling + computing power=rArr Deep Learning for audio synthesis

336

Generative Models for Audio Synthesis

I Rely on ability of algorithms to extract musically relevantinformation from vast amounts of data

I Autoregressive modeling Generative Adversarial Networks andVariational Autoencoders are some of the proposed generativemodeling methods

I Most methods in literature try to model audio signals directlyin the time or frequency domain

436

Our Nearest Neighbours

I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra

X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis

times lsquoGraininessrsquo in the reconstructed sound

I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures

X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation

I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments

X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space

436

Our Nearest Neighbours

I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra

X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis

times lsquoGraininessrsquo in the reconstructed sound

I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures

X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation

I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments

X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space

436

Our Nearest Neighbours

I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra

X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis

times lsquoGraininessrsquo in the reconstructed sound

I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures

X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation

I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments

X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space

436

Our Nearest Neighbours

I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra

X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis

times lsquoGraininessrsquo in the reconstructed sound

I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures

X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation

I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments

X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space

436

Our Nearest Neighbours

I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra

X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis

times lsquoGraininessrsquo in the reconstructed sound

I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures

X Helped by the release of NSynth [Engel et al 2017]

X Usability of the latent space in audio interpolation

I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments

X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space

436

Our Nearest Neighbours

I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra

X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis

times lsquoGraininessrsquo in the reconstructed sound

I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures

X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation

I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments

X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space

436

Our Nearest Neighbours

I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra

X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis

times lsquoGraininessrsquo in the reconstructed sound

I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures

X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation

I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments

X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space

436

Our Nearest Neighbours

I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra

X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis

times lsquoGraininessrsquo in the reconstructed sound

I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures

X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation

I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments

X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space

536

Our Nearest Neighbours

I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution

I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis

X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from

I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class

X Was able to achieve better control over generation byconditioning

times Unable to generalize to untrained pitches

536

Our Nearest Neighbours

I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution

I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis

X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from

I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class

X Was able to achieve better control over generation byconditioning

times Unable to generalize to untrained pitches

536

Our Nearest Neighbours

I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution

I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis

X Generation of realistic and creative sounds

times Complex network which was difficult to train and sample from

I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class

X Was able to achieve better control over generation byconditioning

times Unable to generalize to untrained pitches

536

Our Nearest Neighbours

I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution

I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis

X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from

I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class

X Was able to achieve better control over generation byconditioning

times Unable to generalize to untrained pitches

536

Our Nearest Neighbours

I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution

I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis

X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from

I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class

X Was able to achieve better control over generation byconditioning

times Unable to generalize to untrained pitches

536

Our Nearest Neighbours

I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution

I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis

X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from

I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class

X Was able to achieve better control over generation byconditioning

times Unable to generalize to untrained pitches

536

Our Nearest Neighbours

I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution

I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis

X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from

I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class

X Was able to achieve better control over generation byconditioning

times Unable to generalize to untrained pitches

6/36

Why Parametric?

I Rather than generating new timbres ('interpolating' across sounds), we consider the problem of synthesizing a given instrument's sound with flexible control over the pitch and loudness dynamics

I Pitch shifting without timbre modification uses a source-filter model, with the filter (spectral envelope) kept constant [Roebel and Rodet 2005]

I A powerful parametric representation, over raw waveform or spectrogram, has the potential to achieve high quality with less training data

1 [Blaauw and Bonada 2016] recognized this in the context of speech synthesis and used a vocoder representation to train a generative model, achieving promising results along the way


7/36

Dataset

I Good-sounds dataset [Romani Picas et al 2015], consisting of individual note and scale recordings for 12 different instruments

I We work with the 'Violin', played in mezzo-forte loudness, and choose the 4th octave (MIDI 60-71)

I Why we chose the Violin: popular in Indian music, human voice-like timbre, and the ability to produce continuous pitch

Figure: Instances per note in the overall dataset (26-34 instances per MIDI note, 60-71)

I Average note duration for the chosen octave is about 4.5 s per note


8/36

Dataset

I We split the data into train (80%) and test (20%) instances across MIDI note labels

I Our model is trained with frames (duration 21.3 ms) from the train instances, and system performance is evaluated with frames from the test instances
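A minimal sketch of such a per-label 80/20 split is given below; the (midi_note, wav_path) bookkeeping and the helper name are assumptions made for illustration, not the authors' actual code.

    import random
    from collections import defaultdict

    def split_by_midi(instances, train_frac=0.8, seed=0):
        """Split note instances into train/test separately for each MIDI label."""
        rng = random.Random(seed)
        by_note = defaultdict(list)
        for midi, path in instances:          # instances: list of (midi_note, wav_path)
            by_note[midi].append(path)
        train, test = [], []
        for midi, paths in by_note.items():
            rng.shuffle(paths)
            k = int(round(train_frac * len(paths)))
            train += [(midi, p) for p in paths[:k]]
            test += [(midi, p) for p in paths[k:]]
        return train, test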

9/36

Non-Parametric Reconstruction

I Frame-wise magnitude spectra reconstruction procedure in [Roche et al 2018] cannot generalize to untrained pitches

I We consider two cases - train including and excluding MIDI 63 along with its 3 neighbouring pitches on either side

Figure: Block diagram - Sustain segment → FFT → Encoder → Z → Decoder (AE)

I Repeat the above across all input spectral frames and invert the obtained spectrogram using Griffin-Lim [Griffin and Lim 1984]
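A rough sketch of this frame-wise analysis-synthesis loop, assuming some trained autoencoder callable `ae_model` that maps one magnitude frame to its reconstruction (the FFT size and hop length are illustrative choices, not values from the slides):

    import numpy as np
    import librosa

    def reconstruct(y, ae_model, n_fft=1024, hop=256):
        S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))      # magnitude frames
        S_hat = np.stack([ae_model(frame) for frame in S.T], axis=1)  # decode frame by frame
        # Invert the magnitude spectrogram; phase is estimated iteratively by Griffin-Lim
        return librosa.griffinlim(S_hat, hop_length=hop, n_fft=n_fft)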


10/36

Non-Parametric Reconstruction

Figure: Input MIDI 63 | Figure: Reconstruction, trained including MIDI 63 | Figure: Reconstruction, trained excluding MIDI 63 (audio examples 1-3)


11/36

Parametric Model

1 Frame-wise magnitude spectrum → harmonic representation using the Harmonic plus Residual (HpR) model [Serra et al 1997] (currently we neglect the residual)

Figure: Block diagram - Sustain segment → FFT → HpR → (f0, harmonic magnitudes)

I Output of HpR block ⇒ log-dB magnitudes + harmonics
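As a crude illustration only (not the sinusoidal peak-picking of the actual HpR model), the per-harmonic dB magnitudes can be approximated by reading the magnitude spectrum at multiples of f0; frame length, sample rate and harmonic count below are assumptions:

    import numpy as np

    def harmonic_magnitudes(mag_spectrum, f0, fs, n_harmonics=40, n_fft=1024):
        """Approximate per-harmonic dB magnitudes from the FFT bin nearest k*f0."""
        mags_db = 20 * np.log10(np.abs(mag_spectrum) + 1e-10)
        harm = []
        for k in range(1, n_harmonics + 1):
            fk = k * f0
            if fk >= fs / 2:
                break
            harm.append(mags_db[int(round(fk * n_fft / fs))])
        return np.array(harm)  # log-dB magnitudes at the harmonic frequencies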


12/36

Parametric Model

2 log-dB magnitudes + harmonics → TAE algorithm [Roebel and Rodet 2005, IMAI 1979]

I TAE ⇒ Iterative Cepstral Liftering:

    A_0(k) = log(|X(k)|), V_0 = −∞
    while (A_i − A_0 < Δ):
        A_i(k) = max(A_{i−1}(k), V_{i−1}(k))
        V_i = FFT(lifter(A_i))

Figure: Block diagram - (f0, harmonic magnitudes) → TAE → (CCs, f0)
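The iteration above can be sketched in NumPy roughly as follows; the lifter order, maximum iteration count, stopping threshold and convergence check are assumptions made for illustration and may differ from the authors' implementation.

    import numpy as np

    def true_amplitude_envelope(log_mag, n_ccs, max_iter=100, delta=0.1):
        """Iterative cepstral smoothing (TAE-style) of a symmetric log-magnitude spectrum."""
        A0 = np.asarray(log_mag, dtype=float)
        A = A0.copy()
        V = np.full_like(A0, -np.inf)
        for _ in range(max_iter):
            A = np.maximum(A, V)                 # A_i(k) = max(A_{i-1}(k), V_{i-1}(k))
            ceps = np.fft.ifft(A).real           # real cepstrum of the updated curve
            ceps[n_ccs:-n_ccs] = 0.0             # lifter: keep only the low-quefrency part
            V = np.fft.fft(ceps).real            # V_i = FFT(lifter(A_i))
            if np.max(A0 - V) < delta:           # envelope covers the peaks within delta
                break
        return V, ceps[:n_ccs]                   # smooth envelope and its CCs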


13/36

Parametric Model

I No open source implementation was available for the TAE, thus we implemented it following the procedure highlighted in [Roebel and Rodet 2005, Caetano and Rodet 2012]

I Figure shows a TAE snap from [Caetano and Rodet 2012]

I Similar to the results we get (audio examples 4, 5)


14/36

Parametric Model

Figure: TAE spectral envelopes, Magnitude (dB) vs Frequency (kHz), for MIDI 60, 65 and 71

I Spectral envelope shape varies across pitch
1 Dependence of the envelope on pitch [Slawson 1981, Caetano and Rodet 2012]
2 Variation due to the TAE algorithm

I TAE → smooth function to estimate harmonic amplitudes

15/36

Parametric Model

I No. of CCs (Cepstral Coefficients, K_cc) depends on the sampling rate (Fs) and pitch (f0):

    K_cc ≤ Fs / (2 · f0)

I For our network, we choose the maximum K_cc (lowest pitch) = 91 and zero pad the higher-pitch CC vectors to 91-dimensional vectors
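A small sketch of that bookkeeping; the constant 91 comes from the slide, everything else (names, integer truncation) is an assumed illustration:

    import numpy as np

    MAX_CCS = 91  # K_cc at the lowest pitch of the chosen octave

    def n_ccs(fs, f0):
        """Number of usable cepstral coefficients: K_cc <= Fs / (2 * f0)."""
        return int(fs / (2 * f0))

    def pad_ccs(ccs, target=MAX_CCS):
        """Zero pad a higher-pitch CC vector to the fixed network input size."""
        out = np.zeros(target)
        out[:min(len(ccs), target)] = ccs[:target]
        return out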

16/36

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimize the MSE (Mean Squared Error) between the input and the network reconstruction
• Simple to train and performs good reconstruction (since they minimize MSE)
• Not truly a generative model, as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] - Inspired by Variational Inference, enforce a prior on the latent space
• Truly generative, as we can 'generate' new data by sampling from the prior

I Conditional Variational Autoencoders [Doersch 2016, Sohn et al 2015] - Same principle as a VAE, however they learn the conditional distribution over an additional conditioning variable


17/36

Generative Models

Why VAE over AE?
Continuous latent space from which we can sample points (and synthesize the corresponding audio)

Why CVAE over VAE?
Conditioning on pitch ⇒ the network captures dependencies between the timbre and the pitch ⇒ more accurate envelope generation + pitch control


18/36

Network Architecture

Figure: Block diagram - (CCs, f0) → Encoder → Z → Decoder (CVAE)

I Network input is CCs → MSE represents a perceptually relevant distance, the squared error between the input and reconstructed log magnitude spectral envelopes

I We train the network on frames from the train instances

I For evaluation, the MSE values reported ahead are the average reconstruction error across all the test instance frames

19/36

Network Architecture

I Main hyperparameters -

1 β - Controls the relative weighting between reconstruction and prior enforcement:

    L ∝ MSE + β · KLD

Figure: CVAE, varying β - MSE (~10⁻⁵ scale) vs latent space dimensionality (0-70), for β = 0.01, 0.1, 1

I Tradeoff between both terms; choose β = 0.1
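For a Gaussian latent with diagonal covariance and a standard-normal prior, this loss can be written as below; the analytic KL term is the usual VAE expression, and the reduction choices are assumptions:

    import torch
    import torch.nn.functional as F

    def cvae_loss(x_hat, x, mu, logvar, beta=0.1):
        """L ∝ MSE + β * KLD for a (C)VAE with a standard-normal prior."""
        mse = F.mse_loss(x_hat, x, reduction="mean")
        # KL( N(mu, sigma^2) || N(0, I) ), averaged over the batch
        kld = -0.5 * torch.mean(torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1))
        return mse + beta * kld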


20/36

Network Architecture

I Main hyperparameters -

2 Dimensionality of the latent space - governs the network's reconstruction ability

Figure: CVAE (β = 0.1) vs AE - MSE (~10⁻⁵ scale) vs latent space dimensionality (0-70)

I Steep fall initially, flatter later; choose dimensionality = 32


21/36

Network Architecture

I Network size: [91, 91, 32, 91, 91]
I 91 is the dimension of the input CCs, 32 is the latent space dimensionality
I Linear Fully Connected Layers + Leaky ReLU activations
I ADAM [Kingma and Ba 2014] with initial lr = 10⁻³
I Training for 2000 epochs with batch size 512
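Put together with the loss sketch above, a minimal PyTorch model consistent with these sizes might look as follows. The layer widths, learning rate, epochs and batch size are from the slides; the conditioning mechanism (concatenating a scalar f0 to the encoder and decoder inputs) and all other details are assumptions for illustration.

    import torch
    import torch.nn as nn

    class CVAE(nn.Module):
        """[91 -> 91 -> 32 -> 91 -> 91] CVAE sketch, conditioned on pitch f0."""
        def __init__(self, in_dim=91, hid=91, z_dim=32, cond_dim=1):
            super().__init__()
            self.enc = nn.Sequential(nn.Linear(in_dim + cond_dim, hid), nn.LeakyReLU())
            self.mu = nn.Linear(hid, z_dim)
            self.logvar = nn.Linear(hid, z_dim)
            self.dec = nn.Sequential(
                nn.Linear(z_dim + cond_dim, hid), nn.LeakyReLU(),
                nn.Linear(hid, in_dim),
            )

        def forward(self, ccs, f0):
            h = self.enc(torch.cat([ccs, f0], dim=1))
            mu, logvar = self.mu(h), self.logvar(h)
            z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
            x_hat = self.dec(torch.cat([z, f0], dim=1))
            return x_hat, mu, logvar

    # Optimizer settings taken from the slides (ADAM, lr = 1e-3); training loop omitted
    model = CVAE()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)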

22/36

Experiments

I Two kinds of experiments to demonstrate the network's capabilities:

1 Reconstruction - Omit pitch instances during training and see how well the model reconstructs notes of the omitted target pitch

2 Generation - How well the model 'synthesizes' note instances with new, unseen pitches


23/36

Reconstruction

I Two training contexts -

1 T (✗) is the target, ✓ marks training instances:
   MIDI:  T-3  T-2  T-1   T   T+1  T+2  T+3
   Kept:   ✓    ✓    ✓    ✗    ✓    ✓    ✓

2 Octave endpoints:
   MIDI:   60   61   62   63   64   65
   Kept:   ✓    ✗    ✗    ✗    ✗    ✗
   MIDI:   66   67   68   69   70   71
   Kept:   ✗    ✗    ✗    ✗    ✗    ✓

I In each of the above cases, we compute the MSE as the frame-wise spectral envelope match across all frames of all the target instances


24/36

Reconstruction

Figure: MSE (~10⁻² scale) vs target pitch (MIDI 60-71) for AE and CVAE

I Both AE and CVAE reasonably reconstruct the target pitch (audio examples 6, 7, 8)

I The CVAE produces better reconstruction, especially when the target pitch is far from the pitches available in the training data (plot above; audio examples 9, 10, 11)

I Conditioning helps to capture the pitch dependency of the spectral envelope more accurately


25/36

Reconstruction

I To emulate the effect of pitch conditioning with an AE, we train the AE by appending the pitch to the input CCs and reconstructing this appended input, as shown below

Figure: 'Conditional' AE (cAE) - (CCs, f0) → AE → (CCs', f0')

I [Wyse 2018] followed a similar approach of appending the conditional variables to the input of his model

I The 'cAE' is comparable to our proposed CVAE in that the network might potentially learn something from f0

26/36

Reconstruction

Figure: MSE (~10⁻² scale) vs target pitch (MIDI 60-71) for AE, CVAE and cAE

I We train the cAE on the octave endpoints and evaluate performance as shown above

I Appending f0 does not improve performance; it is in fact worse than the AE

27/36

Generation

I We are interested in 'synthesizing' audio

I We see how well the network can generate an instance of a desired pitch (which the network has not been trained on)

Figure: Sampling from the network - Z (sampled from the prior) + f0 → Decoder

I We follow the procedure in [Blaauw and Bonada 2016] - a random walk to sample points coherently from the latent space
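A sketch of that sampling procedure, assuming the CVAE class sketched earlier; the walk step size and frame count are illustrative assumptions rather than values from the slides. The decoded CC frames would still have to be converted back to harmonic amplitudes and synthesized (e.g. with a sinusoidal model) to obtain audio.

    import torch

    @torch.no_grad()
    def generate_note(model, f0_value, n_frames=200, z_dim=32, step=0.05):
        """Random walk in latent space, decoded frame by frame at a fixed (untrained) pitch."""
        z = torch.randn(1, z_dim)                       # start from the prior
        f0 = torch.full((1, 1), float(f0_value))
        frames = []
        for _ in range(n_frames):
            z = z + step * torch.randn_like(z)          # small coherent steps between frames
            ccs = model.dec(torch.cat([z, f0], dim=1))  # decoded cepstral envelope frame
            frames.append(ccs.squeeze(0))
        return torch.stack(frames)                      # (n_frames, 91) envelope parameters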

28/36

Generation

I We train on instances across the entire octave sans MIDI 65, and then generate MIDI 65

I On listening, we find that the generated audio still lacks the soft noisy sound of the violin bowing (audio examples 12, 13, 14)

I For a more realistic synthesis, we generate a violin note with vibrato (audio example 15)


29/36

Putting it all together

I We explored autoencoder frameworks in generative models for audio synthesis of instrumental tones

I Our parametric representation decouples 'timbre' and 'pitch', thus relying on the network to model the inter-dependencies

I Pitch conditioning allows us to generate the learnt spectral envelope for that pitch, thus enabling us to vary the pitch contour continuously


30/36

Putting it all together

I Our model is still far from perfect, however, and we have to improve it further:

1 An immediate thing to do will be to model the residual of the input audio. This will help in making the generated audio sound more natural
2 We have not taken dynamics into account. A more complete system will involve capturing spectral envelope (timbral) dependencies on both pitch and loudness dynamics
3 Conducting more formal listening tests, involving synthesis of larger pitch movements or melodic elements from Indian Classical music

I Our contributions:

1 To the best of our knowledge, we have not come across any work using a parametric model for musical tones in the neural synthesis framework, especially exploiting the conditioning function of the CVAE
2 The TAE method has no known Python implementation, so we plan to make our code open-source to the MIR community for research


31/36

References I

[Blaauw and Bonada 2016] Blaauw, M. and Bonada, J. (2016). Modeling and transforming speech using variational autoencoders. In Interspeech, pages 1770–1774.

[Caetano and Rodet 2012] Caetano, M. and Rodet, X. (2012). A source-filter model for musical instrument sound transformation. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 137–140. IEEE.

[Doersch 2016] Doersch, C. (2016). Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.

[Engel et al 2017] Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D., and Simonyan, K. (2017). Neural audio synthesis of musical notes with wavenet autoencoders. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1068–1077. JMLR.org.

[Esling et al 2018] Esling, P., Bitton, A., et al. (2018). Generative timbre spaces: regularizing variational auto-encoders with perceptual metrics. arXiv preprint arXiv:1805.08501.

32/36

References II

[Griffin and Lim 1984] Griffin, D. and Lim, J. (1984). Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236–243.

[Hinton and Salakhutdinov 2006] Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507.

[IMAI 1979] Imai, S. (1979). Spectral envelope extraction by improved cepstrum. IEICE, 62:217–228.

[Kingma and Ba 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

[Kingma and Welling 2013] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

[Oord et al 2016] Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.

33/36

References III

[Roche et al 2018] Roche, F., Hueber, T., Limier, S., and Girin, L. (2018). Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models. arXiv preprint arXiv:1806.04096.

[Roebel and Rodet 2005] Roebel, A. and Rodet, X. (2005). Efficient spectral envelope estimation and its application to pitch shifting and envelope preservation. In International Conference on Digital Audio Effects, pages 30–35, Madrid, Spain.

[Romani Picas et al 2015] Romani Picas, O., Parra Rodriguez, H., Dabiri, D., Tokuda, H., Hariya, W., Oishi, K., and Serra, X. (2015). A real-time system for measuring sound goodness in instrumental sounds. In Audio Engineering Society Convention 138. Audio Engineering Society.

[Sarroff and Casey 2014] Sarroff, A. M. and Casey, M. A. (2014). Musical audio synthesis using autoencoding neural nets. In ICMC.

[Serra et al 1997] Serra, X. et al. (1997). Musical sound modeling with sinusoids plus noise. Musical Signal Processing, pages 91–122.

34/36

References IV

[Slawson 1981] Slawson, W. (1981). The color of sound: a theoretical study in musical timbre. Music Theory Spectrum, 3:132–141.

[Sohn et al 2015] Sohn, K., Lee, H., and Yan, X. (2015). Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483–3491.

[Wyse 2018] Wyse, L. (2018). Real-valued parametric conditioning of an RNN for interactive sound synthesis. arXiv preprint arXiv:1805.10808.

35/36

Audio examples description I

1 Input MIDI 63 to Spectral Model
2 Spectral Model Reconstruction (trained on MIDI 63)
3 Spectral Model Reconstruction (not trained on MIDI 63)
4 Input MIDI 60 note to Parametric Model
5 Parametric Reconstruction of input note
6 Input MIDI 63 Note
7 Parametric AE reconstruction of input
8 Parametric CVAE reconstruction of input
9 Input MIDI 65 note (endpoint trained model)
10 Parametric AE reconstruction of input (endpoint trained model)
11 Parametric CVAE reconstruction of input (endpoint trained model)
12 CVAE Generated MIDI 65 Violin note
13 Similar MIDI 65 Violin note from dataset

36/36

Audio examples description II

14 CVAE Reconstruction of the MIDI 65 violin note
15 CVAE Generated MIDI 65 Violin note with vibrato

  • Notes
      1. fdrm17
      2. fdrm16
      3. fdrm15
      4. fdrm14
      5. fdrm13
      6. fdrm12
      7. fdrm11
      8. fdrm10
      9. fdrm9
      10. fdrm8
      11. fdrm7
      12. fdrm6
      13. fdrm5
      14. fdrm3
      15. fdrm1
Page 8: A Variational Parametric Model for Audio Synthesiskrishnasubramani/data/ddp_stage… · analysis-synthesis based reconstruction which fails to model temporal evolution I[Engel et

336

Generative Models for Audio Synthesis

I Rely on ability of algorithms to extract musically relevantinformation from vast amounts of data

I Autoregressive modeling Generative Adversarial Networks andVariational Autoencoders are some of the proposed generativemodeling methods

I Most methods in literature try to model audio signals directlyin the time or frequency domain

436

Our Nearest Neighbours

I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra

X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis

times lsquoGraininessrsquo in the reconstructed sound

I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures

X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation

I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments

X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space

436

Our Nearest Neighbours

I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra

X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis

times lsquoGraininessrsquo in the reconstructed sound

I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures

X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation

I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments

X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space

436

Our Nearest Neighbours

I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra

X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis

times lsquoGraininessrsquo in the reconstructed sound

I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures

X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation

I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments

X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space

436

Our Nearest Neighbours

I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra

X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis

times lsquoGraininessrsquo in the reconstructed sound

I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures

X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation

I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments

X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space

436

Our Nearest Neighbours

I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra

X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis

times lsquoGraininessrsquo in the reconstructed sound

I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures

X Helped by the release of NSynth [Engel et al 2017]

X Usability of the latent space in audio interpolation

I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments

X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space

436

Our Nearest Neighbours

I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra

X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis

times lsquoGraininessrsquo in the reconstructed sound

I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures

X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation

I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments

X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space

436

Our Nearest Neighbours

I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra

X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis

times lsquoGraininessrsquo in the reconstructed sound

I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures

X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation

I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments

X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space

436

Our Nearest Neighbours

I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra

X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis

times lsquoGraininessrsquo in the reconstructed sound

I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures

X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation

I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments

X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space

536

Our Nearest Neighbours

I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution

I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis

X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from

I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class

X Was able to achieve better control over generation byconditioning

times Unable to generalize to untrained pitches

536

Our Nearest Neighbours

I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution

I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis

X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from

I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class

X Was able to achieve better control over generation byconditioning

times Unable to generalize to untrained pitches

536

Our Nearest Neighbours

I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution

I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis

X Generation of realistic and creative sounds

times Complex network which was difficult to train and sample from

I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class

X Was able to achieve better control over generation byconditioning

times Unable to generalize to untrained pitches

536

Our Nearest Neighbours

I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution

I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis

X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from

I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class

X Was able to achieve better control over generation byconditioning

times Unable to generalize to untrained pitches

536

Our Nearest Neighbours

I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution

I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis

X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from

I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class

X Was able to achieve better control over generation byconditioning

times Unable to generalize to untrained pitches

536

Our Nearest Neighbours

I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution

I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis

X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from

I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class

X Was able to achieve better control over generation byconditioning

times Unable to generalize to untrained pitches

536

Our Nearest Neighbours

I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution

I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis

X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from

I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class

X Was able to achieve better control over generation byconditioning

times Unable to generalize to untrained pitches

636

Why Parametric

I Rather than generating new timbres(lsquointerpolatingrsquo acrosssounds) we consider the problem of synthesis of a giveninstrument sound with flexible control over the pitch andloudness dynamics

I Pitch shifting without timbre modification uses a source-filtermodel with the filter(spectral envelope) being kept constant[Roebel and Rodet 2005]

I A powerful parametric representation over raw waveform orspectrogram has the potential to achieve high quality with lesstraining data

1 [Blaauw and Bonada 2016] recognized this in context ofspeech synthesis and used a vocoder representation to train agenerative model achieving promising results along the way

636

Why Parametric

I Rather than generating new timbres(lsquointerpolatingrsquo acrosssounds) we consider the problem of synthesis of a giveninstrument sound with flexible control over the pitch andloudness dynamics

I Pitch shifting without timbre modification uses a source-filtermodel with the filter(spectral envelope) being kept constant[Roebel and Rodet 2005]

I A powerful parametric representation over raw waveform orspectrogram has the potential to achieve high quality with lesstraining data

1 [Blaauw and Bonada 2016] recognized this in context ofspeech synthesis and used a vocoder representation to train agenerative model achieving promising results along the way

736

Dataset

I Good-sounds dataset [Romani Picas et al 2015] consisting ofindividual note and scale recordings for 12 different instruments

I We work with the lsquoViolinrsquo played in mezzo-forte loudnessand choose the 4th octave(MIDI 60-71)

0 5 10 15 20 25 30 35

606162636465666768697071

282626262626

343030

262626

Instances

MID

I

Figure Instances per note in the overall dataset

I Average note duration for the chosen octave is about 45s pernote

736

Dataset

I Good-sounds dataset [Romani Picas et al 2015] consisting ofindividual note and scale recordings for 12 different instruments

I We work with the lsquoViolinrsquo played in mezzo-forte loudnessand choose the 4th octave(MIDI 60-71)

Why we chose ViolinPopular in Indian Music Human voice-like timbre

Ability to produce continuous pitch

0 5 10 15 20 25 30 35

606162636465666768697071

282626262626

343030

262626

Instances

MID

I

Figure Instances per note in the overall dataset

I Average note duration for the chosen octave is about 45s pernote

836

Dataset

I We split the data to train(80) and test(20) instances acrossMIDI note labels

I Our model is trained with frames(duration 213ms) from thetrain instances and system performance is evaluated withframes from the test instances

936

Non-Parametric Reconstruction

I Framewise magnitude spectra reconstruction procedure in[Roche et al 2018] cannot generalize to untrained pitches

I We consider two cases - Train including and excluding MIDI 63along with its 3 neighbouring pitches on either side

Sustain

FFT

En

cod

er

Z

Dec

od

er

AE

I Repeat above across all input spectral frames and invertobtained spectrogram using Griffin-Lim [Griffin and Lim 1984]

936

Non-Parametric Reconstruction

I Framewise magnitude spectra reconstruction procedure in[Roche et al 2018] cannot generalize to untrained pitches

I We consider two cases - Train including and excluding MIDI 63along with its 3 neighbouring pitches on either side

Sustain

FFT

En

cod

er

Z

Dec

od

er

AE

I Repeat above across all input spectral frames and invertobtained spectrogram using Griffin-Lim [Griffin and Lim 1984]

936

Non-Parametric Reconstruction

I Framewise magnitude spectra reconstruction procedure in[Roche et al 2018] cannot generalize to untrained pitches

I We consider two cases - Train including and excluding MIDI 63along with its 3 neighbouring pitches on either side

Sustain

FFT

En

cod

er

Z

Dec

od

er

AE

I Repeat above across all input spectral frames and invertobtained spectrogram using Griffin-Lim [Griffin and Lim 1984]

1036

Non-Parametric Reconstruction

Figure Input MIDI 63 1 1

Figure Including MIDI 63 2 2 Figure Excluding MIDI 63 3 3

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton1)ocgs[i]state=false

1728

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton3)ocgs[i]state=false

1728

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton5)ocgs[i]state=false

1728

1136

Parametric Model

1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)

Sustain

FFTHpR

f0Magnitudes

I Output of HpR block =rArr log-dB magnitudes + harmonics

1136

Parametric Model

1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)

Sustain

FFTHpR

f0Magnitudes

I Output of HpR block =rArr log-dB magnitudes + harmonics

1136

Parametric Model

1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)

Sustain

FFTHpR

f0Magnitudes

I Output of HpR block =rArr log-dB magnitudes + harmonics

1236

Parametric Model

2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]

I TAE =rArr Iterative Cepstral Liftering

A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))

Vi = FFT (lifter(Ai ))

f0Magnitudes

TAE

CCs

f0

1236

Parametric Model

2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]

I TAE =rArr Iterative Cepstral Liftering

A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))

Vi = FFT (lifter(Ai ))

f0Magnitudes

TAE

CCs

f0

1236

Parametric Model

2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]

I TAE =rArr Iterative Cepstral Liftering

A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))

Vi = FFT (lifter(Ai ))

f0Magnitudes

TAE

CCs

f0

1336

Parametric Model

I No open source implementation available for the TAE thus weimplemented it following procedure highlighted in[Roebel and Rodet 2005 Caetano and Rodet 2012]

I Figure shows a TAE snap from [Caetano and Rodet 2012]

I Similar to results we get 1 4 2 5

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton6)ocgs[i]state=false

3216

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton7)ocgs[i]state=false

2256

1436

Parametric Model

0 1 2 3 4 5minus80

minus60

minus40

minus20

Frequency(kHz)Magnitude(dB)

606571

I Spectral envelope shape varies across pitch

1 Dependence of envelope on pitch[Slawson 1981 Caetano and Rodet 2012]

2 Variation due the TAE algorithm

I TAE rarr smooth function to estimate harmonic amplitudes

1536

Parametric Model

I No of CCs(Cepstral CoefficientsKcc) depends on Samplingrate(Fs)pitch(f0)

Kcc leFs2f0

I For our network we choose the maximum Kcc(lowest pitch) =91 and zero pad the high pitch Kcc to 91 dimension vectors

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1736

Generative Models

Why VAE over AE

Continuous latent space from which we can sample points(andsynthesize the corresponding audio)

Why CVAE over VAE

Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control

1736

Generative Models

Why VAE over AE

Continuous latent space from which we can sample points(andsynthesize the corresponding audio)

Why CVAE over VAE

Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control

1736

Generative Models

Why VAE over AE

Continuous latent space from which we can sample points(andsynthesize the corresponding audio)

Why CVAE over VAE

Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control

1736

Generative Models

Why VAE over AE

Continuous latent space from which we can sample points(andsynthesize the corresponding audio)

Why CVAE over VAE

Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control

1836

Network Architecture

CCs

f0

En

cod

er

Z

Dec

od

er

CVAE

I Network input is CCs rarr MSE represents perceptually relevantdistance in terms of squared error between the input andreconstructed log magnitude spectral envelopes

I We train the network on frames from the train instances

I For evaluation MSE values calculated ahead is the averagereconstruction error across all the test instance frames

1936

Network Architecture

I Main hyperparameters:

1. β - controls the relative weighting between reconstruction and prior enforcement:

L ∝ MSE + β · KLD

[Plot: MSE (scale 10⁻⁵) vs. latent space dimensionality, for β = 0.01, 0.1, 1]

Figure: CVAE, varying β

I Tradeoff between both terms; we choose β = 0.1 (a minimal loss sketch follows)
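As a rough illustration of how such a β-weighted loss can be assembled (a sketch under our assumptions: Gaussian encoder outputs mu/logvar and a closed-form KL against a standard-normal prior; not necessarily the exact training code):

    import torch
    import torch.nn.functional as F

    def cvae_loss(x, x_hat, mu, logvar, beta=0.1):
        # Reconstruction term: MSE between input and reconstructed cepstral coefficients
        mse = F.mse_loss(x_hat, x, reduction="mean")
        # KL divergence of N(mu, sigma^2) from the standard-normal prior N(0, I)
        kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        # L ∝ MSE + β · KLD : β trades reconstruction quality against prior enforcement
        return mse + beta * kld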


2036

Network Architecture

I Main hyperparameters:

2. Dimensionality of the latent space - affects the network's reconstruction ability

[Plot: MSE (scale 10⁻⁵) vs. latent space dimensionality for the CVAE and the AE]

Figure: CVAE (β = 0.1) vs AE

I Steep fall initially, flatter later; we choose dimensionality = 32


2136

Network Architecture

I Network size: [91, 91, 32, 91, 91]

I 91 is the dimension of the input CCs; 32 is the latent space dimensionality

I Linear fully connected layers + Leaky ReLU activations

I ADAM [Kingma and Ba 2014] with initial lr = 10⁻³

I Training for 2000 epochs with batch size 512 (a sketch of such a network follows)
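A minimal sketch of what such a network could look like (an assumption of ours, not the released code; the pitch condition is shown here as a one-dimensional f0 value concatenated to both the encoder and decoder inputs):

    import torch
    import torch.nn as nn

    class CVAE(nn.Module):
        # Layer sizes follow the slide: [91, 91, 32, 91, 91], with f0 appended as the condition
        def __init__(self, cc_dim=91, hidden_dim=91, latent_dim=32):
            super().__init__()
            self.enc = nn.Sequential(nn.Linear(cc_dim + 1, hidden_dim), nn.LeakyReLU())
            self.mu = nn.Linear(hidden_dim, latent_dim)
            self.logvar = nn.Linear(hidden_dim, latent_dim)
            self.dec = nn.Sequential(
                nn.Linear(latent_dim + 1, hidden_dim), nn.LeakyReLU(),
                nn.Linear(hidden_dim, cc_dim),
            )

        def forward(self, cc, f0):
            h = self.enc(torch.cat([cc, f0], dim=-1))
            mu, logvar = self.mu(h), self.logvar(h)
            z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
            return self.dec(torch.cat([z, f0], dim=-1)), mu, logvar

    # Training setup as on the slide: ADAM, lr = 1e-3, 2000 epochs, batch size 512
    model = CVAE()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)

Each training step would then apply a β-weighted loss such as the cvae_loss sketch shown earlier.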

2236

Experiments

I Two kinds of experiments to demonstrate the network's capabilities:

1. Reconstruction - omit pitch instances during training and see how well the model reconstructs notes of the omitted target pitch

2. Generation - how well the model 'synthesizes' note instances with new, unseen pitches


2336

Reconstruction

I Two training contexts:

1. Neighbouring pitches: T (✗) is the target, ✓ marks training instances
   MIDI: T-3  T-2  T-1   T   T+1  T+2  T+3
   Kept:  ✓    ✓    ✓    ✗    ✓    ✓    ✓

2. Octave endpoints:
   MIDI: 60 61 62 63 64 65 66 67 68 69 70 71
   Kept:  ✓  ✗  ✗  ✗  ✗  ✗  ✗  ✗  ✗  ✗  ✗  ✓

I In each of the above cases we compute the MSE as the frame-wise spectral envelope match across all frames of all the target instances (see the evaluation sketch below)
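A sketch of how such an evaluation could be computed (our assumption: test_frames holds the CC frames of the held-out target-pitch instances and model is the hypothetical CVAE sketched earlier; the names are illustrative):

    import torch

    def average_reconstruction_mse(model, test_frames, test_f0):
        # test_frames: (N, 91) cepstral-coefficient frames of the target-pitch test instances
        # test_f0:     (N, 1) pitch condition for each frame
        model.eval()
        with torch.no_grad():
            recon, _, _ = model(test_frames, test_f0)
            # Frame-wise squared error on the log magnitude envelope representation,
            # averaged across all frames of all target instances
            return torch.mean((recon - test_frames) ** 2).item()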


2436

Reconstruction

[Plot: reconstruction MSE (scale 10⁻²) vs. target pitch (MIDI 60-72) for the AE and the CVAE]

I Both the AE and the CVAE reasonably reconstruct the target pitch (audio examples 6, 7, 8)

I The CVAE produces better reconstruction, especially when the target pitch is far from the pitches available in the training data (plot above; audio examples 9, 10, 11)

I Conditioning helps to capture the pitch dependency of the spectral envelope more accurately


2536

Reconstruction

I To emulate the effect of pitch conditioning with an AE, we train the AE by appending the pitch to the input CCs and reconstructing this appended input, as shown below

[Figure: 'Conditional' AE (cAE) - input [CCs, f0] → cAE → reconstructed [CCs′, f0′]]

I [Wyse 2018] followed a similar approach of appending the conditional variables to the input of his model

I The 'cAE' is comparable to our proposed CVAE in that the network might potentially learn something from f0 (a sketch of the cAE input/target construction follows)
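A minimal sketch of how the cAE input/target could be assembled (our assumption: frame-wise CC vectors with the f0 value appended, both reconstructed by a plain autoencoder; the layer sizes are illustrative):

    import torch
    import torch.nn as nn

    def make_cae_pairs(cc, f0):
        # cc: (N, 91) cepstral coefficients, f0: (N, 1) pitch values
        # The cAE input and its reconstruction target are the same appended vector [CCs, f0]
        x = torch.cat([cc, f0], dim=-1)
        return x, x

    # A plain autoencoder over the 92-dimensional appended vector (91 CCs + 1 pitch)
    cae = nn.Sequential(
        nn.Linear(92, 91), nn.LeakyReLU(),
        nn.Linear(91, 32), nn.LeakyReLU(),
        nn.Linear(32, 91), nn.LeakyReLU(),
        nn.Linear(91, 92),
    )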

2636

Reconstruction

[Plot: reconstruction MSE (scale 10⁻²) vs. pitch (MIDI 60-72) for the AE, CVAE and cAE]

I We train the cAE on the octave endpoints and evaluate performance as shown above

I Appending f0 does not improve performance; it is in fact worse than the AE

2736

Generation

I We are interested in 'synthesizing' audio

I We see how well the network can generate an instance of a desired pitch (which the network has not been trained on)

[Figure: Sampling from the network - latent Z and the target f0 fed to the Decoder]

I We follow the procedure in [Blaauw and Bonada 2016] - a random walk to sample points coherently from the latent space (see the sketch below)
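A sketch of such a random-walk generation loop (our assumptions: small Gaussian steps in the latent space, each step decoded with the desired f0 condition; model is the hypothetical CVAE sketched earlier and the step size is illustrative):

    import torch

    def generate_frames(model, f0_value, n_frames=200, latent_dim=32, step=0.05):
        # Random walk in the latent space: small correlated steps give coherent frame sequences
        model.eval()
        z = torch.randn(1, latent_dim)
        f0 = torch.tensor([[f0_value]])
        frames = []
        with torch.no_grad():
            for _ in range(n_frames):
                z = z + step * torch.randn_like(z)          # small Gaussian step
                cc = model.dec(torch.cat([z, f0], dim=-1))  # decode CCs for the target pitch
                frames.append(cc)
        return torch.cat(frames, dim=0)  # (n_frames, 91) cepstral-coefficient frames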

2836

Generation

I We train on instances across the entire octave sans MIDI 65 and then generate MIDI 65

I On listening, we can see that the generated audio still lacks the soft noisy sound of the violin bowing (audio examples 12, 13, 14)

I For a more realistic synthesis, we generate a violin note with vibrato (audio example 15; a sketch of the vibrato pitch contour follows)
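A sketch of how a vibrato note could be generated with this setup (our assumption: a sinusoidally modulated f0 contour fed frame-by-frame to the decoder, reusing the hypothetical model from the earlier sketches; the frame rate, vibrato rate and depth values are illustrative):

    import numpy as np
    import torch

    def vibrato_f0(midi=65, n_frames=200, frame_rate=86.0, rate_hz=5.5, depth_cents=30):
        # Sinusoidal pitch modulation around the note's centre frequency
        t = np.arange(n_frames) / frame_rate
        f_center = 440.0 * 2 ** ((midi - 69) / 12)
        return f_center * 2 ** (depth_cents * np.sin(2 * np.pi * rate_hz * t) / 1200)

    # Decode one spectral-envelope frame per f0 value (model.dec from the CVAE sketch)
    # f0s = torch.tensor(vibrato_f0(), dtype=torch.float32).unsqueeze(1)
    # frames = [model.dec(torch.cat([torch.randn(1, 32), f.view(1, 1)], dim=-1)) for f in f0s]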


2936

Putting it all together

I We explored autoencoder frameworks in generative models for audio synthesis of instrumental tones

I Our parametric representation decouples 'timbre' and 'pitch', relying on the network to model their inter-dependencies

I Pitch conditioning allows us to generate the learnt spectral envelope for that pitch, thus enabling us to vary the pitch contour continuously


3036

Putting it all together

I Our model is still far from perfect, however, and we have to improve it further:

1. An immediate step will be to model the residual of the input audio. This will help in making the generated audio sound more natural

2. We have not taken dynamics into account. A more complete system will involve capturing spectral envelope (timbral) dependencies on both pitch and loudness dynamics

3. Conducting more formal listening tests involving synthesis of larger pitch movements or melodic elements from Indian Classical music

I Our contributions:

1. To the best of our knowledge, we have not come across any work using a parametric model for musical tones in the neural synthesis framework, especially one exploiting the conditioning function of the CVAE

2. The TAE method has no known Python implementation, so we plan to make our code open-source to the MIR community for research


3136

References I

[Blaauw and Bonada, 2016] Blaauw, M. and Bonada, J. (2016). Modeling and transforming speech using variational autoencoders. In Interspeech, pages 1770-1774.

[Caetano and Rodet, 2012] Caetano, M. and Rodet, X. (2012). A source-filter model for musical instrument sound transformation. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 137-140. IEEE.

[Doersch, 2016] Doersch, C. (2016). Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.

[Engel et al., 2017] Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D., and Simonyan, K. (2017). Neural audio synthesis of musical notes with WaveNet autoencoders. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1068-1077. JMLR.org.

[Esling et al., 2018] Esling, P., Bitton, A., et al. (2018). Generative timbre spaces: regularizing variational auto-encoders with perceptual metrics. arXiv preprint arXiv:1805.08501.

3236

References II

[Griffin and Lim, 1984] Griffin, D. and Lim, J. (1984). Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236-243.

[Hinton and Salakhutdinov, 2006] Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504-507.

[IMAI, 1979] Imai, S. (1979). Spectral envelope extraction by improved cepstrum. IEICE, 62:217-228.

[Kingma and Ba, 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

[Kingma and Welling, 2013] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

[Oord et al., 2016] Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.

3336

References III

[Roche et al., 2018] Roche, F., Hueber, T., Limier, S., and Girin, L. (2018). Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models. arXiv preprint arXiv:1806.04096.

[Roebel and Rodet, 2005] Roebel, A. and Rodet, X. (2005). Efficient spectral envelope estimation and its application to pitch shifting and envelope preservation. In International Conference on Digital Audio Effects, pages 30-35, Madrid, Spain.

[Romani Picas et al., 2015] Romani Picas, O., Parra Rodriguez, H., Dabiri, D., Tokuda, H., Hariya, W., Oishi, K., and Serra, X. (2015). A real-time system for measuring sound goodness in instrumental sounds. In Audio Engineering Society Convention 138. Audio Engineering Society.

[Sarroff and Casey, 2014] Sarroff, A. M. and Casey, M. A. (2014). Musical audio synthesis using autoencoding neural nets. In ICMC.

[Serra et al., 1997] Serra, X. et al. (1997). Musical sound modeling with sinusoids plus noise. Musical Signal Processing, pages 91-122.

3436

References IV

[Slawson, 1981] Slawson, W. (1981). The color of sound: a theoretical study in musical timbre. Music Theory Spectrum, 3:132-141.

[Sohn et al., 2015] Sohn, K., Lee, H., and Yan, X. (2015). Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483-3491.

[Wyse, 2018] Wyse, L. (2018). Real-valued parametric conditioning of an RNN for interactive sound synthesis. arXiv preprint arXiv:1805.10808.

3536

Audio examples description I

1. Input MIDI 63 to Spectral Model

2. Spectral Model Reconstruction (trained on MIDI 63)

3. Spectral Model Reconstruction (not trained on MIDI 63)

4. Input MIDI 60 note to Parametric Model

5. Parametric Reconstruction of input note

6. Input MIDI 63 Note

7. Parametric AE reconstruction of input

8. Parametric CVAE reconstruction of input

9. Input MIDI 65 note (endpoint-trained model)

10. Parametric AE reconstruction of input (endpoint-trained model)

11. Parametric CVAE reconstruction of input (endpoint-trained model)

12. CVAE Generated MIDI 65 Violin note

13. Similar MIDI 65 Violin note from dataset

3636

Audio examples description II

14. CVAE Reconstruction of the MIDI 65 violin note

15. CVAE Generated MIDI 65 Violin note with vibrato


[Hinton and Salakhutdinov 2006] Hinton G E and Salakhutdinov R R (2006)Reducing the dimensionality of data with neural networksscience 313(5786)504ndash507

[IMAI 1979] IMAI S (1979)Spectral envelope extraction by improved cepstrumIEICE 62217ndash228

[Kingma and Ba 2014] Kingma D P and Ba J (2014)Adam A method for stochastic optimizationarXiv preprint arXiv14126980

[Kingma and Welling 2013] Kingma D P and Welling M (2013)Auto-encoding variational bayesarXiv preprint arXiv13126114

[Oord et al 2016] Oord A v d Dieleman S Zen H Simonyan K Vinyals OGraves A Kalchbrenner N Senior A and Kavukcuoglu K (2016)Wavenet A generative model for raw audioarXiv preprint arXiv160903499

3336

References III

[Roche et al 2018] Roche F Hueber T Limier S and Girin L (2018)Autoencoders for music sound modeling a comparison of linear shallow deeprecurrent and variational modelsarXiv preprint arXiv180604096

[Roebel and Rodet 2005] Roebel A and Rodet X (2005)Efficient Spectral Envelope Estimation and its application to pitch shifting andenvelope preservationIn International Conference on Digital Audio Effects pages 30ndash35 Madrid Spaincote interne IRCAM Roebel05b

[Romani Picas et al 2015] Romani Picas O Parra Rodriguez H Dabiri DTokuda H Hariya W Oishi K and Serra X (2015)A real-time system for measuring sound goodness in instrumental soundsIn Audio Engineering Society Convention 138 Audio Engineering Society

[Sarroff and Casey 2014] Sarroff A M and Casey M A (2014)Musical audio synthesis using autoencoding neural netsIn ICMC

[Serra et al 1997] Serra X et al (1997)Musical sound modeling with sinusoids plus noiseMusical signal processing pages 91ndash122

3436

References IV

[Slawson 1981] Slawson W (1981)The color of sound a theoretical study in musical timbreMusic Theory Spectrum 3132ndash141

[Sohn et al 2015] Sohn K Lee H and Yan X (2015)Learning structured output representation using deep conditional generative models

In Advances in neural information processing systems pages 3483ndash3491

[Wyse 2018] Wyse L (2018)Real-valued parametric conditioning of an rnn for interactive sound synthesisarXiv preprint arXiv180510808

3536

Audio examples description I

1 Input MIDI 63 to Spectral Model

2 Spectral Model Reconstruction(trained on MIDI63)

3 Spectral Model Reconstruction(not trained on MIDI63)

4 Input MIDI 60 note to Parametric Model

5 Parametric Reconstruction of input note

6 Input MIDI 63 Note

7 Parametric AE reconstruction of input

8 Parametric CVAE reconstruction of input

9 Input MIDI 65 note(endpoint trained model)

10 Parametric AE reconstruction of input(endpoint trained model)

11 Parametric CVAE reconstruction of input(endpoint trained model)

12 CVAE Generated MIDI 65 Violin note

13 Similar MIDI 65 Violin note from dataset

3636

Audio examples description II

14 CVAE Reconstruction of the MIDI 65 violin note

15 CVAE Generated MIDI 65 Violin note with vibrato

  • Notes
      1. fdrm17
      2. fdrm16
      3. fdrm15
      4. fdrm14
      5. fdrm13
      6. fdrm12
      7. fdrm11
      8. fdrm10
      9. fdrm9
      10. fdrm8
      11. fdrm7
      12. fdrm6
      13. fdrm5
      14. fdrm3
      15. fdrm1
Page 10: A Variational Parametric Model for Audio Synthesiskrishnasubramani/data/ddp_stage… · analysis-synthesis based reconstruction which fails to model temporal evolution I[Engel et

536

Our Nearest Neighbours

I Major issue with previous methods was frame-wise analysis-synthesis based reconstruction, which fails to model temporal evolution

I [Engel et al 2017], inspired by WaveNet's [Oord et al 2016] autoregressive modelling capabilities for speech, extended it to musical instrument synthesis

✓ Generation of realistic and creative sounds
✗ Complex network which was difficult to train and sample from

I [Wyse 2018] also autoregressively modelled the audio, albeit by conditioning the waveform samples on additional parameters like pitch, velocity (loudness) and instrument class

✓ Was able to achieve better control over generation by conditioning

✗ Unable to generalize to untrained pitches

636

Why Parametric

I Rather than generating new timbres ('interpolating' across sounds), we consider the problem of synthesis of a given instrument sound with flexible control over the pitch and loudness dynamics

I Pitch shifting without timbre modification uses a source-filter model, with the filter (spectral envelope) being kept constant [Roebel and Rodet 2005]

I A powerful parametric representation over raw waveform or spectrogram has the potential to achieve high quality with less training data

1 [Blaauw and Bonada 2016] recognized this in the context of speech synthesis and used a vocoder representation to train a generative model, achieving promising results along the way

736

Dataset

I Good-sounds dataset [Romani Picas et al 2015], consisting of individual note and scale recordings for 12 different instruments

I We work with the 'Violin' played in mezzo-forte loudness, and choose the 4th octave (MIDI 60-71)

Why we chose Violin: popular in Indian Music, human voice-like timbre, ability to produce continuous pitch

Figure: Instances per note in the overall dataset (y-axis: MIDI 60-71, x-axis: instances; roughly 26-34 instances per note)

I The average note duration for the chosen octave is about 4.5 s per note

836

Dataset

I We split the data into train (80%) and test (20%) instances across MIDI note labels

I Our model is trained with frames (duration 21.3 ms) from the train instances, and system performance is evaluated with frames from the test instances
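A minimal sketch of such a per-label split; the (midi_label, filename) pairing and the random seeding are assumptions for illustration, not the actual pipeline:

import random
from collections import defaultdict

def split_by_midi(instances, train_frac=0.8, seed=0):
    """Split (midi_label, filename) pairs into train/test per MIDI label."""
    random.seed(seed)
    by_label = defaultdict(list)
    for midi, fname in instances:
        by_label[midi].append(fname)
    train, test = [], []
    for midi, files in sorted(by_label.items()):
        random.shuffle(files)
        k = int(train_frac * len(files))
        train += [(midi, f) for f in files[:k]]
        test += [(midi, f) for f in files[k:]]
    return train, test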

936

Non-Parametric Reconstruction

I The frame-wise magnitude spectra reconstruction procedure in [Roche et al 2018] cannot generalize to untrained pitches

I We consider two cases - train including and excluding MIDI 63, along with its 3 neighbouring pitches on either side

Figure: Sustain portion → FFT → Encoder → Z → Decoder (AE)

I Repeat the above across all input spectral frames, and invert the obtained spectrogram using Griffin-Lim [Griffin and Lim 1984]
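A minimal sketch of this analysis-synthesis loop, assuming a trained frame-wise autoencoder `ae` (a hypothetical callable mapping one magnitude frame to its reconstruction) and an assumed FFT size of 1024; Griffin-Lim is taken from librosa:

import numpy as np
import librosa

def nonparametric_reconstruct(y, ae, n_fft=1024, hop=256):
    """Frame-wise magnitude autoencoding followed by Griffin-Lim inversion (sketch)."""
    S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))   # (1 + n_fft/2, n_frames)
    S_hat = np.stack([ae(frame) for frame in S.T], axis=1)     # reconstruct frame by frame
    return librosa.griffinlim(S_hat, n_iter=60, n_fft=n_fft, hop_length=hop)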

1036

Non-Parametric Reconstruction

Figure: Input MIDI 63 (audio example 1)

Figure: Reconstruction including MIDI 63 in training (audio example 2); Figure: Reconstruction excluding MIDI 63 (audio example 3)


1136

Parametric Model

1 Frame-wise magnitude spectrum → harmonic representation using the Harmonic plus Residual (HpR) model [Serra et al 1997] (currently we neglect the residual)

Figure: Sustain portion → FFT → HpR → (f0, harmonic magnitudes)

I Output of the HpR block ⇒ log-dB magnitudes + harmonics
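The full HpR analysis (peak picking, harmonic tracking, residual subtraction) is available in reference implementations such as sms-tools; the simplified sketch below only illustrates the idea of reading log-dB magnitudes near integer multiples of f0 and is not the actual model used:

import numpy as np

def harmonic_log_magnitudes(mag_frame, f0, fs, n_fft=1024, n_harmonics=40):
    """Simplified stand-in for HpR: sample the magnitude frame at multiples of f0."""
    bins = np.round(np.arange(1, n_harmonics + 1) * f0 * n_fft / fs).astype(int)
    bins = bins[bins < len(mag_frame)]              # discard harmonics above Nyquist
    return 20 * np.log10(mag_frame[bins] + 1e-12)   # log-dB harmonic magnitudes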

1236

Parametric Model

2 log-dB magnitudes + harmonics → TAE algorithm [Roebel and Rodet 2005, IMAI 1979]

I TAE ⇒ iterative cepstral liftering:

A_0(k) = log|X(k)|, V_0 = −∞
while (A_i − A_0 < Δ):
    A_i(k) = max(A_{i−1}(k), V_{i−1}(k))
    V_i = FFT(lifter(A_i))

Figure: (f0, harmonic magnitudes) → TAE → (CCs, f0)
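A rough numpy sketch of this iterative cepstral liftering; the liftering window and the exact stopping rule are assumptions, since the slide only gives the update equations:

import numpy as np

def true_amplitude_envelope(half_spectrum, n_ccs, max_iter=100, delta=0.1):
    """TAE sketch: grow a cepstrally-smoothed envelope until it covers the spectral peaks."""
    A = np.log(np.abs(half_spectrum) + 1e-12)    # A_0(k) = log|X(k)|
    A0 = A.copy()
    V = np.full_like(A, -np.inf)                 # V_0 = -inf
    ceps = np.zeros(2 * (len(A) - 1))
    for _ in range(max_iter):
        A = np.maximum(A, V)                     # A_i(k) = max(A_{i-1}(k), V_{i-1}(k))
        ceps = np.fft.irfft(A)                   # real cepstrum of the log spectrum
        ceps[n_ccs:-n_ccs] = 0.0                 # lifter: keep the first n_ccs quefrencies
        V = np.fft.rfft(ceps).real               # V_i = FFT(lifter(A_i))
        if np.max(A0 - V) < delta:               # assumed stopping rule: envelope covers peaks
            break
    return V, ceps[:n_ccs]                       # smoothed log envelope and its CCs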

1336

Parametric Model

I No open-source implementation was available for the TAE, so we implemented it following the procedure highlighted in [Roebel and Rodet 2005, Caetano and Rodet 2012]

I The figure shows a TAE snapshot from [Caetano and Rodet 2012]

I Similar to the results we get (audio examples 4, 5)


1436

Parametric Model

Figure: TAE spectral envelopes, magnitude (dB, −80 to −20) vs frequency (0-5 kHz), for MIDI 60, 65 and 71

I Spectral envelope shape varies across pitch:

1 Dependence of the envelope on pitch [Slawson 1981, Caetano and Rodet 2012]

2 Variation due to the TAE algorithm

I TAE → smooth function to estimate harmonic amplitudes

1536

Parametric Model

I The number of CCs (cepstral coefficients, Kcc) depends on the sampling rate (Fs) and the pitch (f0):

Kcc ≤ Fs / (2 · f0)

I For our network we choose the maximum Kcc (lowest pitch) = 91, and zero-pad the higher-pitch CCs to 91-dimensional vectors
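For instance, assuming a 48 kHz sampling rate (consistent with Kcc = 91 at the lowest pitch, MIDI 60 ≈ 261.6 Hz):

import numpy as np

def num_ccs(fs, f0):
    return int(fs // (2 * f0))          # Kcc <= Fs / (2 * f0)

def pad_ccs(ccs, target_dim=91):
    out = np.zeros(target_dim)          # zero-pad higher-pitch CCs to a fixed 91-dim vector
    out[:len(ccs)] = ccs[:target_dim]
    return out

print(num_ccs(48000, 261.63))           # -> 91 (MIDI 60)
print(num_ccs(48000, 493.88))           # -> 48 (MIDI 71)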

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - minimize the MSE (mean squared error) between the input and the network reconstruction

• Simple to train and performs good reconstruction (since they minimize MSE)

• Not truly a generative model, as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] - inspired by variational inference; enforce a prior on the latent space

• Truly generative, as we can 'generate' new data by sampling from the prior

I Conditional Variational Autoencoders [Doersch 2016, Sohn et al 2015] - same principle as a VAE, however learns the conditional distribution over an additional conditioning variable

1736

Generative Models

Why VAE over AE?

Continuous latent space from which we can sample points (and synthesize the corresponding audio)

Why CVAE over VAE?

Conditioning on pitch ⇒ the network captures dependencies between the timbre and the pitch ⇒ more accurate envelope generation + pitch control

1836

Network Architecture

Figure: (CCs, f0) → Encoder → Z → Decoder → CCs (CVAE)

I The network input is CCs → the MSE represents a perceptually relevant distance, in terms of squared error between the input and reconstructed log-magnitude spectral envelopes

I We train the network on frames from the train instances

I For evaluation, the MSE values reported ahead are the average reconstruction error across all the test-instance frames

1936

Network Architecture

I Main hyperparameters:

1 β - controls the relative weighting between reconstruction and prior enforcement:

L ∝ MSE + β · KLD

Figure: MSE (×10^-5) vs latent space dimensionality, for β = 0.01, 0.1, 1 (CVAE, varying β)

I Trade-off between both terms; choose β = 0.1
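A sketch of this weighted objective in PyTorch, assuming the usual diagonal-Gaussian encoder output (mu, logvar) and a standard normal prior:

import torch
import torch.nn.functional as F

def cvae_loss(x, x_recon, mu, logvar, beta=0.1):
    """L ∝ MSE + β·KLD, with β = 0.1 (the value chosen above)."""
    mse = F.mse_loss(x_recon, x, reduction='mean')
    kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())   # KL(q(z|x, f0) || N(0, I))
    return mse + beta * kld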

2036

Network Architecture

I Main hyperparameters -

2 Dimensionality of the latent space - governs the network's reconstruction ability

Figure: MSE (×10^-5) vs latent space dimensionality, CVAE (β = 0.1) vs AE

I Steep fall initially, flatter later; choose dimensionality = 32

2136

Network Architecture

I Network size: [91, 91, 32, 91, 91]

I 91 is the dimension of the input CCs; 32 is the latent space dimensionality

I Fully connected linear layers + Leaky ReLU activations

I ADAM [Kingma and Ba 2014] with initial lr = 10^-3

I Training for 2000 epochs with batch size 512
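A sketch of such a network in PyTorch; the exact way f0 is fed to the encoder and decoder (here a normalized scalar appended to the inputs) is an assumption:

import torch
import torch.nn as nn

class CVAE(nn.Module):
    """[91, 91, 32, 91, 91] fully connected CVAE with Leaky ReLU activations (sketch)."""
    def __init__(self, in_dim=91, hid_dim=91, z_dim=32, cond_dim=1):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim + cond_dim, hid_dim), nn.LeakyReLU())
        self.mu, self.logvar = nn.Linear(hid_dim, z_dim), nn.Linear(hid_dim, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim + cond_dim, hid_dim), nn.LeakyReLU(),
                                 nn.Linear(hid_dim, in_dim))

    def forward(self, ccs, f0):
        h = self.enc(torch.cat([ccs, f0], dim=-1))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()     # reparameterization trick
        return self.dec(torch.cat([z, f0], dim=-1)), mu, logvar

model = CVAE()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)        # ADAM, initial lr = 10^-3
# train for 2000 epochs with batch size 512, minimizing the cvae_loss sketched earlier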

2236

Experiments

I Two kinds of experiments to demonstrate the network's capabilities:

1 Reconstruction - omit pitch instances during training and see how well the model reconstructs notes of the omitted target pitch

2 Generation - how well the model 'synthesizes' note instances with new, unseen pitches

2336

Reconstruction

I Two training contexts (summarized in the sketch after this list) -

1 T (×) is the target; ✓ marks kept training notes:

MIDI:  T−3  T−2  T−1  T  T+1  T+2  T+3
Kept:   ✓    ✓    ✓   ×   ✓    ✓    ✓

2 Octave endpoints:

MIDI:  60 61 62 63 64 65 66 67 68 69 70 71
Kept:   ✓  ×  ×  ×  ×  ×  ×  ×  ×  ×  ×  ✓

I In each of the above cases we compute the MSE as the frame-wise spectral envelope match across all frames of all the target instances
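The two contexts can be summarized as below; set membership outside the tables shown is an assumption:

def training_midis(context, target=63):
    if context == "neighbours":   # context 1: leave out the target, keep its +/- 3 neighbours
        return [m for m in range(target - 3, target + 4) if m != target]
    if context == "endpoints":    # context 2: keep only the octave endpoints
        return [60, 71]
    raise ValueError(context)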

2436

Reconstruction

Figure: MSE (×10^-2) vs pitch (MIDI 60-72), AE vs CVAE

I Both AE and CVAE reasonably reconstruct the target pitch (audio examples 6, 7, 8)

I CVAE produces better reconstruction, especially when the target pitch is far from the pitches available in the training data (plot above) (audio examples 9, 10, 11)

I Conditioning helps to capture the pitch dependency of the spectral envelope more accurately


2536

Reconstruction

I To emulate the effect of pitch conditioning with an AE, we train the AE by appending the pitch to the input CCs and reconstructing this appended input, as shown below

Figure: 'Conditional' AE (cAE): (f0, CCs) → cAE → (f0′, CCs′)

I [Wyse 2018] followed a similar approach of appending the conditional variables to the input of his model

I The 'cAE' is comparable to our proposed CVAE in that the network might potentially learn something from f0
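A sketch of this baseline, assuming the same layer sizes as the CVAE sketched earlier; the pitch is simply concatenated to the CC vector and reconstructed along with it:

import torch
import torch.nn as nn

class ConditionalAE(nn.Module):
    """'Conditional' AE (cAE) sketch: reconstructs the appended [f0, CCs] vector."""
    def __init__(self, in_dim=1 + 91, hid_dim=91, z_dim=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.LeakyReLU(),
                                 nn.Linear(hid_dim, z_dim))
        self.dec = nn.Sequential(nn.Linear(z_dim, hid_dim), nn.LeakyReLU(),
                                 nn.Linear(hid_dim, in_dim))

    def forward(self, ccs, f0):
        x = torch.cat([f0, ccs], dim=-1)   # append pitch to the input CCs
        return self.dec(self.enc(x))       # network outputs [f0', CCs']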

2636

Reconstruction

Figure: MSE (×10^-2) vs pitch (MIDI 60-72), AE vs CVAE vs cAE

I We train the cAE on the octave endpoints and evaluate performance as shown above

I Appending f0 does not improve performance; it is in fact worse than the AE

2736

Generation

I We are interested in 'synthesizing' audio

I We see how well the network can generate an instance of a desired pitch (which the network has not been trained on)

Figure: Sampling from the network: (Z, f0) → Decoder

I We follow the procedure in [Blaauw and Bonada 2016] - a random walk to sample points coherently from the latent space
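A sketch of such coherent sampling, assuming a trained decoder such as model.dec from the earlier sketch; the step size of the walk is an assumption:

import torch

@torch.no_grad()
def random_walk_generate(decoder, f0, n_frames=200, z_dim=32, step=0.05):
    """Sample latent points along a small-step Gaussian random walk and decode
    one CC frame per step, all conditioned on the desired pitch f0."""
    z = torch.randn(z_dim)
    f0 = torch.as_tensor([float(f0)])
    frames = []
    for _ in range(n_frames):
        z = z + step * torch.randn(z_dim)                     # coherent drift through latent space
        frames.append(decoder(torch.cat([z, f0], dim=-1)))    # decode CCs for this frame
    return torch.stack(frames)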

2836

Generation

I We train on instances across the entire octave sans MIDI 65, and then generate MIDI 65

I On listening, the generated audio still lacks the soft, noisy sound of the violin bowing (audio examples 12, 13, 14)

I For a more realistic synthesis, we generate a violin note with vibrato (audio example 15)


2936

Putting it all together

I We explored autoencoder frameworks in generative models for audio synthesis of instrumental tones

I Our parametric representation decouples 'timbre' and 'pitch', thus relying on the network to model the inter-dependencies

I Pitch conditioning allows us to generate the learnt spectral envelope for that pitch, thus enabling us to vary the pitch contour continuously

3036

Putting it all together

I Our model is still far from perfect, however, and we have to improve it further:

1 An immediate next step is to model the residual of the input audio. This will help make the generated audio sound more natural.

2 We have not taken dynamics into account. A more complete system will involve capturing spectral envelope (timbral) dependencies on both pitch and loudness dynamics.

3 Conducting more formal listening tests, involving synthesis of larger pitch movements or melodic elements from Indian Classical music.

I Our contributions:

1 To the best of our knowledge, we have not come across any work using a parametric model for musical tones in the neural synthesis framework, especially exploiting the conditioning function of the CVAE.

2 The TAE method has no known Python implementation, so we plan to make our code open-source for the MIR community to use in research.

3136

References I

[Blaauw and Bonada 2016] Blaauw, M. and Bonada, J. (2016). Modeling and transforming speech using variational autoencoders. In Interspeech, pages 1770-1774.

[Caetano and Rodet 2012] Caetano, M. and Rodet, X. (2012). A source-filter model for musical instrument sound transformation. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 137-140. IEEE.

[Doersch 2016] Doersch, C. (2016). Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.

[Engel et al 2017] Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D., and Simonyan, K. (2017). Neural audio synthesis of musical notes with WaveNet autoencoders. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1068-1077. JMLR.org.

[Esling et al 2018] Esling, P., Bitton, A., et al. (2018). Generative timbre spaces: regularizing variational auto-encoders with perceptual metrics. arXiv preprint arXiv:1805.08501.

3236

References II

[Griffin and Lim 1984] Griffin, D. and Lim, J. (1984). Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236-243.

[Hinton and Salakhutdinov 2006] Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504-507.

[IMAI 1979] Imai, S. (1979). Spectral envelope extraction by improved cepstrum. IEICE, 62:217-228.

[Kingma and Ba 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

[Kingma and Welling 2013] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

[Oord et al 2016] Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.

3336

References III

[Roche et al 2018] Roche, F., Hueber, T., Limier, S., and Girin, L. (2018). Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models. arXiv preprint arXiv:1806.04096.

[Roebel and Rodet 2005] Roebel, A. and Rodet, X. (2005). Efficient spectral envelope estimation and its application to pitch shifting and envelope preservation. In International Conference on Digital Audio Effects, pages 30-35, Madrid, Spain.

[Romani Picas et al 2015] Romani Picas, O., Parra Rodriguez, H., Dabiri, D., Tokuda, H., Hariya, W., Oishi, K., and Serra, X. (2015). A real-time system for measuring sound goodness in instrumental sounds. In Audio Engineering Society Convention 138. Audio Engineering Society.

[Sarroff and Casey 2014] Sarroff, A. M. and Casey, M. A. (2014). Musical audio synthesis using autoencoding neural nets. In ICMC.

[Serra et al 1997] Serra, X. et al. (1997). Musical sound modeling with sinusoids plus noise. Musical Signal Processing, pages 91-122.

3436

References IV

[Slawson 1981] Slawson, W. (1981). The color of sound: a theoretical study in musical timbre. Music Theory Spectrum, 3:132-141.

[Sohn et al 2015] Sohn, K., Lee, H., and Yan, X. (2015). Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483-3491.

[Wyse 2018] Wyse, L. (2018). Real-valued parametric conditioning of an RNN for interactive sound synthesis. arXiv preprint arXiv:1805.10808.

3536

Audio examples description I

1 Input MIDI 63 to Spectral Model

2 Spectral Model Reconstruction (trained on MIDI 63)

3 Spectral Model Reconstruction (not trained on MIDI 63)

4 Input MIDI 60 note to Parametric Model

5 Parametric Reconstruction of input note

6 Input MIDI 63 Note

7 Parametric AE reconstruction of input

8 Parametric CVAE reconstruction of input

9 Input MIDI 65 note (endpoint-trained model)

10 Parametric AE reconstruction of input (endpoint-trained model)

11 Parametric CVAE reconstruction of input (endpoint-trained model)

12 CVAE Generated MIDI 65 Violin note

13 Similar MIDI 65 Violin note from dataset

3636

Audio examples description II

14 CVAE Reconstruction of the MIDI 65 violin note

15 CVAE Generated MIDI 65 Violin note with vibrato

Page 11: A Variational Parametric Model for Audio Synthesiskrishnasubramani/data/ddp_stage… · analysis-synthesis based reconstruction which fails to model temporal evolution I[Engel et

436

Our Nearest Neighbours

I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra

X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis

times lsquoGraininessrsquo in the reconstructed sound

I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures

X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation

I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments

X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space

436

Our Nearest Neighbours

I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra

X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis

times lsquoGraininessrsquo in the reconstructed sound

I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures

X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation

I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments

X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space

436

Our Nearest Neighbours

I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra

X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis

times lsquoGraininessrsquo in the reconstructed sound

I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures

X Helped by the release of NSynth [Engel et al 2017]

X Usability of the latent space in audio interpolation

I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments

X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space

436

Our Nearest Neighbours

I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra

X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis

times lsquoGraininessrsquo in the reconstructed sound

I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures

X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation

I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments

X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space

436

Our Nearest Neighbours

I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra

X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis

times lsquoGraininessrsquo in the reconstructed sound

I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures

X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation

I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments

X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space

436

Our Nearest Neighbours

I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra

X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis

times lsquoGraininessrsquo in the reconstructed sound

I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures

X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation

I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments

X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space

536

Our Nearest Neighbours

I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution

I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis

X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from

I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class

X Was able to achieve better control over generation byconditioning

times Unable to generalize to untrained pitches

636

Why Parametric

I Rather than generating new timbres ('interpolating' across sounds), we consider the problem of synthesis of a given instrument sound with flexible control over the pitch and loudness dynamics

I Pitch shifting without timbre modification uses a source-filter model, with the filter (spectral envelope) being kept constant [Roebel and Rodet 2005]

I A powerful parametric representation over raw waveform or spectrogram has the potential to achieve high quality with less training data

1 [Blaauw and Bonada 2016] recognized this in the context of speech synthesis and used a vocoder representation to train a generative model, achieving promising results along the way

736

Dataset

I Good-sounds dataset [Romani Picas et al 2015], consisting of individual note and scale recordings for 12 different instruments

I We work with the 'Violin', played in mezzo-forte loudness, and choose the 4th octave (MIDI 60-71)

Why we chose the Violin: popular in Indian music, human voice-like timbre, ability to produce continuous pitch

Figure: Instances per note in the overall dataset (MIDI 60-71, roughly 26-34 instances per note)

I Average note duration for the chosen octave is about 4.5 s per note

836

Dataset

I We split the data into train (80%) and test (20%) instances across MIDI note labels

I Our model is trained with frames (duration 21.3 ms) from the train instances, and system performance is evaluated with frames from the test instances

936

Non-Parametric Reconstruction

I Framewise magnitude spectra reconstruction procedure in [Roche et al 2018] cannot generalize to untrained pitches

I We consider two cases - train including and excluding MIDI 63, along with its 3 neighbouring pitches on either side

Block diagram: Sustain → FFT → Encoder → Z → Decoder (AE)

I Repeat the above across all input spectral frames and invert the obtained spectrogram using Griffin-Lim [Griffin and Lim 1984]
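I A minimal sketch of this baseline's resynthesis path is given below, assuming librosa for the STFT and Griffin-Lim; the trained frame-wise autoencoder `ae` and the FFT/hop sizes are placeholders, not the exact settings used here:

    import numpy as np
    import librosa

    def baseline_resynthesis(y, ae, n_fft=1024, hop=256):
        # Magnitude spectrogram of the sustain portion (phase is discarded)
        S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))
        # Pass each spectral frame through the trained frame-wise autoencoder
        S_hat = np.stack([ae.reconstruct(frame) for frame in S.T], axis=1)
        # Invert the magnitude-only spectrogram with Griffin-Lim phase estimation
        return librosa.griffinlim(S_hat, n_iter=60, hop_length=hop, win_length=n_fft)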

1036

Non-Parametric Reconstruction

Figure: Input MIDI 63 (audio example 1)

Figure: Including MIDI 63 (audio example 2)    Figure: Excluding MIDI 63 (audio example 3)

1136

Parametric Model

1 Frame-wise magnitude spectrum → harmonic representation using the Harmonic plus Residual (HpR) model [Serra et al 1997] (currently we neglect the residual)

Block diagram: Sustain → FFT → HpR → (f0, harmonic magnitudes)

I Output of HpR block ⇒ log-dB magnitudes + harmonics
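I A simplified sketch of this harmonic analysis step (residual neglected): sample the frame's log-dB magnitude at integer multiples of f0. A full HpR/SMS implementation would use peak detection and parabolic interpolation; the 48 kHz sampling rate and FFT size are assumptions.

    import numpy as np

    def harmonic_log_magnitudes(mag_frame, f0, fs=48000, n_fft=1024):
        # Harmonics of f0 that lie below Nyquist
        n_harm = int((fs / 2) // f0)
        bins = np.round(np.arange(1, n_harm + 1) * f0 * n_fft / fs).astype(int)
        # log-dB magnitudes at the harmonic bins (eps avoids log of zero)
        return 20.0 * np.log10(mag_frame[bins] + 1e-12)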

1236

Parametric Model

2 log-dB magnitudes + harmonics → TAE algorithm [Roebel and Rodet 2005, IMAI 1979]

I TAE ⇒ Iterative Cepstral Liftering

    A_0(k) = log(|X(k)|),  V_0 = −∞
    while (A_i − A_0 < Δ):
        A_i(k) = max(A_{i−1}(k), V_{i−1}(k))
        V_i = FFT(lifter(A_i))

Block diagram: (f0, magnitudes) → TAE → (CCs, f0)
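I A NumPy sketch of the iteration above; this is our reading of the TAE loop, not a drop-in for [Roebel and Rodet 2005]. `logmag` is the log-dB spectrum (harmonic amplitudes interpolated onto the FFT grid), and the convergence test used here (the envelope covers the spectrum to within Δ) is one common choice of stopping rule.

    import numpy as np

    def true_amplitude_envelope(logmag, Kcc, delta=0.1, max_iter=200):
        A = logmag.copy()                          # A_0(k) = log|X(k)|
        V = np.full_like(A, -np.inf)               # V_0 = -infinity
        n = 2 * (len(A) - 1)                       # length of the symmetric full spectrum
        cep = np.zeros(n)
        for _ in range(max_iter):
            A = np.maximum(A, V)                   # A_i(k) = max(A_{i-1}(k), V_{i-1}(k))
            cep = np.fft.irfft(A, n=n)             # real cepstrum of the log spectrum
            cep[Kcc + 1:n - Kcc] = 0.0             # lifter: keep the first Kcc quefrencies
            V = np.fft.rfft(cep).real              # V_i = FFT(lifter(A_i)), the smoothed envelope
            if np.max(logmag - V) < delta:         # stop once V covers the spectrum within delta
                break
        return cep[:Kcc + 1]                       # cepstral coefficients (CCs) fed to the network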

1336

Parametric Model

I No open source implementation available for the TAE, thus we implemented it following the procedure highlighted in [Roebel and Rodet 2005, Caetano and Rodet 2012]

I Figure shows a TAE snapshot from [Caetano and Rodet 2012]

I Similar to the results we get (audio examples 4, 5)

1436

Parametric Model

Figure: TAE spectral envelopes (Magnitude in dB vs Frequency in kHz) for MIDI 60, 65 and 71

I Spectral envelope shape varies across pitch

1 Dependence of envelope on pitch [Slawson 1981, Caetano and Rodet 2012]

2 Variation due to the TAE algorithm

I TAE → smooth function to estimate harmonic amplitudes

1536

Parametric Model

I No. of CCs (Cepstral Coefficients, Kcc) depends on sampling rate (Fs) and pitch (f0):

    Kcc ≤ Fs / (2 f0)

I For our network we choose the maximum Kcc (lowest pitch) = 91, and zero pad the higher-pitch CC vectors to 91-dimensional vectors
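I For concreteness, the bound works out as below, assuming Fs = 48 kHz (consistent with the 21.3 ms analysis frames); for MIDI 60 (f0 ≈ 261.6 Hz) it gives the 91 coefficients quoted above, and higher pitches are zero padded to the same length:

    import numpy as np

    def num_ccs(f0, fs=48000):
        # Kcc <= Fs / (2 * f0): lower pitches admit more cepstral coefficients
        return int(fs // (2 * f0))

    def pad_ccs(ccs, target=91):
        out = np.zeros(target)
        out[:len(ccs)] = ccs              # zero pad higher-pitch CC vectors to 91 dimensions
        return out

    print(num_ccs(261.63))                # MIDI 60 -> 91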

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimize the MSE (Mean Squared Error) between input and network reconstruction

• Simple to train and performs good reconstruction (since they minimize MSE)

• Not truly a generative model, as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] - Inspired by Variational Inference, enforce a prior on the latent space

• Truly generative, as we can 'generate' new data by sampling from the prior

I Conditional Variational Autoencoders [Doersch 2016, Sohn et al 2015] - Same principle as a VAE, however learns the conditional distribution over an additional conditioning variable

1736

Generative Models

Why VAE over AE?

Continuous latent space from which we can sample points (and synthesize the corresponding audio)

Why CVAE over VAE?

Conditioning on pitch ⇒ network captures dependencies between the timbre and the pitch ⇒ more accurate envelope generation + pitch control

1836

Network Architecture

Block diagram: (CCs, f0) → Encoder → Z → Decoder → CCs (CVAE)

I Network input is CCs → MSE represents a perceptually relevant distance, in terms of squared error between the input and reconstructed log magnitude spectral envelopes

I We train the network on frames from the train instances

I For evaluation, the MSE values reported ahead are the average reconstruction error across all the test instance frames

1936

Network Architecture

I Main hyperparameters -

1 β - Controls relative weighting between reconstruction and prior enforcement:

    L ∝ MSE + β·KLD

Figure: CVAE, varying β - MSE (of order 10⁻⁵) vs latent space dimensionality, for β = 0.01, 0.1, 1

I Tradeoff between both terms; choose β = 0.1
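I A sketch of this weighted objective in PyTorch (variable names are illustrative; the exact reduction and weighting used in our experiments may differ):

    import torch
    import torch.nn.functional as F

    def cvae_loss(x, x_hat, mu, logvar, beta=0.1):
        mse = F.mse_loss(x_hat, x)                                        # reconstruction term
        kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())    # KL(q(z|x, f0) || N(0, I))
        return mse + beta * kld                                           # L ∝ MSE + β·KLD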

2036

Network Architecture

I Main hyperparameters -

2 Dimensionality of latent space - network's reconstruction ability

Figure: CVAE (β = 0.1) vs AE - MSE (of order 10⁻⁵) vs latent space dimensionality

I Steep fall initially, flatter later. Choose dimensionality = 32

2136

Network Architecture

I Network size: [91, 91, 32, 91, 91]

I 91 is the dimension of the input CCs, 32 is the latent space dimensionality

I Linear Fully Connected Layers + Leaky ReLU activations

I ADAM [Kingma and Ba 2014] with initial lr = 10⁻³

I Training for 2000 epochs with batch size 512
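I A PyTorch sketch of a CVAE with the layer sizes above; the way the pitch condition is fed in (here, concatenated as a scalar) and the exact split into mean/variance heads are assumptions.

    import torch
    import torch.nn as nn

    class CVAE(nn.Module):
        def __init__(self, n_cc=91, latent=32, n_cond=1):
            super().__init__()
            self.enc = nn.Sequential(nn.Linear(n_cc + n_cond, 91), nn.LeakyReLU())
            self.mu = nn.Linear(91, latent)
            self.logvar = nn.Linear(91, latent)
            self.dec = nn.Sequential(nn.Linear(latent + n_cond, 91), nn.LeakyReLU(),
                                     nn.Linear(91, n_cc))

        def forward(self, ccs, f0):
            h = self.enc(torch.cat([ccs, f0], dim=-1))
            mu, logvar = self.mu(h), self.logvar(h)
            z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization trick
            return self.dec(torch.cat([z, f0], dim=-1)), mu, logvar

    model = CVAE()
    optim = torch.optim.Adam(model.parameters(), lr=1e-3)          # ADAM, initial lr = 1e-3
    # training loop: 2000 epochs over train-instance frames, batch size 512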

2236

Experiments

I Two kinds of experiments to demonstrate the network's capabilities

1 Reconstruction - Omit pitch instances during training and see how well the model reconstructs notes of the omitted target pitch

2 Generation - How well the model 'synthesizes' note instances with new, unseen pitches

2336

Reconstruction

I Two training contexts -

1 T (✗) is the target, ✓ marks training instances:

    MIDI:  T-3  T-2  T-1   T   T+1  T+2  T+3
    Kept:   ✓    ✓    ✓    ✗    ✓    ✓    ✓

2 Octave endpoints:

    MIDI:  60  61  62  63  64  65  66  67  68  69  70  71
    Kept:   ✓   ✗   ✗   ✗   ✗   ✗   ✗   ✗   ✗   ✗   ✗   ✓

I In each of the above cases, we compute the MSE as the frame-wise spectral envelope match across all frames of all the target instances

2436

Reconstruction

Figure: MSE (of order 10⁻²) vs target pitch (MIDI 60-72) for AE and CVAE

I Both AE and CVAE reasonably reconstruct the target pitch (audio examples 6, 7, 8)

I CVAE produces better reconstruction, especially when the target pitch is far from the pitches available in the training data (plot above) (audio examples 9, 10, 11)

I Conditioning helps to capture the pitch dependency of the spectral envelope more accurately

2536

Reconstruction

I To emulate the effect of pitch conditioning with an AE, we train the AE by appending the pitch to the input CCs and reconstructing this appended input, as shown below

Block diagram: (CCs, f0) → cAE → (CCs', f0')

Figure: 'Conditional' AE (cAE)

I [Wyse 2018] followed a similar approach of appending the conditional variables to the input of his model

I The 'cAE' is comparable to our proposed CVAE in that the network might potentially learn something from f0
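I A sketch of this cAE baseline (hidden sizes mirror the CVAE sketch above and are assumptions): the pitch is simply appended to the input and reconstructed along with the CCs, with no prior on the latent code.

    import torch
    import torch.nn as nn

    class ConditionalAE(nn.Module):
        def __init__(self, n_cc=91, latent=32):
            super().__init__()
            d = n_cc + 1                                     # input is [CCs, f0]
            self.enc = nn.Sequential(nn.Linear(d, 91), nn.LeakyReLU(), nn.Linear(91, latent))
            self.dec = nn.Sequential(nn.Linear(latent, 91), nn.LeakyReLU(), nn.Linear(91, d))

        def forward(self, ccs, f0):
            x = torch.cat([ccs, f0], dim=-1)                 # append the conditioning variable
            return self.dec(self.enc(x))                     # reconstruct [CCs', f0']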

2636

Reconstruction

Figure: MSE (of order 10⁻²) vs target pitch (MIDI 60-72) for AE, CVAE and cAE

I We train the cAE on the octave endpoints and evaluate performance as shown above

I Appending f0 does not improve performance; it is in fact worse than the AE

2736

Generation

I We are interested in 'synthesizing' audio

I We see how well the network can generate an instance of a desired pitch (which the network has not been trained on)

Block diagram: (Z, f0) → Decoder → CCs

Figure: Sampling from the Network

I We follow the procedure in [Blaauw and Bonada 2016] - a random walk to sample points coherently from the latent space
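I A sketch of this sampling procedure, reusing the decoder from the CVAE sketch earlier; the step size and frame count are illustrative, and the decoded CC frames would still have to be turned back into spectral envelopes, harmonic amplitudes and finally audio.

    import torch

    def generate_note(model, f0, n_frames=200, latent=32, step=0.05):
        z = torch.zeros(1, latent)                           # start at the prior mean
        cond = torch.full((1, 1), f0)                        # desired (possibly unseen) pitch
        frames = []
        with torch.no_grad():
            for _ in range(n_frames):
                z = z + step * torch.randn_like(z)           # random walk -> coherent latent trajectory
                frames.append(model.dec(torch.cat([z, cond], dim=-1)))
        return torch.cat(frames, dim=0)                      # one CC vector per synthesis frame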

2836

Generation

I We train on instances across the entire octave sans MIDI 65, and then generate MIDI 65

I On listening, we can see that the generated audio still lacks the soft noisy sound of the violin bowing (audio examples 12, 13, 14)

I For a more realistic synthesis, we generate a violin note with vibrato (audio example 15)

2936

Putting it all together

I We explored autoencoder frameworks in generative models for audio synthesis of instrumental tones

I Our parametric representation decouples 'timbre' and 'pitch', thus relying on the network to model the inter-dependencies

I Pitch conditioning allows us to generate the learnt spectral envelope for that pitch, thus enabling us to vary the pitch contour continuously

3036

Putting it all together

I Our model is still far from perfect, however, and we have to improve it further:

1 An immediate thing to do will be to model the residual of the input audio. This will help in making the generated audio sound more natural

2 We have not taken into account dynamics. A more complete system will involve capturing spectral envelope (timbral) dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests, involving synthesis of larger pitch movements or melodic elements from Indian Classical music

I Our contributions

1 To the best of our knowledge, we have not come across any work using a parametric model for musical tones in the neural synthesis framework, especially exploiting the conditioning function of the CVAE

2 The TAE method has no known Python implementation, so we plan to make our code open-source to the MIR community for research

3136

References I

[Blaauw and Bonada 2016] Blaauw, M. and Bonada, J. (2016). Modeling and transforming speech using variational autoencoders. In Interspeech, pages 1770-1774.

[Caetano and Rodet 2012] Caetano, M. and Rodet, X. (2012). A source-filter model for musical instrument sound transformation. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 137-140. IEEE.

[Doersch 2016] Doersch, C. (2016). Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.

[Engel et al 2017] Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D., and Simonyan, K. (2017). Neural audio synthesis of musical notes with WaveNet autoencoders. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1068-1077. JMLR.org.

[Esling et al 2018] Esling, P., Bitton, A., et al. (2018). Generative timbre spaces: regularizing variational auto-encoders with perceptual metrics. arXiv preprint arXiv:1805.08501.

3236

References II

[Griffin and Lim 1984] Griffin, D. and Lim, J. (1984). Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236-243.

[Hinton and Salakhutdinov 2006] Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504-507.

[IMAI 1979] Imai, S. (1979). Spectral envelope extraction by improved cepstrum. IEICE, 62:217-228.

[Kingma and Ba 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

[Kingma and Welling 2013] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

[Oord et al 2016] Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.

3336

References III

[Roche et al 2018] Roche, F., Hueber, T., Limier, S., and Girin, L. (2018). Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models. arXiv preprint arXiv:1806.04096.

[Roebel and Rodet 2005] Roebel, A. and Rodet, X. (2005). Efficient spectral envelope estimation and its application to pitch shifting and envelope preservation. In International Conference on Digital Audio Effects, pages 30-35, Madrid, Spain.

[Romani Picas et al 2015] Romani Picas, O., Parra Rodriguez, H., Dabiri, D., Tokuda, H., Hariya, W., Oishi, K., and Serra, X. (2015). A real-time system for measuring sound goodness in instrumental sounds. In Audio Engineering Society Convention 138. Audio Engineering Society.

[Sarroff and Casey 2014] Sarroff, A. M. and Casey, M. A. (2014). Musical audio synthesis using autoencoding neural nets. In ICMC.

[Serra et al 1997] Serra, X. et al. (1997). Musical sound modeling with sinusoids plus noise. Musical Signal Processing, pages 91-122.

3436

References IV

[Slawson 1981] Slawson, W. (1981). The color of sound: a theoretical study in musical timbre. Music Theory Spectrum, 3:132-141.

[Sohn et al 2015] Sohn, K., Lee, H., and Yan, X. (2015). Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483-3491.

[Wyse 2018] Wyse, L. (2018). Real-valued parametric conditioning of an RNN for interactive sound synthesis. arXiv preprint arXiv:1805.10808.

3536

Audio examples description I

1 Input MIDI 63 to Spectral Model

2 Spectral Model Reconstruction (trained on MIDI 63)

3 Spectral Model Reconstruction (not trained on MIDI 63)

4 Input MIDI 60 note to Parametric Model

5 Parametric Reconstruction of input note

6 Input MIDI 63 Note

7 Parametric AE reconstruction of input

8 Parametric CVAE reconstruction of input

9 Input MIDI 65 note (endpoint trained model)

10 Parametric AE reconstruction of input (endpoint trained model)

11 Parametric CVAE reconstruction of input (endpoint trained model)

12 CVAE Generated MIDI 65 Violin note

13 Similar MIDI 65 Violin note from dataset

3636

Audio examples description II

14 CVAE Reconstruction of the MIDI 65 violin note

15 CVAE Generated MIDI 65 Violin note with vibrato

Page 12: A Variational Parametric Model for Audio Synthesiskrishnasubramani/data/ddp_stage… · analysis-synthesis based reconstruction which fails to model temporal evolution I[Engel et

436

Our Nearest Neighbours

I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra

X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis

times lsquoGraininessrsquo in the reconstructed sound

I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures

X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation

I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments

X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space

436

Our Nearest Neighbours

I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra

X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis

times lsquoGraininessrsquo in the reconstructed sound

I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures

X Helped by the release of NSynth [Engel et al 2017]

X Usability of the latent space in audio interpolation

I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments

X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space

436

Our Nearest Neighbours

I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra

X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis

times lsquoGraininessrsquo in the reconstructed sound

I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures

X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation

I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments

X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space

436

Our Nearest Neighbours

I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra

X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis

times lsquoGraininessrsquo in the reconstructed sound

I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures

X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation

I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments

X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space

436

Our Nearest Neighbours

I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra

X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis

times lsquoGraininessrsquo in the reconstructed sound

I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures

X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation

I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments

X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space

536

Our Nearest Neighbours

I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution

I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis

X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from

I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class

X Was able to achieve better control over generation byconditioning

times Unable to generalize to untrained pitches

536

Our Nearest Neighbours

I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution

I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis

X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from

I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class

X Was able to achieve better control over generation byconditioning

times Unable to generalize to untrained pitches

536

Our Nearest Neighbours

I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution

I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis

X Generation of realistic and creative sounds

times Complex network which was difficult to train and sample from

I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class

X Was able to achieve better control over generation byconditioning

times Unable to generalize to untrained pitches

536

Our Nearest Neighbours

I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution

I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis

X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from

I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class

X Was able to achieve better control over generation byconditioning

times Unable to generalize to untrained pitches

536

Our Nearest Neighbours

I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution

I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis

X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from

I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class

X Was able to achieve better control over generation byconditioning

times Unable to generalize to untrained pitches

536

Our Nearest Neighbours

I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution

I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis

X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from

I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class

X Was able to achieve better control over generation byconditioning

times Unable to generalize to untrained pitches

536

Our Nearest Neighbours

I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution

I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis

X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from

I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class

X Was able to achieve better control over generation byconditioning

times Unable to generalize to untrained pitches

636

Why Parametric

I Rather than generating new timbres(lsquointerpolatingrsquo acrosssounds) we consider the problem of synthesis of a giveninstrument sound with flexible control over the pitch andloudness dynamics

I Pitch shifting without timbre modification uses a source-filtermodel with the filter(spectral envelope) being kept constant[Roebel and Rodet 2005]

I A powerful parametric representation over raw waveform orspectrogram has the potential to achieve high quality with lesstraining data

1 [Blaauw and Bonada 2016] recognized this in context ofspeech synthesis and used a vocoder representation to train agenerative model achieving promising results along the way

636

Why Parametric

I Rather than generating new timbres(lsquointerpolatingrsquo acrosssounds) we consider the problem of synthesis of a giveninstrument sound with flexible control over the pitch andloudness dynamics

I Pitch shifting without timbre modification uses a source-filtermodel with the filter(spectral envelope) being kept constant[Roebel and Rodet 2005]

I A powerful parametric representation over raw waveform orspectrogram has the potential to achieve high quality with lesstraining data

1 [Blaauw and Bonada 2016] recognized this in context ofspeech synthesis and used a vocoder representation to train agenerative model achieving promising results along the way

736

Dataset

I Good-sounds dataset [Romani Picas et al 2015] consisting ofindividual note and scale recordings for 12 different instruments

I We work with the lsquoViolinrsquo played in mezzo-forte loudnessand choose the 4th octave(MIDI 60-71)

0 5 10 15 20 25 30 35

606162636465666768697071

282626262626

343030

262626

Instances

MID

I

Figure Instances per note in the overall dataset

I Average note duration for the chosen octave is about 45s pernote

736

Dataset

I Good-sounds dataset [Romani Picas et al 2015] consisting ofindividual note and scale recordings for 12 different instruments

I We work with the lsquoViolinrsquo played in mezzo-forte loudnessand choose the 4th octave(MIDI 60-71)

Why we chose ViolinPopular in Indian Music Human voice-like timbre

Ability to produce continuous pitch

0 5 10 15 20 25 30 35

606162636465666768697071

282626262626

343030

262626

Instances

MID

I

Figure Instances per note in the overall dataset

I Average note duration for the chosen octave is about 45s pernote

836

Dataset

I We split the data to train(80) and test(20) instances acrossMIDI note labels

I Our model is trained with frames(duration 213ms) from thetrain instances and system performance is evaluated withframes from the test instances

936

Non-Parametric Reconstruction

I Framewise magnitude spectra reconstruction procedure in[Roche et al 2018] cannot generalize to untrained pitches

I We consider two cases - Train including and excluding MIDI 63along with its 3 neighbouring pitches on either side

Sustain

FFT

En

cod

er

Z

Dec

od

er

AE

I Repeat above across all input spectral frames and invertobtained spectrogram using Griffin-Lim [Griffin and Lim 1984]

936

Non-Parametric Reconstruction

I Framewise magnitude spectra reconstruction procedure in[Roche et al 2018] cannot generalize to untrained pitches

I We consider two cases - Train including and excluding MIDI 63along with its 3 neighbouring pitches on either side

Sustain

FFT

En

cod

er

Z

Dec

od

er

AE

I Repeat above across all input spectral frames and invertobtained spectrogram using Griffin-Lim [Griffin and Lim 1984]

936

Non-Parametric Reconstruction

I Framewise magnitude spectra reconstruction procedure in[Roche et al 2018] cannot generalize to untrained pitches

I We consider two cases - Train including and excluding MIDI 63along with its 3 neighbouring pitches on either side

Sustain

FFT

En

cod

er

Z

Dec

od

er

AE

I Repeat above across all input spectral frames and invertobtained spectrogram using Griffin-Lim [Griffin and Lim 1984]

1036

Non-Parametric Reconstruction

Figure Input MIDI 63 1 1

Figure Including MIDI 63 2 2 Figure Excluding MIDI 63 3 3

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton1)ocgs[i]state=false

1728

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton3)ocgs[i]state=false

1728

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton5)ocgs[i]state=false

1728

1136

Parametric Model

1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)

Sustain

FFTHpR

f0Magnitudes

I Output of HpR block =rArr log-dB magnitudes + harmonics

1136

Parametric Model

1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)

Sustain

FFTHpR

f0Magnitudes

I Output of HpR block =rArr log-dB magnitudes + harmonics

1136

Parametric Model

1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)

Sustain

FFTHpR

f0Magnitudes

I Output of HpR block =rArr log-dB magnitudes + harmonics

1236

Parametric Model

2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]

I TAE =rArr Iterative Cepstral Liftering

A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))

Vi = FFT (lifter(Ai ))

f0Magnitudes

TAE

CCs

f0

1236

Parametric Model

2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]

I TAE =rArr Iterative Cepstral Liftering

A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))

Vi = FFT (lifter(Ai ))

f0Magnitudes

TAE

CCs

f0

1236

Parametric Model

2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]

I TAE =rArr Iterative Cepstral Liftering

A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))

Vi = FFT (lifter(Ai ))

f0Magnitudes

TAE

CCs

f0

1336

Parametric Model

I No open source implementation available for the TAE thus weimplemented it following procedure highlighted in[Roebel and Rodet 2005 Caetano and Rodet 2012]

I Figure shows a TAE snap from [Caetano and Rodet 2012]

I Similar to results we get 1 4 2 5

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton6)ocgs[i]state=false

3216

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton7)ocgs[i]state=false

2256

1436

Parametric Model

0 1 2 3 4 5minus80

minus60

minus40

minus20

Frequency(kHz)Magnitude(dB)

606571

I Spectral envelope shape varies across pitch

1 Dependence of envelope on pitch[Slawson 1981 Caetano and Rodet 2012]

2 Variation due the TAE algorithm

I TAE rarr smooth function to estimate harmonic amplitudes

1536

Parametric Model

I No of CCs(Cepstral CoefficientsKcc) depends on Samplingrate(Fs)pitch(f0)

Kcc leFs2f0

I For our network we choose the maximum Kcc(lowest pitch) =91 and zero pad the high pitch Kcc to 91 dimension vectors

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1736

Generative Models

Why VAE over AE

Continuous latent space from which we can sample points(andsynthesize the corresponding audio)

Why CVAE over VAE

Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control

1836

Network Architecture

[Diagram: input CCs and f0 → Encoder → latent Z → Decoder, forming the CVAE]

I Network input is CCs → the MSE represents a perceptually relevant distance, i.e. the squared error between the input and reconstructed log magnitude spectral envelopes (see the sketch below)

I We train the network on frames from the train instances

I For evaluation, the MSE values reported ahead are the average reconstruction error across all the test instance frames
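As a rough illustration of why squared error on the CCs tracks the distance between log-magnitude envelopes, a small NumPy sketch, assuming the CCs are real cepstral coefficients of the TAE envelope; the function names and the even-cosine reconstruction are illustrative assumptions.

```python
import numpy as np

def envelope_from_ccs(ccs, n_bins=513):
    """Log-magnitude envelope on n_bins frequency points from cepstral
    coefficients c_0..c_{K-1}, using V(w) = c0 + 2 * sum_k c_k * cos(k*w)."""
    ccs = np.asarray(ccs, dtype=float)
    w = np.linspace(0.0, np.pi, n_bins)          # normalized frequency grid
    k = np.arange(1, len(ccs))
    return ccs[0] + 2.0 * np.cos(np.outer(w, k)) @ ccs[1:]

def envelope_mse(ccs_in, ccs_out):
    """Squared-error distance between the input and reconstructed envelopes."""
    v_in, v_out = envelope_from_ccs(ccs_in), envelope_from_ccs(ccs_out)
    return float(np.mean((v_in - v_out) ** 2))
```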

1936

Network Architecture

I Main hyperparameters -

1 β - Controls the relative weighting between reconstruction and prior enforcement:

L ∝ MSE + β · KLD

[Figure: average MSE (×10⁻⁵) vs. latent space dimensionality for β = 0.01, 0.1 and 1]

Figure: CVAE, varying β

I Tradeoff between both terms; choose β = 0.1 (a loss sketch follows below)
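A minimal sketch of this loss in PyTorch, assuming a Gaussian posterior with encoder outputs (mu, logvar); the variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def cvae_loss(x_hat, x, mu, logvar, beta=0.1):
    # Reconstruction term: MSE between input and reconstructed CCs
    mse = F.mse_loss(x_hat, x, reduction="mean")
    # KL divergence of q(z | x, f0) = N(mu, sigma^2) from the prior N(0, I)
    kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return mse + beta * kld   # beta trades reconstruction off against the prior
```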

2036

Network Architecture

I Main hyperparameters -

2 Dimensionality of latent space - controls the network's reconstruction ability

[Figure: average MSE (×10⁻⁵) vs. latent space dimensionality for the CVAE (β = 0.1) and the AE]

Figure: CVAE (β = 0.1) vs AE

I Steep fall initially, flatter later; choose dimensionality = 32

2136

Network Architecture

I Network size: [91, 91, 32, 91, 91]

I 91 is the dimension of the input CCs, 32 is the latent space dimensionality

I Linear Fully Connected Layers + Leaky ReLU activations

I ADAM [Kingma and Ba 2014] with initial lr = 10⁻³

I Training for 2000 epochs with batch size 512 (a model sketch follows below)
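Putting the stated sizes together, a hedged PyTorch sketch of such a network; exactly how f0 is normalized and injected is an assumption, not specified on the slide.

```python
import torch
import torch.nn as nn

class CVAE(nn.Module):
    """[91, 91, 32, 91, 91] fully connected CVAE sketch with pitch conditioning."""
    def __init__(self, cc_dim=91, cond_dim=1, latent_dim=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(cc_dim + cond_dim, 91), nn.LeakyReLU())
        self.mu = nn.Linear(91, latent_dim)
        self.logvar = nn.Linear(91, latent_dim)
        self.dec = nn.Sequential(
            nn.Linear(latent_dim + cond_dim, 91), nn.LeakyReLU(),
            nn.Linear(91, cc_dim))

    def forward(self, ccs, f0):
        # ccs: (batch, 91) cepstral coefficients; f0: (batch, 1) conditioning pitch
        h = self.enc(torch.cat([ccs, f0], dim=-1))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        x_hat = self.dec(torch.cat([z, f0], dim=-1))
        return x_hat, mu, logvar

model = CVAE()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Assumed training loop: 2000 epochs over frames, batch size 512, minimizing a
# loss of the form cvae_loss(x_hat, ccs, mu, logvar, beta=0.1) as sketched earlier.
```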

2236

Experiments

I Two kinds of experiments to demonstrate the network's capabilities

1 Reconstruction - Omit pitch instances during training and see how well the model reconstructs notes of the omitted target pitch

2 Generation - How well the model 'synthesizes' note instances with new, unseen pitches

2336

Reconstruction

I Two training contexts -

1 T (×) is the target, X marks the training instances:
MIDI: T - 3, T - 2, T - 1, T, T + 1, T + 2, T + 3
Kept:  X,    X,    X,   ×,  X,    X,    X

2 Octave endpoints:
MIDI: 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71
Kept:  X,  ×,  ×,  ×,  ×,  ×,  ×,  ×,  ×,  ×,  ×,  X

I In each of the above cases we compute the MSE as the frame-wise spectral envelope match across all frames of all the target instances (a sketch of the splits follows below)
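The two training/held-out splits can be written down directly; a small Python sketch of the splits (the MSE is then averaged over every frame of every held-out instance).

```python
def neighbour_context(target_midi):
    """Context 1: train on the three neighbouring pitches on either side,
    hold out the target itself."""
    train = [target_midi + d for d in (-3, -2, -1, 1, 2, 3)]
    return train, [target_midi]

def octave_endpoint_context():
    """Context 2: train only on the octave endpoints (MIDI 60 and 71),
    hold out MIDI 61-70."""
    return [60, 71], list(range(61, 71))
```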

2436

Reconstruction

[Figure: average MSE (×10⁻²) vs. target pitch (MIDI 60–72) for the AE and the CVAE]

I Both AE and CVAE reasonably reconstruct the target pitch (audio examples 6, 7, 8)

I CVAE produces better reconstruction, especially when the target pitch is far from the pitches available in the training data (plot above; audio examples 9, 10, 11)

I Conditioning helps to capture the pitch dependency of the spectral envelope more accurately


2536

Reconstruction

I To emulate the effect of pitch conditioning with an AE, we train the AE by appending the pitch to the input CCs and reconstructing this appended input, as shown below

[Diagram: (f0, CCs) → cAE → (f0′, CCs′)]

Figure: 'Conditional' AE (cAE)

I [Wyse 2018] followed a similar approach of appending the conditional variables to the input of his model

I The 'cAE' is comparable to our proposed CVAE in that the network might potentially learn something from f0 (a sketch follows below)
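A minimal sketch of how such a cAE batch could be assembled (names and shapes are assumptions), which makes the contrast with the CVAE explicit: the pitch is only another input dimension to be reconstructed, and nothing forces the network to use it.

```python
import torch

def cae_batch(ccs, f0):
    """'Conditional' AE baseline: append the pitch to the input CCs and let a
    plain autoencoder reconstruct the appended vector [CCs, f0] -> [CCs', f0'].
    ccs: (batch, 91), f0: (batch, 1)."""
    x = torch.cat([ccs, f0], dim=-1)   # input to the plain AE
    target = x                         # target is the appended input itself
    return x, target
```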

2636

Reconstruction

[Figure: average MSE (×10⁻²) vs. target pitch (MIDI 60–72) for the AE, CVAE and cAE]

I We train the cAE on the octave endpoints and evaluate performance as shown above

I Appending f0 does not improve performance; it is in fact worse than the AE

2736

Generation

I We are interested in 'synthesizing' audio

I We see how well the network can generate an instance of a desired pitch (which the network has not been trained on)

[Diagram: latent sample Z and f0 → Decoder]

Figure: Sampling from the network

I We follow the procedure in [Blaauw and Bonada 2016] - a random walk to sample points coherently from the latent space (see the sketch below)
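A hedged sketch of such a generation loop, assuming the trained CVAE decoder from the earlier sketch; the step size and the exact walk are illustrative, not the precise recipe of [Blaauw and Bonada 2016].

```python
import torch

@torch.no_grad()
def random_walk_generate(decoder, f0_frames, latent_dim=32, step=0.05):
    """Decode one spectral-envelope (CC) frame per f0 value while the latent
    point drifts slowly, so consecutive frames stay coherent."""
    z = torch.randn(latent_dim)                  # start from the prior
    frames = []
    for f0 in f0_frames:                         # e.g. a vibrato pitch contour
        z = z + step * torch.randn(latent_dim)   # small, correlated random move
        frames.append(decoder(torch.cat([z, f0.view(1)])))
    return torch.stack(frames)
```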

2836

Generation

I We train on instances across the entire octave sans MIDI 65 and then generate MIDI 65

I On listening, we find that the generated audio still lacks the soft, noisy sound of the violin bowing (audio examples 12, 13, 14)

I For a more realistic synthesis, we generate a violin note with vibrato (audio example 15)


2936

Putting it all together

I We explored autoencoder frameworks in generative models for audio synthesis of instrumental tones

I Our parametric representation decouples 'timbre' and 'pitch', relying on the network to model their inter-dependencies

I Pitch conditioning allows us to generate the learnt spectral envelope for that pitch, thus enabling us to vary the pitch contour continuously

3036

Putting it all together

I Our model is still far from perfect, however, and we have to improve it further:

1 An immediate next step is to model the residual of the input audio. This will help make the generated audio sound more natural

2 We have not taken dynamics into account. A more complete system will involve capturing spectral envelope (timbral) dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis of larger pitch movements or melodic elements from Indian Classical music

I Our contributions:

1 To the best of our knowledge, we have not come across any work using a parametric model for musical tones in the neural synthesis framework, especially one exploiting the conditioning function of the CVAE

2 The TAE method has no known Python implementation, so we plan to make our code open-source to the MIR community for research

3136

References I

[Blaauw and Bonada 2016] Blaauw, M. and Bonada, J. (2016). Modeling and transforming speech using variational autoencoders. In Interspeech, pages 1770–1774.

[Caetano and Rodet 2012] Caetano, M. and Rodet, X. (2012). A source-filter model for musical instrument sound transformation. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 137–140. IEEE.

[Doersch 2016] Doersch, C. (2016). Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.

[Engel et al 2017] Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D. and Simonyan, K. (2017). Neural audio synthesis of musical notes with WaveNet autoencoders. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1068–1077. JMLR.org.

[Esling et al 2018] Esling, P., Bitton, A. et al. (2018). Generative timbre spaces: regularizing variational auto-encoders with perceptual metrics. arXiv preprint arXiv:1805.08501.

3236

References II

[Griffin and Lim 1984] Griffin, D. and Lim, J. (1984). Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236–243.

[Hinton and Salakhutdinov 2006] Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507.

[IMAI 1979] Imai, S. (1979). Spectral envelope extraction by improved cepstrum. IEICE, 62:217–228.

[Kingma and Ba 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

[Kingma and Welling 2013] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

[Oord et al 2016] Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A. and Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.

3336

References III

[Roche et al 2018] Roche, F., Hueber, T., Limier, S. and Girin, L. (2018). Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models. arXiv preprint arXiv:1806.04096.

[Roebel and Rodet 2005] Roebel, A. and Rodet, X. (2005). Efficient spectral envelope estimation and its application to pitch shifting and envelope preservation. In International Conference on Digital Audio Effects, pages 30–35, Madrid, Spain.

[Romani Picas et al 2015] Romani Picas, O., Parra Rodriguez, H., Dabiri, D., Tokuda, H., Hariya, W., Oishi, K. and Serra, X. (2015). A real-time system for measuring sound goodness in instrumental sounds. In Audio Engineering Society Convention 138. Audio Engineering Society.

[Sarroff and Casey 2014] Sarroff, A. M. and Casey, M. A. (2014). Musical audio synthesis using autoencoding neural nets. In ICMC.

[Serra et al 1997] Serra, X. et al. (1997). Musical sound modeling with sinusoids plus noise. Musical Signal Processing, pages 91–122.

3436

References IV

[Slawson 1981] Slawson, W. (1981). The color of sound: a theoretical study in musical timbre. Music Theory Spectrum, 3:132–141.

[Sohn et al 2015] Sohn, K., Lee, H. and Yan, X. (2015). Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483–3491.

[Wyse 2018] Wyse, L. (2018). Real-valued parametric conditioning of an RNN for interactive sound synthesis. arXiv preprint arXiv:1805.10808.

3536

Audio examples description I

1 Input MIDI 63 to Spectral Model

2 Spectral Model Reconstruction (trained on MIDI 63)

3 Spectral Model Reconstruction (not trained on MIDI 63)

4 Input MIDI 60 note to Parametric Model

5 Parametric Reconstruction of input note

6 Input MIDI 63 Note

7 Parametric AE reconstruction of input

8 Parametric CVAE reconstruction of input

9 Input MIDI 65 note (endpoint-trained model)

10 Parametric AE reconstruction of input (endpoint-trained model)

11 Parametric CVAE reconstruction of input (endpoint-trained model)

12 CVAE Generated MIDI 65 Violin note

13 Similar MIDI 65 Violin note from dataset

3636

Audio examples description II

14 CVAE Reconstruction of the MIDI 65 violin note

15 CVAE Generated MIDI 65 Violin note with vibrato

436

Our Nearest Neighbours

I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra

X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis

times lsquoGraininessrsquo in the reconstructed sound

I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures

X Helped by the release of NSynth [Engel et al 2017]

X Usability of the latent space in audio interpolation

I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments

X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space

436

Our Nearest Neighbours

I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra

X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis

times lsquoGraininessrsquo in the reconstructed sound

I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures

X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation

I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments

X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space

436

Our Nearest Neighbours

I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra

X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis

times lsquoGraininessrsquo in the reconstructed sound

I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures

X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation

I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments

X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space

436

Our Nearest Neighbours

I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra

X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis

times lsquoGraininessrsquo in the reconstructed sound

I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures

X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation

I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments

X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space

536

Our Nearest Neighbours

I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution

I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis

X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from

I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class

X Was able to achieve better control over generation byconditioning

times Unable to generalize to untrained pitches

536

Our Nearest Neighbours

I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution

I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis

X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from

I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class

X Was able to achieve better control over generation byconditioning

times Unable to generalize to untrained pitches

536

Our Nearest Neighbours

I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution

I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis

X Generation of realistic and creative sounds

times Complex network which was difficult to train and sample from

I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class

X Was able to achieve better control over generation byconditioning

times Unable to generalize to untrained pitches

536

Our Nearest Neighbours

I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution

I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis

X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from

I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class

X Was able to achieve better control over generation byconditioning

times Unable to generalize to untrained pitches

536

Our Nearest Neighbours

I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution

I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis

X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from

I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class

X Was able to achieve better control over generation byconditioning

times Unable to generalize to untrained pitches

536

Our Nearest Neighbours

I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution

I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis

X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from

I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class

X Was able to achieve better control over generation byconditioning

times Unable to generalize to untrained pitches

536

Our Nearest Neighbours

I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution

I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis

X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from

I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class

X Was able to achieve better control over generation byconditioning

times Unable to generalize to untrained pitches

636

Why Parametric

I Rather than generating new timbres(lsquointerpolatingrsquo acrosssounds) we consider the problem of synthesis of a giveninstrument sound with flexible control over the pitch andloudness dynamics

I Pitch shifting without timbre modification uses a source-filtermodel with the filter(spectral envelope) being kept constant[Roebel and Rodet 2005]

I A powerful parametric representation over raw waveform orspectrogram has the potential to achieve high quality with lesstraining data

1 [Blaauw and Bonada 2016] recognized this in context ofspeech synthesis and used a vocoder representation to train agenerative model achieving promising results along the way

636

Why Parametric

I Rather than generating new timbres(lsquointerpolatingrsquo acrosssounds) we consider the problem of synthesis of a giveninstrument sound with flexible control over the pitch andloudness dynamics

I Pitch shifting without timbre modification uses a source-filtermodel with the filter(spectral envelope) being kept constant[Roebel and Rodet 2005]

I A powerful parametric representation over raw waveform orspectrogram has the potential to achieve high quality with lesstraining data

1 [Blaauw and Bonada 2016] recognized this in context ofspeech synthesis and used a vocoder representation to train agenerative model achieving promising results along the way

736

Dataset

I Good-sounds dataset [Romani Picas et al 2015] consisting ofindividual note and scale recordings for 12 different instruments

I We work with the lsquoViolinrsquo played in mezzo-forte loudnessand choose the 4th octave(MIDI 60-71)

0 5 10 15 20 25 30 35

606162636465666768697071

282626262626

343030

262626

Instances

MID

I

Figure Instances per note in the overall dataset

I Average note duration for the chosen octave is about 45s pernote

736

Dataset

I Good-sounds dataset [Romani Picas et al 2015] consisting ofindividual note and scale recordings for 12 different instruments

I We work with the lsquoViolinrsquo played in mezzo-forte loudnessand choose the 4th octave(MIDI 60-71)

Why we chose ViolinPopular in Indian Music Human voice-like timbre

Ability to produce continuous pitch

0 5 10 15 20 25 30 35

606162636465666768697071

282626262626

343030

262626

Instances

MID

I

Figure Instances per note in the overall dataset

I Average note duration for the chosen octave is about 45s pernote

836

Dataset

I We split the data to train(80) and test(20) instances acrossMIDI note labels

I Our model is trained with frames(duration 213ms) from thetrain instances and system performance is evaluated withframes from the test instances

936

Non-Parametric Reconstruction

I Framewise magnitude spectra reconstruction procedure in[Roche et al 2018] cannot generalize to untrained pitches

I We consider two cases - Train including and excluding MIDI 63along with its 3 neighbouring pitches on either side

Sustain

FFT

En

cod

er

Z

Dec

od

er

AE

I Repeat above across all input spectral frames and invertobtained spectrogram using Griffin-Lim [Griffin and Lim 1984]

936

Non-Parametric Reconstruction

I Framewise magnitude spectra reconstruction procedure in[Roche et al 2018] cannot generalize to untrained pitches

I We consider two cases - Train including and excluding MIDI 63along with its 3 neighbouring pitches on either side

Sustain

FFT

En

cod

er

Z

Dec

od

er

AE

I Repeat above across all input spectral frames and invertobtained spectrogram using Griffin-Lim [Griffin and Lim 1984]

936

Non-Parametric Reconstruction

I Framewise magnitude spectra reconstruction procedure in[Roche et al 2018] cannot generalize to untrained pitches

I We consider two cases - Train including and excluding MIDI 63along with its 3 neighbouring pitches on either side

Sustain

FFT

En

cod

er

Z

Dec

od

er

AE

I Repeat above across all input spectral frames and invertobtained spectrogram using Griffin-Lim [Griffin and Lim 1984]

1036

Non-Parametric Reconstruction

Figure Input MIDI 63 1 1

Figure Including MIDI 63 2 2 Figure Excluding MIDI 63 3 3

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton1)ocgs[i]state=false

1728

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton3)ocgs[i]state=false

1728

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton5)ocgs[i]state=false

1728

1136

Parametric Model

1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)

Sustain

FFTHpR

f0Magnitudes

I Output of HpR block =rArr log-dB magnitudes + harmonics

1136

Parametric Model

1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)

Sustain

FFTHpR

f0Magnitudes

I Output of HpR block =rArr log-dB magnitudes + harmonics

1136

Parametric Model

1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)

Sustain

FFTHpR

f0Magnitudes

I Output of HpR block =rArr log-dB magnitudes + harmonics

1236

Parametric Model

2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]

I TAE =rArr Iterative Cepstral Liftering

A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))

Vi = FFT (lifter(Ai ))

f0Magnitudes

TAE

CCs

f0

1236

Parametric Model

2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]

I TAE =rArr Iterative Cepstral Liftering

A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))

Vi = FFT (lifter(Ai ))

f0Magnitudes

TAE

CCs

f0

1236

Parametric Model

2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]

I TAE =rArr Iterative Cepstral Liftering

A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))

Vi = FFT (lifter(Ai ))

f0Magnitudes

TAE

CCs

f0

1336

Parametric Model

I No open source implementation available for the TAE thus weimplemented it following procedure highlighted in[Roebel and Rodet 2005 Caetano and Rodet 2012]

I Figure shows a TAE snap from [Caetano and Rodet 2012]

I Similar to results we get 1 4 2 5

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton6)ocgs[i]state=false

3216

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton7)ocgs[i]state=false

2256

1436

Parametric Model

0 1 2 3 4 5minus80

minus60

minus40

minus20

Frequency(kHz)Magnitude(dB)

606571

I Spectral envelope shape varies across pitch

1 Dependence of envelope on pitch[Slawson 1981 Caetano and Rodet 2012]

2 Variation due the TAE algorithm

I TAE rarr smooth function to estimate harmonic amplitudes

1536

Parametric Model

I No of CCs(Cepstral CoefficientsKcc) depends on Samplingrate(Fs)pitch(f0)

Kcc leFs2f0

I For our network we choose the maximum Kcc(lowest pitch) =91 and zero pad the high pitch Kcc to 91 dimension vectors

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1736

Generative Models

Why VAE over AE

Continuous latent space from which we can sample points(andsynthesize the corresponding audio)

Why CVAE over VAE

Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control

1736

Generative Models

Why VAE over AE

Continuous latent space from which we can sample points(andsynthesize the corresponding audio)

Why CVAE over VAE

Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control

1736

Generative Models

Why VAE over AE

Continuous latent space from which we can sample points(andsynthesize the corresponding audio)

Why CVAE over VAE

Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control

1736

Generative Models

Why VAE over AE

Continuous latent space from which we can sample points(andsynthesize the corresponding audio)

Why CVAE over VAE

Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control

1836

Network Architecture

CCs

f0

En

cod

er

Z

Dec

od

er

CVAE

I Network input is CCs rarr MSE represents perceptually relevantdistance in terms of squared error between the input andreconstructed log magnitude spectral envelopes

I We train the network on frames from the train instances

I For evaluation MSE values calculated ahead is the averagereconstruction error across all the test instance frames

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

L prop MSE + βKLD

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

2036

Network Architecture

I Main hyperparameters -

2 Dimensionality of latent space - networks reconstruction ability

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

CVAEAE

Figure CVAE(β = 01) vs AE

I Steep fall initially flatter later Choose dimensionality = 32

2036

Network Architecture

I Main hyperparameters -

2 Dimensionality of latent space - networks reconstruction ability

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

CVAEAE

Figure CVAE(β = 01) vs AE

I Steep fall initially flatter later Choose dimensionality = 32

2036

Network Architecture

I Main hyperparameters -

2 Dimensionality of latent space - networks reconstruction ability

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

CVAEAE

Figure CVAE(β = 01) vs AE

I Steep fall initially flatter later Choose dimensionality = 32

2136

Network Architecture

I Network size [91 91 32 91 91]I 91 is the dimension of input CCs 32 is latent space

dimensionality

I Linear Fully Connected Layers + Leaky ReLU activations

I ADAM [Kingma and Ba 2014] with initial lr = 10minus3

I Training for 2000 epochs with batch size 512

2236

Experiments

I Two kinds of experiments to demonstrate networks capabilities

1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch

2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches

2236

Experiments

I Two kinds of experiments to demonstrate networks capabilities

1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch

2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches

2236

Experiments

I Two kinds of experiments to demonstrate networks capabilities

1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch

2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2436

Reconstruction

60 62 64 66 68 70 72

1

2

3

4

5middot10minus2

Pitch

MSE

AECVAE

I Both AE and CVAE reasonably reconstruct the target pitch1 6 2 7 3 8

I CVAE produces better reconstruction especially when thetarget pitch is far from the pitches available in the trainingdata(plot above) 4 9 5 10 6 11

I Conditioning helps to capture the pitch dependency of thespectral envelope more accurately

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton8)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton9)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton10)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton11)ocgs[i]state=false

1536

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton12)ocgs[i]state=false

1536

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton13)ocgs[i]state=false

1536

2536

Reconstruction

I To emulate the effect of pitch conditioning with an AE wetrain the AE by appending the pitch to the input CCs andreconstructing this appended input as shown below

f0

CCscAE

fprime

0

CCsprime

Figure lsquoConditionalrsquo AE(cAE)

I [Wyse 2018] followed a similar approach of appending theconditional variables to the input of his model

I the lsquocAErsquo is comparable to our proposed CVAE in that thenetwork might potentially learn something from f0

2636

Reconstruction

60 62 64 66 68 70 72

1

2

3

4

5middot10minus2

Pitch

MSE

AECVAEcAE

I We train the cAE on the octave endpoints and evaluateperformance as shown above

I Appending f0 does not improve performance It is infact worsethan AE

2736

Generation

I We are interested in lsquosynthesizingrsquo audio

I We see how well the network can generate an instance of adesired pitch(which the network has not been trained on)

Z

Dec

od

er

f0

Figure Sampling from the Network

I We follow the procedure in [Blaauw and Bonada 2016] - arandom walk to sample points coherently from the latent space

2836

Generation

I We train on instances across the entire octave sans MIDI 65and then generate MIDI 65

I On listening we can see that the generated audio still lacks thesoft noisy sound of the violin bowing

1 12 2 13 3 14

I For a more realistic synthesis we generate a violin note withvibrato 4 15

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton14)ocgs[i]state=false

2184

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton15)ocgs[i]state=false

2256

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton16)ocgs[i]state=false

2184

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton17)ocgs[i]state=false

2184

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research


References I

[Blaauw and Bonada, 2016] Blaauw, M. and Bonada, J. (2016). Modeling and transforming speech using variational autoencoders. In Interspeech, pages 1770–1774.

[Caetano and Rodet, 2012] Caetano, M. and Rodet, X. (2012). A source-filter model for musical instrument sound transformation. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 137–140. IEEE.

[Doersch, 2016] Doersch, C. (2016). Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.

[Engel et al., 2017] Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D., and Simonyan, K. (2017). Neural audio synthesis of musical notes with WaveNet autoencoders. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1068–1077. JMLR.org.

[Esling et al., 2018] Esling, P., Bitton, A., et al. (2018). Generative timbre spaces: regularizing variational auto-encoders with perceptual metrics. arXiv preprint arXiv:1805.08501.


References II

[Griffin and Lim, 1984] Griffin, D. and Lim, J. (1984). Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236–243.

[Hinton and Salakhutdinov, 2006] Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507.

[Imai, 1979] Imai, S. (1979). Spectral envelope extraction by improved cepstrum. IEICE, 62:217–228.

[Kingma and Ba, 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

[Kingma and Welling, 2013] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

[Oord et al., 2016] Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.


References III

[Roche et al., 2018] Roche, F., Hueber, T., Limier, S., and Girin, L. (2018). Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models. arXiv preprint arXiv:1806.04096.

[Roebel and Rodet, 2005] Roebel, A. and Rodet, X. (2005). Efficient spectral envelope estimation and its application to pitch shifting and envelope preservation. In International Conference on Digital Audio Effects, pages 30–35, Madrid, Spain.

[Romani Picas et al., 2015] Romani Picas, O., Parra Rodriguez, H., Dabiri, D., Tokuda, H., Hariya, W., Oishi, K., and Serra, X. (2015). A real-time system for measuring sound goodness in instrumental sounds. In Audio Engineering Society Convention 138. Audio Engineering Society.

[Sarroff and Casey, 2014] Sarroff, A. M. and Casey, M. A. (2014). Musical audio synthesis using autoencoding neural nets. In ICMC.

[Serra et al., 1997] Serra, X. et al. (1997). Musical sound modeling with sinusoids plus noise. Musical Signal Processing, pages 91–122.


References IV

[Slawson, 1981] Slawson, W. (1981). The color of sound: a theoretical study in musical timbre. Music Theory Spectrum, 3:132–141.

[Sohn et al., 2015] Sohn, K., Lee, H., and Yan, X. (2015). Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483–3491.

[Wyse, 2018] Wyse, L. (2018). Real-valued parametric conditioning of an RNN for interactive sound synthesis. arXiv preprint arXiv:1805.10808.


Audio examples description I

1 Input MIDI 63 to Spectral Model

2 Spectral Model Reconstruction (trained on MIDI 63)

3 Spectral Model Reconstruction (not trained on MIDI 63)

4 Input MIDI 60 note to Parametric Model

5 Parametric Reconstruction of input note

6 Input MIDI 63 Note

7 Parametric AE reconstruction of input

8 Parametric CVAE reconstruction of input

9 Input MIDI 65 note (endpoint trained model)

10 Parametric AE reconstruction of input (endpoint trained model)

11 Parametric CVAE reconstruction of input (endpoint trained model)

12 CVAE Generated MIDI 65 Violin note

13 Similar MIDI 65 Violin note from dataset


Audio examples description II

14 CVAE Reconstruction of the MIDI 65 violin note

15 CVAE Generated MIDI 65 Violin note with vibrato

Page 15: A Variational Parametric Model for Audio Synthesiskrishnasubramani/data/ddp_stage… · analysis-synthesis based reconstruction which fails to model temporal evolution I[Engel et

436

Our Nearest Neighbours

I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra

X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis

times lsquoGraininessrsquo in the reconstructed sound

I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures

X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation

I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments

X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space

436

Our Nearest Neighbours

I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra

X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis

times lsquoGraininessrsquo in the reconstructed sound

I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures

X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation

I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments

X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space

536

Our Nearest Neighbours

I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution

I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis

X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from

I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class

X Was able to achieve better control over generation byconditioning

times Unable to generalize to untrained pitches

536

Our Nearest Neighbours

I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution

I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis

X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from

I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class

X Was able to achieve better control over generation byconditioning

times Unable to generalize to untrained pitches

536

Our Nearest Neighbours

I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution

I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis

X Generation of realistic and creative sounds

times Complex network which was difficult to train and sample from

I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class

X Was able to achieve better control over generation byconditioning

times Unable to generalize to untrained pitches

536

Our Nearest Neighbours

I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution

I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis

X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from

I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class

X Was able to achieve better control over generation byconditioning

times Unable to generalize to untrained pitches

536

Our Nearest Neighbours

I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution

I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis

X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from

I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class

X Was able to achieve better control over generation byconditioning

times Unable to generalize to untrained pitches

536

Our Nearest Neighbours

I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution

I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis

X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from

I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class

X Was able to achieve better control over generation byconditioning

times Unable to generalize to untrained pitches

536

Our Nearest Neighbours

I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution

I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis

X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from

I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class

X Was able to achieve better control over generation byconditioning

times Unable to generalize to untrained pitches

636

Why Parametric

I Rather than generating new timbres(lsquointerpolatingrsquo acrosssounds) we consider the problem of synthesis of a giveninstrument sound with flexible control over the pitch andloudness dynamics

I Pitch shifting without timbre modification uses a source-filtermodel with the filter(spectral envelope) being kept constant[Roebel and Rodet 2005]

I A powerful parametric representation over raw waveform orspectrogram has the potential to achieve high quality with lesstraining data

1 [Blaauw and Bonada 2016] recognized this in context ofspeech synthesis and used a vocoder representation to train agenerative model achieving promising results along the way

636

Why Parametric

I Rather than generating new timbres(lsquointerpolatingrsquo acrosssounds) we consider the problem of synthesis of a giveninstrument sound with flexible control over the pitch andloudness dynamics

I Pitch shifting without timbre modification uses a source-filtermodel with the filter(spectral envelope) being kept constant[Roebel and Rodet 2005]

I A powerful parametric representation over raw waveform orspectrogram has the potential to achieve high quality with lesstraining data

1 [Blaauw and Bonada 2016] recognized this in context ofspeech synthesis and used a vocoder representation to train agenerative model achieving promising results along the way

736

Dataset

I Good-sounds dataset [Romani Picas et al 2015] consisting ofindividual note and scale recordings for 12 different instruments

I We work with the lsquoViolinrsquo played in mezzo-forte loudnessand choose the 4th octave(MIDI 60-71)

0 5 10 15 20 25 30 35

606162636465666768697071

282626262626

343030

262626

Instances

MID

I

Figure Instances per note in the overall dataset

I Average note duration for the chosen octave is about 45s pernote

736

Dataset

I Good-sounds dataset [Romani Picas et al 2015] consisting ofindividual note and scale recordings for 12 different instruments

I We work with the lsquoViolinrsquo played in mezzo-forte loudnessand choose the 4th octave(MIDI 60-71)

Why we chose ViolinPopular in Indian Music Human voice-like timbre

Ability to produce continuous pitch

0 5 10 15 20 25 30 35

606162636465666768697071

282626262626

343030

262626

Instances

MID

I

Figure Instances per note in the overall dataset

I Average note duration for the chosen octave is about 45s pernote

836

Dataset

I We split the data to train(80) and test(20) instances acrossMIDI note labels

I Our model is trained with frames(duration 213ms) from thetrain instances and system performance is evaluated withframes from the test instances

936

Non-Parametric Reconstruction

I Framewise magnitude spectra reconstruction procedure in[Roche et al 2018] cannot generalize to untrained pitches

I We consider two cases - Train including and excluding MIDI 63along with its 3 neighbouring pitches on either side

Sustain

FFT

En

cod

er

Z

Dec

od

er

AE

I Repeat above across all input spectral frames and invertobtained spectrogram using Griffin-Lim [Griffin and Lim 1984]

936

Non-Parametric Reconstruction

I Framewise magnitude spectra reconstruction procedure in[Roche et al 2018] cannot generalize to untrained pitches

I We consider two cases - Train including and excluding MIDI 63along with its 3 neighbouring pitches on either side

Sustain

FFT

En

cod

er

Z

Dec

od

er

AE

I Repeat above across all input spectral frames and invertobtained spectrogram using Griffin-Lim [Griffin and Lim 1984]

936

Non-Parametric Reconstruction

I Framewise magnitude spectra reconstruction procedure in[Roche et al 2018] cannot generalize to untrained pitches

I We consider two cases - Train including and excluding MIDI 63along with its 3 neighbouring pitches on either side

Sustain

FFT

En

cod

er

Z

Dec

od

er

AE

I Repeat above across all input spectral frames and invertobtained spectrogram using Griffin-Lim [Griffin and Lim 1984]

1036

Non-Parametric Reconstruction

Figure Input MIDI 63 1 1

Figure Including MIDI 63 2 2 Figure Excluding MIDI 63 3 3

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton1)ocgs[i]state=false

1728

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton3)ocgs[i]state=false

1728

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton5)ocgs[i]state=false

1728

1136

Parametric Model

1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)

Sustain

FFTHpR

f0Magnitudes

I Output of HpR block =rArr log-dB magnitudes + harmonics

1136

Parametric Model

1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)

Sustain

FFTHpR

f0Magnitudes

I Output of HpR block =rArr log-dB magnitudes + harmonics

1136

Parametric Model

1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)

Sustain

FFTHpR

f0Magnitudes

I Output of HpR block =rArr log-dB magnitudes + harmonics

1236

Parametric Model

2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]

I TAE =rArr Iterative Cepstral Liftering

A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))

Vi = FFT (lifter(Ai ))

f0Magnitudes

TAE

CCs

f0


1336

Parametric Model

I No open source implementation is available for the TAE, thus we implemented it following the procedure highlighted in [Roebel and Rodet 2005, Caetano and Rodet 2012]

I Figure shows a TAE snapshot from [Caetano and Rodet 2012]

I Similar to the results we get (audio examples 4, 5)


1436

Parametric Model

[Plot: TAE spectral envelopes, Magnitude (dB, −80 to −20) vs Frequency (0-5 kHz), for MIDI 60, 65, 71]

I Spectral envelope shape varies across pitch

1 Dependence of the envelope on pitch [Slawson 1981, Caetano and Rodet 2012]

2 Variation due to the TAE algorithm

I TAE → smooth function to estimate harmonic amplitudes

1536

Parametric Model

I No. of CCs (cepstral coefficients, Kcc) depends on the sampling rate (Fs) and pitch (f0):

Kcc ≤ Fs / (2 f0)

I For our network we choose the maximum Kcc (lowest pitch) = 91 and zero-pad the higher-pitch CC vectors to 91 dimensions (see the sketch below)
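A small sketch of this bookkeeping, assuming Fs = 48 kHz (which gives Kcc = 91 for MIDI 60) and a standard MIDI-to-Hz conversion:

```python
import numpy as np

def midi_to_hz(m):
    return 440.0 * 2 ** ((m - 69) / 12)

fs = 48000                                   # assumed sampling rate of the recordings
kcc_max = int(fs / (2 * midi_to_hz(60)))     # Kcc <= Fs / (2*f0)  ->  91 for the lowest pitch

# e.g. a dummy CC vector for the highest pitch (MIDI 71), which is shorter ...
cc_high = np.random.randn(int(fs / (2 * midi_to_hz(71))))
# ... zero-padded to the fixed 91-dimensional network input
cc_padded = np.pad(cc_high, (0, kcc_max - len(cc_high)))
```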

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimize the MSE (Mean Squared Error) between input and network reconstruction

• Simple to train and performs good reconstruction (since they minimize MSE)

• Not truly a generative model, as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] - Inspired by Variational Inference; enforce a prior on the latent space

• Truly generative, as we can 'generate' new data by sampling from the prior

I Conditional Variational Autoencoders [Doersch 2016, Sohn et al 2015] - Same principle as a VAE, however learns the conditional distribution over an additional conditioning variable


1736

Generative Models

Why VAE over AE?

Continuous latent space from which we can sample points (and synthesize the corresponding audio)

Why CVAE over VAE?

Conditioning on pitch ⇒ the network captures dependencies between the timbre and the pitch ⇒ more accurate envelope generation + pitch control


1836

Network Architecture

[Block diagram: CCs, f0 → Encoder → Z → Decoder (CVAE)]

I Network input is CCs → MSE represents a perceptually relevant distance, in terms of squared error between the input and reconstructed log magnitude spectral envelopes

I We train the network on frames from the train instances

I For evaluation, the MSE values reported ahead are the average reconstruction error across all the test instance frames

1936

Network Architecture

I Main hyperparameters -

1 β - Controls the relative weighting between reconstruction and prior enforcement (a loss sketch follows below):

L ∝ MSE + β · KLD

[Plot: MSE vs latent space dimensionality (0-70) for β = 0.01, 0.1, 1]

Figure: CVAE, varying β

I Tradeoff between both terms; choose β = 0.1
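A sketch of the loss above in PyTorch, assuming a Gaussian encoder q(z | x, f0) that outputs mu and logvar and a standard normal prior:

```python
import torch
import torch.nn.functional as F

def cvae_loss(x_hat, x, mu, logvar, beta=0.1):
    """L ∝ MSE + β·KLD, as on the slide.
    x, x_hat   : input and reconstructed CC vectors
    mu, logvar : encoder outputs parameterizing q(z | x, f0)"""
    mse = F.mse_loss(x_hat, x, reduction="mean")
    # KL divergence between N(mu, sigma^2) and the standard normal prior
    kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return mse + beta * kld
```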


2036

Network Architecture

I Main hyperparameters -

2 Dimensionality of latent space - the network's reconstruction ability

[Plot: MSE vs latent space dimensionality (0-70) for CVAE and AE]

Figure: CVAE (β = 0.1) vs AE

I Steep fall initially, flatter later. Choose dimensionality = 32


2136

Network Architecture

I Network size: [91, 91, 32, 91, 91]

I 91 is the dimension of the input CCs; 32 is the latent space dimensionality

I Linear fully connected layers + Leaky ReLU activations

I ADAM [Kingma and Ba 2014] with initial lr = 10^-3

I Training for 2000 epochs with batch size 512 (a sketch of this setup follows below)
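A sketch of this setup in PyTorch; the exact placement of the f0 conditioning (here simply concatenated to the encoder and decoder inputs) and the split of the layer sizes between encoder and decoder are our assumptions:

```python
import torch
import torch.nn as nn

class CVAE(nn.Module):
    """Sketch of the [91, 91, 32, 91, 91] CVAE with pitch conditioning."""
    def __init__(self, n_cc=91, n_hidden=91, n_latent=32, n_cond=1):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_cc + n_cond, n_hidden), nn.LeakyReLU())
        self.mu = nn.Linear(n_hidden, n_latent)
        self.logvar = nn.Linear(n_hidden, n_latent)
        self.dec = nn.Sequential(
            nn.Linear(n_latent + n_cond, n_hidden), nn.LeakyReLU(),
            nn.Linear(n_hidden, n_cc))

    def forward(self, cc, f0):
        h = self.enc(torch.cat([cc, f0], dim=-1))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization trick
        return self.dec(torch.cat([z, f0], dim=-1)), mu, logvar

model = CVAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)   # ADAM, initial lr = 1e-3
# training loop omitted: 2000 epochs, batch size 512, cvae_loss from the sketch above
```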

2236

Experiments

I Two kinds of experiments to demonstrate the network's capabilities

1 Reconstruction - Omit pitch instances during training and see how well the model reconstructs notes of the omitted target pitch

2 Generation - How well the model 'synthesizes' note instances with new, unseen pitches


2336

Reconstruction

I Two training contexts -

1 T (×) is the target, X marks training instances
  MIDI:  T−3  T−2  T−1   T   T+1  T+2  T+3
  Kept:   X    X    X    ×    X    X    X

2 Octave endpoints
  MIDI:  60  61  62  63  64  65
  Kept:   X   ×   ×   ×   ×   ×
  MIDI:  66  67  68  69  70  71
  Kept:   ×   ×   ×   ×   ×   X

I In each of the above cases we compute the MSE as the frame-wise spectral envelope match across all frames of all the target instances (a split sketch follows below)
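The two training contexts written out as note sets; a minimal sketch with names of our own choosing:

```python
target = 63
octave = set(range(60, 72))                                    # MIDI 60-71

# context 1: train on the 3 neighbours either side of the target, hold out the target
ctx1_train = {target + d for d in (-3, -2, -1, 1, 2, 3)}
ctx1_test = {target}

# context 2: train only on the octave endpoints, evaluate on the interior notes
ctx2_train = {60, 71}
ctx2_test = octave - ctx2_train
```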


2436

Reconstruction

[Plot: reconstruction MSE (·10^-2) vs target pitch (MIDI 60-72) for AE and CVAE]

I Both AE and CVAE reasonably reconstruct the target pitch (audio examples 6, 7, 8)

I CVAE produces better reconstruction, especially when the target pitch is far from the pitches available in the training data (plot above; audio examples 9, 10, 11)

I Conditioning helps to capture the pitch dependency of the spectral envelope more accurately


2536

Reconstruction

I To emulate the effect of pitch conditioning with an AE, we train the AE by appending the pitch to the input CCs and reconstructing this appended input, as shown below

[Block diagram: f0, CCs → cAE → f0′, CCs′]

Figure: 'Conditional' AE (cAE)

I [Wyse 2018] followed a similar approach of appending the conditional variables to the input of his model

I The 'cAE' is comparable to our proposed CVAE in that the network might potentially learn something from f0 (a baseline sketch follows below)
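A sketch of this baseline; the layer sizes are assumed to mirror the CVAE, and in practice the pitch value would be normalized before being appended:

```python
import torch
import torch.nn as nn

# 'conditional' AE: f0 is appended to the 91 CCs and the whole 92-dim vector is reconstructed
cae = nn.Sequential(
    nn.Linear(92, 91), nn.LeakyReLU(),      # 91 CCs + 1 pitch value in
    nn.Linear(91, 32), nn.LeakyReLU(),      # 32-dim bottleneck
    nn.Linear(32, 91), nn.LeakyReLU(),
    nn.Linear(91, 92))                      # reconstruct (CCs', f0')

x = torch.cat([torch.randn(8, 91), torch.rand(8, 1)], dim=-1)   # dummy batch of (CCs, f0)
loss = nn.functional.mse_loss(cae(x), x)
```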

2636

Reconstruction

[Plot: reconstruction MSE (·10^-2) vs pitch (MIDI 60-72) for AE, CVAE and cAE]

I We train the cAE on the octave endpoints and evaluate performance as shown above

I Appending f0 does not improve performance; it is in fact worse than the AE

2736

Generation

I We are interested in 'synthesizing' audio

I We see how well the network can generate an instance of a desired pitch (which the network has not been trained on)

[Block diagram: Z, f0 → Decoder]

Figure: Sampling from the network

I We follow the procedure in [Blaauw and Bonada 2016] - a random walk to sample points coherently from the latent space (sketched below)
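A sketch of such a random walk, with a stand-in decoder of the same sizes as the CVAE sketch earlier; the step size, frame count and raw-Hz conditioning value are illustrative assumptions (a trained decoder and a normalized f0 would be used in practice):

```python
import torch
import torch.nn as nn

n_frames, n_latent, step = 400, 32, 0.05
decoder = nn.Sequential(nn.Linear(n_latent + 1, 91), nn.LeakyReLU(), nn.Linear(91, 91))

# coherent random walk: small Gaussian steps so consecutive frames decode smoothly
z = torch.zeros(n_frames, n_latent)
for t in range(1, n_frames):
    z[t] = z[t - 1] + step * torch.randn(n_latent)

f0 = torch.full((n_frames, 1), 349.23)                    # target pitch in Hz (MIDI 65)
with torch.no_grad():
    cc_frames = decoder(torch.cat([z, f0], dim=-1))       # one envelope (CC) vector per frame
```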

2836

Generation

I We train on instances across the entire octave sans MIDI 65, and then generate MIDI 65

I On listening, we find that the generated audio still lacks the soft noisy sound of the violin bowing (audio examples 12, 13, 14)

I For a more realistic synthesis, we generate a violin note with vibrato (audio example 15)


2936

Putting it all together

I We explored autoencoder frameworks in generative models for audio synthesis of instrumental tones

I Our parametric representation decouples 'timbre' and 'pitch', thus relying on the network to model the inter-dependencies

I Pitch conditioning allows us to generate the learnt spectral envelope for that pitch, thus enabling us to vary the pitch contour continuously


3036

Putting it all together

I Our model is still far from perfect, however, and we have to improve it further

1 An immediate thing to do will be to model the residual of the input audio. This will help in making the generated audio sound more natural

2 We have not taken into account dynamics. A more complete system will involve capturing spectral envelope (timbral) dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis of larger pitch movements or melodic elements from Indian Classical music

I Our contributions

1 To the best of our knowledge, we have not come across any work using a parametric model for musical tones in the neural synthesis framework, especially exploiting the conditioning function of the CVAE

2 The TAE method has no known Python implementation, so we plan to make our code open-source to the MIR community for research


3136

References I

[Blaauw and Bonada 2016] Blaauw, M. and Bonada, J. (2016). Modeling and transforming speech using variational autoencoders. In Interspeech, pages 1770–1774.

[Caetano and Rodet 2012] Caetano, M. and Rodet, X. (2012). A source-filter model for musical instrument sound transformation. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 137–140. IEEE.

[Doersch 2016] Doersch, C. (2016). Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.

[Engel et al 2017] Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D., and Simonyan, K. (2017). Neural audio synthesis of musical notes with WaveNet autoencoders. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1068–1077. JMLR.org.

[Esling et al 2018] Esling, P., Bitton, A., et al. (2018). Generative timbre spaces: regularizing variational auto-encoders with perceptual metrics. arXiv preprint arXiv:1805.08501.

3236

References II

[Griffin and Lim 1984] Griffin, D. and Lim, J. (1984). Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236–243.

[Hinton and Salakhutdinov 2006] Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507.

[IMAI 1979] Imai, S. (1979). Spectral envelope extraction by improved cepstrum. IEICE, 62:217–228.

[Kingma and Ba 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

[Kingma and Welling 2013] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

[Oord et al 2016] Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.

3336

References III

[Roche et al 2018] Roche, F., Hueber, T., Limier, S., and Girin, L. (2018). Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models. arXiv preprint arXiv:1806.04096.

[Roebel and Rodet 2005] Roebel, A. and Rodet, X. (2005). Efficient spectral envelope estimation and its application to pitch shifting and envelope preservation. In International Conference on Digital Audio Effects, pages 30–35, Madrid, Spain.

[Romani Picas et al 2015] Romani Picas, O., Parra Rodriguez, H., Dabiri, D., Tokuda, H., Hariya, W., Oishi, K., and Serra, X. (2015). A real-time system for measuring sound goodness in instrumental sounds. In Audio Engineering Society Convention 138. Audio Engineering Society.

[Sarroff and Casey 2014] Sarroff, A. M. and Casey, M. A. (2014). Musical audio synthesis using autoencoding neural nets. In ICMC.

[Serra et al 1997] Serra, X. et al. (1997). Musical sound modeling with sinusoids plus noise. Musical Signal Processing, pages 91–122.

3436

References IV

[Slawson 1981] Slawson, W. (1981). The color of sound: a theoretical study in musical timbre. Music Theory Spectrum, 3:132–141.

[Sohn et al 2015] Sohn, K., Lee, H., and Yan, X. (2015). Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483–3491.

[Wyse 2018] Wyse, L. (2018). Real-valued parametric conditioning of an RNN for interactive sound synthesis. arXiv preprint arXiv:1805.10808.

3536

Audio examples description I

1 Input MIDI 63 to Spectral Model

2 Spectral Model Reconstruction (trained on MIDI 63)

3 Spectral Model Reconstruction (not trained on MIDI 63)

4 Input MIDI 60 note to Parametric Model

5 Parametric Reconstruction of input note

6 Input MIDI 63 Note

7 Parametric AE reconstruction of input

8 Parametric CVAE reconstruction of input

9 Input MIDI 65 note (endpoint-trained model)

10 Parametric AE reconstruction of input (endpoint-trained model)

11 Parametric CVAE reconstruction of input (endpoint-trained model)

12 CVAE Generated MIDI 65 Violin note

13 Similar MIDI 65 Violin note from dataset

3636

Audio examples description II

14 CVAE Reconstruction of the MIDI 65 violin note

15 CVAE Generated MIDI 65 Violin note with vibrato

Page 16: A Variational Parametric Model for Audio Synthesiskrishnasubramani/data/ddp_stage… · analysis-synthesis based reconstruction which fails to model temporal evolution I[Engel et

436

Our Nearest Neighbours

I [Sarroff and Casey 2014] first to use autoencoders to performframe-wise reconstruction of short-time magnitude spectra

X Learn perceptually relevant lower dimensionalrepresentations(lsquolatent spacersquo) of audio to help in synthesis

times lsquoGraininessrsquo in the reconstructed sound

I [Roche et al 2018] extended this analysis to try out differentautoencoder architectures

X Helped by the release of NSynth [Engel et al 2017]X Usability of the latent space in audio interpolation

I [Esling et al 2018] regularized the VAE latent space in orderto effect control over perceptual timbre of synthesizedinstruments

X Enabled them to lsquogeneratersquo and lsquointerpolatersquo audio in this space

536

Our Nearest Neighbours

I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution

I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis

X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from

I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class

X Was able to achieve better control over generation byconditioning

times Unable to generalize to untrained pitches

536

Our Nearest Neighbours

I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution

I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis

X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from

I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class

X Was able to achieve better control over generation byconditioning

times Unable to generalize to untrained pitches

536

Our Nearest Neighbours

I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution

I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis

X Generation of realistic and creative sounds

times Complex network which was difficult to train and sample from

I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class

X Was able to achieve better control over generation byconditioning

times Unable to generalize to untrained pitches

536

Our Nearest Neighbours

I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution

I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis

X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from

I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class

X Was able to achieve better control over generation byconditioning

times Unable to generalize to untrained pitches

536

Our Nearest Neighbours

I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution

I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis

X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from

I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class

X Was able to achieve better control over generation byconditioning

times Unable to generalize to untrained pitches

536

Our Nearest Neighbours

I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution

I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis

X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from

I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class

X Was able to achieve better control over generation byconditioning

times Unable to generalize to untrained pitches

536

Our Nearest Neighbours

I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution

I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis

X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from

I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class

X Was able to achieve better control over generation byconditioning

times Unable to generalize to untrained pitches

636

Why Parametric

I Rather than generating new timbres(lsquointerpolatingrsquo acrosssounds) we consider the problem of synthesis of a giveninstrument sound with flexible control over the pitch andloudness dynamics

I Pitch shifting without timbre modification uses a source-filtermodel with the filter(spectral envelope) being kept constant[Roebel and Rodet 2005]

I A powerful parametric representation over raw waveform orspectrogram has the potential to achieve high quality with lesstraining data

1 [Blaauw and Bonada 2016] recognized this in context ofspeech synthesis and used a vocoder representation to train agenerative model achieving promising results along the way

636

Why Parametric

I Rather than generating new timbres(lsquointerpolatingrsquo acrosssounds) we consider the problem of synthesis of a giveninstrument sound with flexible control over the pitch andloudness dynamics

I Pitch shifting without timbre modification uses a source-filtermodel with the filter(spectral envelope) being kept constant[Roebel and Rodet 2005]

I A powerful parametric representation over raw waveform orspectrogram has the potential to achieve high quality with lesstraining data

1 [Blaauw and Bonada 2016] recognized this in context ofspeech synthesis and used a vocoder representation to train agenerative model achieving promising results along the way

736

Dataset

I Good-sounds dataset [Romani Picas et al 2015] consisting ofindividual note and scale recordings for 12 different instruments

I We work with the lsquoViolinrsquo played in mezzo-forte loudnessand choose the 4th octave(MIDI 60-71)

0 5 10 15 20 25 30 35

606162636465666768697071

282626262626

343030

262626

Instances

MID

I

Figure Instances per note in the overall dataset

I Average note duration for the chosen octave is about 45s pernote

736

Dataset

I Good-sounds dataset [Romani Picas et al 2015] consisting ofindividual note and scale recordings for 12 different instruments

I We work with the lsquoViolinrsquo played in mezzo-forte loudnessand choose the 4th octave(MIDI 60-71)

Why we chose ViolinPopular in Indian Music Human voice-like timbre

Ability to produce continuous pitch

0 5 10 15 20 25 30 35

606162636465666768697071

282626262626

343030

262626

Instances

MID

I

Figure Instances per note in the overall dataset

I Average note duration for the chosen octave is about 45s pernote

836

Dataset

I We split the data to train(80) and test(20) instances acrossMIDI note labels

I Our model is trained with frames(duration 213ms) from thetrain instances and system performance is evaluated withframes from the test instances

936

Non-Parametric Reconstruction

I Framewise magnitude spectra reconstruction procedure in[Roche et al 2018] cannot generalize to untrained pitches

I We consider two cases - Train including and excluding MIDI 63along with its 3 neighbouring pitches on either side

Sustain

FFT

En

cod

er

Z

Dec

od

er

AE

I Repeat above across all input spectral frames and invertobtained spectrogram using Griffin-Lim [Griffin and Lim 1984]

936

Non-Parametric Reconstruction

I Framewise magnitude spectra reconstruction procedure in[Roche et al 2018] cannot generalize to untrained pitches

I We consider two cases - Train including and excluding MIDI 63along with its 3 neighbouring pitches on either side

Sustain

FFT

En

cod

er

Z

Dec

od

er

AE

I Repeat above across all input spectral frames and invertobtained spectrogram using Griffin-Lim [Griffin and Lim 1984]

936

Non-Parametric Reconstruction

I Framewise magnitude spectra reconstruction procedure in[Roche et al 2018] cannot generalize to untrained pitches

I We consider two cases - Train including and excluding MIDI 63along with its 3 neighbouring pitches on either side

Sustain

FFT

En

cod

er

Z

Dec

od

er

AE

I Repeat above across all input spectral frames and invertobtained spectrogram using Griffin-Lim [Griffin and Lim 1984]

1036

Non-Parametric Reconstruction

Figure Input MIDI 63 1 1

Figure Including MIDI 63 2 2 Figure Excluding MIDI 63 3 3

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton1)ocgs[i]state=false

1728

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton3)ocgs[i]state=false

1728

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton5)ocgs[i]state=false

1728

1136

Parametric Model

1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)

Sustain

FFTHpR

f0Magnitudes

I Output of HpR block =rArr log-dB magnitudes + harmonics

1136

Parametric Model

1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)

Sustain

FFTHpR

f0Magnitudes

I Output of HpR block =rArr log-dB magnitudes + harmonics

1136

Parametric Model

1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)

Sustain

FFTHpR

f0Magnitudes

I Output of HpR block =rArr log-dB magnitudes + harmonics

1236

Parametric Model

2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]

I TAE =rArr Iterative Cepstral Liftering

A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))

Vi = FFT (lifter(Ai ))

f0Magnitudes

TAE

CCs

f0

1236

Parametric Model

2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]

I TAE =rArr Iterative Cepstral Liftering

A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))

Vi = FFT (lifter(Ai ))

f0Magnitudes

TAE

CCs

f0

1236

Parametric Model

2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]

I TAE =rArr Iterative Cepstral Liftering

A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))

Vi = FFT (lifter(Ai ))

f0Magnitudes

TAE

CCs

f0

1336

Parametric Model

I No open source implementation available for the TAE thus weimplemented it following procedure highlighted in[Roebel and Rodet 2005 Caetano and Rodet 2012]

I Figure shows a TAE snap from [Caetano and Rodet 2012]

I Similar to results we get 1 4 2 5

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton6)ocgs[i]state=false

3216

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton7)ocgs[i]state=false

2256

1436

Parametric Model

0 1 2 3 4 5minus80

minus60

minus40

minus20

Frequency(kHz)Magnitude(dB)

606571

I Spectral envelope shape varies across pitch

1 Dependence of envelope on pitch[Slawson 1981 Caetano and Rodet 2012]

2 Variation due the TAE algorithm

I TAE rarr smooth function to estimate harmonic amplitudes

1536

Parametric Model

I No of CCs(Cepstral CoefficientsKcc) depends on Samplingrate(Fs)pitch(f0)

Kcc leFs2f0

I For our network we choose the maximum Kcc(lowest pitch) =91 and zero pad the high pitch Kcc to 91 dimension vectors

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1736

Generative Models

Why VAE over AE

Continuous latent space from which we can sample points(andsynthesize the corresponding audio)

Why CVAE over VAE

Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control

1736

Generative Models

Why VAE over AE

Continuous latent space from which we can sample points(andsynthesize the corresponding audio)

Why CVAE over VAE

Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control

1736

Generative Models

Why VAE over AE

Continuous latent space from which we can sample points(andsynthesize the corresponding audio)

Why CVAE over VAE

Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control

1736

Generative Models

Why VAE over AE

Continuous latent space from which we can sample points(andsynthesize the corresponding audio)

Why CVAE over VAE

Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control

1836

Network Architecture

CCs

f0

En

cod

er

Z

Dec

od

er

CVAE

I Network input is CCs rarr MSE represents perceptually relevantdistance in terms of squared error between the input andreconstructed log magnitude spectral envelopes

I We train the network on frames from the train instances

I For evaluation MSE values calculated ahead is the averagereconstruction error across all the test instance frames

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

L prop MSE + βKLD

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

2036

Network Architecture

I Main hyperparameters -

2 Dimensionality of latent space - networks reconstruction ability

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

CVAEAE

Figure CVAE(β = 01) vs AE

I Steep fall initially flatter later Choose dimensionality = 32

2036

Network Architecture

I Main hyperparameters -

2 Dimensionality of latent space - networks reconstruction ability

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

CVAEAE

Figure CVAE(β = 01) vs AE

I Steep fall initially flatter later Choose dimensionality = 32

2036

Network Architecture

I Main hyperparameters -

2 Dimensionality of latent space - networks reconstruction ability

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

CVAEAE

Figure CVAE(β = 01) vs AE

I Steep fall initially flatter later Choose dimensionality = 32

2136

Network Architecture

I Network size [91 91 32 91 91]I 91 is the dimension of input CCs 32 is latent space

dimensionality

I Linear Fully Connected Layers + Leaky ReLU activations

I ADAM [Kingma and Ba 2014] with initial lr = 10minus3

I Training for 2000 epochs with batch size 512

2236

Experiments

I Two kinds of experiments to demonstrate networks capabilities

1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch

2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches

2236

Experiments

I Two kinds of experiments to demonstrate networks capabilities

1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch

2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches

2236

Experiments

I Two kinds of experiments to demonstrate networks capabilities

1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch

2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2436

Reconstruction

60 62 64 66 68 70 72

1

2

3

4

5middot10minus2

Pitch

MSE

AECVAE

I Both AE and CVAE reasonably reconstruct the target pitch1 6 2 7 3 8

I CVAE produces better reconstruction especially when thetarget pitch is far from the pitches available in the trainingdata(plot above) 4 9 5 10 6 11

I Conditioning helps to capture the pitch dependency of thespectral envelope more accurately

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton8)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton9)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton10)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton11)ocgs[i]state=false

1536

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton12)ocgs[i]state=false

1536

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton13)ocgs[i]state=false

1536

2536

Reconstruction

I To emulate the effect of pitch conditioning with an AE wetrain the AE by appending the pitch to the input CCs andreconstructing this appended input as shown below

f0

CCscAE

fprime

0

CCsprime

Figure lsquoConditionalrsquo AE(cAE)

I [Wyse 2018] followed a similar approach of appending theconditional variables to the input of his model

I the lsquocAErsquo is comparable to our proposed CVAE in that thenetwork might potentially learn something from f0

2636

Reconstruction

60 62 64 66 68 70 72

1

2

3

4

5middot10minus2

Pitch

MSE

AECVAEcAE

I We train the cAE on the octave endpoints and evaluateperformance as shown above

I Appending f0 does not improve performance It is infact worsethan AE

2736

Generation

I We are interested in lsquosynthesizingrsquo audio

I We see how well the network can generate an instance of adesired pitch(which the network has not been trained on)

Z

Dec

od

er

f0

Figure Sampling from the Network

I We follow the procedure in [Blaauw and Bonada 2016] - arandom walk to sample points coherently from the latent space

2836

Generation

I We train on instances across the entire octave sans MIDI 65and then generate MIDI 65

I On listening we can see that the generated audio still lacks thesoft noisy sound of the violin bowing

1 12 2 13 3 14

I For a more realistic synthesis we generate a violin note withvibrato 4 15

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton14)ocgs[i]state=false

2184

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton15)ocgs[i]state=false

2256

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton16)ocgs[i]state=false

2184

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton17)ocgs[i]state=false

2184

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research


References

[Blaauw and Bonada, 2016] Blaauw, M. and Bonada, J. (2016). Modeling and transforming speech using variational autoencoders. In Interspeech, pages 1770–1774.

[Caetano and Rodet, 2012] Caetano, M. and Rodet, X. (2012). A source-filter model for musical instrument sound transformation. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 137–140. IEEE.

[Doersch, 2016] Doersch, C. (2016). Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.

[Engel et al., 2017] Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D., and Simonyan, K. (2017). Neural audio synthesis of musical notes with WaveNet autoencoders. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1068–1077. JMLR.org.

[Esling et al., 2018] Esling, P., Bitton, A., et al. (2018). Generative timbre spaces: regularizing variational auto-encoders with perceptual metrics. arXiv preprint arXiv:1805.08501.

[Griffin and Lim, 1984] Griffin, D. and Lim, J. (1984). Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236–243.

[Hinton and Salakhutdinov, 2006] Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507.

[Imai, 1979] Imai, S. (1979). Spectral envelope extraction by improved cepstrum. IEICE, 62:217–228.

[Kingma and Ba, 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

[Kingma and Welling, 2013] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

[Oord et al., 2016] Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.

[Roche et al., 2018] Roche, F., Hueber, T., Limier, S., and Girin, L. (2018). Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models. arXiv preprint arXiv:1806.04096.

[Roebel and Rodet, 2005] Roebel, A. and Rodet, X. (2005). Efficient spectral envelope estimation and its application to pitch shifting and envelope preservation. In International Conference on Digital Audio Effects, pages 30–35, Madrid, Spain.

[Romani Picas et al., 2015] Romani Picas, O., Parra Rodriguez, H., Dabiri, D., Tokuda, H., Hariya, W., Oishi, K., and Serra, X. (2015). A real-time system for measuring sound goodness in instrumental sounds. In Audio Engineering Society Convention 138. Audio Engineering Society.

[Sarroff and Casey, 2014] Sarroff, A. M. and Casey, M. A. (2014). Musical audio synthesis using autoencoding neural nets. In ICMC.

[Serra et al., 1997] Serra, X. et al. (1997). Musical sound modeling with sinusoids plus noise. Musical Signal Processing, pages 91–122.

[Slawson, 1981] Slawson, W. (1981). The color of sound: a theoretical study in musical timbre. Music Theory Spectrum, 3:132–141.

[Sohn et al., 2015] Sohn, K., Lee, H., and Yan, X. (2015). Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483–3491.

[Wyse, 2018] Wyse, L. (2018). Real-valued parametric conditioning of an RNN for interactive sound synthesis. arXiv preprint arXiv:1805.10808.

Audio examples description

1 Input MIDI 63 to Spectral Model
2 Spectral Model reconstruction (trained on MIDI 63)
3 Spectral Model reconstruction (not trained on MIDI 63)
4 Input MIDI 60 note to Parametric Model
5 Parametric reconstruction of input note
6 Input MIDI 63 note
7 Parametric AE reconstruction of input
8 Parametric CVAE reconstruction of input
9 Input MIDI 65 note (endpoint-trained model)
10 Parametric AE reconstruction of input (endpoint-trained model)
11 Parametric CVAE reconstruction of input (endpoint-trained model)
12 CVAE-generated MIDI 65 violin note
13 Similar MIDI 65 violin note from dataset
14 CVAE reconstruction of the MIDI 65 violin note
15 CVAE-generated MIDI 65 violin note with vibrato

Page 18: A Variational Parametric Model for Audio Synthesiskrishnasubramani/data/ddp_stage… · analysis-synthesis based reconstruction which fails to model temporal evolution I[Engel et

536

Our Nearest Neighbours

I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution

I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis

X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from

I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class

X Was able to achieve better control over generation byconditioning

times Unable to generalize to untrained pitches

536

Our Nearest Neighbours

I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution

I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis

X Generation of realistic and creative sounds

times Complex network which was difficult to train and sample from

I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class

X Was able to achieve better control over generation byconditioning

times Unable to generalize to untrained pitches

536

Our Nearest Neighbours

I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution

I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis

X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from

I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class

X Was able to achieve better control over generation byconditioning

times Unable to generalize to untrained pitches

536

Our Nearest Neighbours

I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution

I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis

X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from

I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class

X Was able to achieve better control over generation byconditioning

times Unable to generalize to untrained pitches

536

Our Nearest Neighbours

I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution

I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis

X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from

I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class

X Was able to achieve better control over generation byconditioning

times Unable to generalize to untrained pitches

536

Our Nearest Neighbours

I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution

I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis

X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from

I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class

X Was able to achieve better control over generation byconditioning

times Unable to generalize to untrained pitches

636

Why Parametric

I Rather than generating new timbres(lsquointerpolatingrsquo acrosssounds) we consider the problem of synthesis of a giveninstrument sound with flexible control over the pitch andloudness dynamics

I Pitch shifting without timbre modification uses a source-filtermodel with the filter(spectral envelope) being kept constant[Roebel and Rodet 2005]

I A powerful parametric representation over raw waveform orspectrogram has the potential to achieve high quality with lesstraining data

1 [Blaauw and Bonada 2016] recognized this in context ofspeech synthesis and used a vocoder representation to train agenerative model achieving promising results along the way

636

Why Parametric

I Rather than generating new timbres(lsquointerpolatingrsquo acrosssounds) we consider the problem of synthesis of a giveninstrument sound with flexible control over the pitch andloudness dynamics

I Pitch shifting without timbre modification uses a source-filtermodel with the filter(spectral envelope) being kept constant[Roebel and Rodet 2005]

I A powerful parametric representation over raw waveform orspectrogram has the potential to achieve high quality with lesstraining data

1 [Blaauw and Bonada 2016] recognized this in context ofspeech synthesis and used a vocoder representation to train agenerative model achieving promising results along the way

736

Dataset

I Good-sounds dataset [Romani Picas et al 2015] consisting ofindividual note and scale recordings for 12 different instruments

I We work with the lsquoViolinrsquo played in mezzo-forte loudnessand choose the 4th octave(MIDI 60-71)

0 5 10 15 20 25 30 35

606162636465666768697071

282626262626

343030

262626

Instances

MID

I

Figure Instances per note in the overall dataset

I Average note duration for the chosen octave is about 45s pernote

736

Dataset

I Good-sounds dataset [Romani Picas et al 2015] consisting ofindividual note and scale recordings for 12 different instruments

I We work with the lsquoViolinrsquo played in mezzo-forte loudnessand choose the 4th octave(MIDI 60-71)

Why we chose ViolinPopular in Indian Music Human voice-like timbre

Ability to produce continuous pitch

0 5 10 15 20 25 30 35

606162636465666768697071

282626262626

343030

262626

Instances

MID

I

Figure Instances per note in the overall dataset

I Average note duration for the chosen octave is about 45s pernote

836

Dataset

I We split the data to train(80) and test(20) instances acrossMIDI note labels

I Our model is trained with frames(duration 213ms) from thetrain instances and system performance is evaluated withframes from the test instances

936

Non-Parametric Reconstruction

I Framewise magnitude spectra reconstruction procedure in[Roche et al 2018] cannot generalize to untrained pitches

I We consider two cases - Train including and excluding MIDI 63along with its 3 neighbouring pitches on either side

Sustain

FFT

En

cod

er

Z

Dec

od

er

AE

I Repeat above across all input spectral frames and invertobtained spectrogram using Griffin-Lim [Griffin and Lim 1984]

936

Non-Parametric Reconstruction

I Framewise magnitude spectra reconstruction procedure in[Roche et al 2018] cannot generalize to untrained pitches

I We consider two cases - Train including and excluding MIDI 63along with its 3 neighbouring pitches on either side

Sustain

FFT

En

cod

er

Z

Dec

od

er

AE

I Repeat above across all input spectral frames and invertobtained spectrogram using Griffin-Lim [Griffin and Lim 1984]

936

Non-Parametric Reconstruction

I Framewise magnitude spectra reconstruction procedure in[Roche et al 2018] cannot generalize to untrained pitches

I We consider two cases - Train including and excluding MIDI 63along with its 3 neighbouring pitches on either side

Sustain

FFT

En

cod

er

Z

Dec

od

er

AE

I Repeat above across all input spectral frames and invertobtained spectrogram using Griffin-Lim [Griffin and Lim 1984]

1036

Non-Parametric Reconstruction

Figure Input MIDI 63 1 1

Figure Including MIDI 63 2 2 Figure Excluding MIDI 63 3 3

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton1)ocgs[i]state=false

1728

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton3)ocgs[i]state=false

1728

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton5)ocgs[i]state=false

1728

1136

Parametric Model

1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)

Sustain

FFTHpR

f0Magnitudes

I Output of HpR block =rArr log-dB magnitudes + harmonics

1136

Parametric Model

1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)

Sustain

FFTHpR

f0Magnitudes

I Output of HpR block =rArr log-dB magnitudes + harmonics

1136

Parametric Model

1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)

Sustain

FFTHpR

f0Magnitudes

I Output of HpR block =rArr log-dB magnitudes + harmonics

1236

Parametric Model

2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]

I TAE =rArr Iterative Cepstral Liftering

A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))

Vi = FFT (lifter(Ai ))

[Diagram: (f0, harmonic magnitudes) → TAE → (CCs, f0)]


Parametric Model

I No open-source implementation is available for the TAE, thus we implemented it following the procedure highlighted in [Roebel and Rodet 2005, Caetano and Rodet 2012]

I Figure shows a TAE snapshot from [Caetano and Rodet 2012]

I Similar to the results we get (audio examples 4 and 5)


Parametric Model

[Plot: TAE magnitude (dB) vs frequency (kHz) for MIDI 60, 65 and 71]

I Spectral envelope shape varies across pitch

1 Dependence of envelope on pitch [Slawson 1981, Caetano and Rodet 2012]

2 Variation due to the TAE algorithm

I TAE → smooth function to estimate harmonic amplitudes


Parametric Model

I Number of CCs (Cepstral Coefficients, Kcc) depends on the sampling rate (Fs) and pitch (f0)

Kcc ≤ Fs / (2 · f0)

I For our network we choose the maximum Kcc (lowest pitch) = 91, and zero-pad the higher-pitch CCs to 91-dimensional vectors
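For concreteness, a small sketch of this bound and the zero-padding, assuming Fs = 48 kHz (which yields Kcc = 91 at MIDI 60; the actual sampling rate is not stated on this slide):

```python
import numpy as np

fs = 48000                          # sampling rate (assumption)
f0_low, f0_high = 261.63, 493.88    # MIDI 60 and MIDI 71 in Hz

k_max = int(fs / (2 * f0_low))      # Kcc <= Fs / (2*f0) -> 91 at the lowest pitch
k_high = int(fs / (2 * f0_high))    # only 48 coefficients allowed at MIDI 71

ccs = np.random.randn(k_high)                   # stand-in for a TAE cepstral frame
ccs_padded = np.pad(ccs, (0, k_max - k_high))   # zero-pad to the fixed 91-dim input
```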


Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - minimize the MSE (Mean Squared Error) between input and network reconstruction

• Simple to train and performs good reconstruction (since they minimize MSE)

• Not truly a generative model, as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] - inspired by Variational Inference, enforce a prior on the latent space

• Truly generative, as we can 'generate' new data by sampling from the prior

I Conditional Variational Autoencoders [Doersch 2016, Sohn et al 2015] - same principle as a VAE, however it learns the conditional distribution over an additional conditioning variable


Generative Models

Why VAE over AE

Continuous latent space from which we can sample points (and synthesize the corresponding audio)

Why CVAE over VAE

Conditioning on pitch ⇒ network captures dependencies between the timbre and the pitch ⇒ more accurate envelope generation + pitch control


Network Architecture

[Diagram: (CCs, f0) → Encoder → Z → Decoder (conditioned on f0) → CCs (CVAE)]

I Network input is CCs → MSE represents a perceptually relevant distance in terms of squared error between the input and reconstructed log-magnitude spectral envelopes

I We train the network on frames from the train instances

I For evaluation, the MSE values reported ahead are the average reconstruction error across all the test instance frames


Network Architecture

I Main hyperparameters -

1 β - controls the relative weighting between reconstruction and prior enforcement:

L ∝ MSE + β · KLD

[Plot: MSE (·10^-5) vs latent space dimensionality, for β = 0.01, 0.1 and 1]

Figure: CVAE, varying β

I Tradeoff between both terms; we choose β = 0.1
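A minimal sketch of the β-weighted objective L ∝ MSE + β·KLD, assuming a Gaussian posterior with parameters (mu, log_var) and a standard-normal prior; written in PyTorch for illustration:

```python
import torch
import torch.nn.functional as F

def cvae_loss(x_hat, x, mu, log_var, beta=0.1):
    """Reconstruction MSE plus beta-weighted KL divergence between the
    approximate posterior N(mu, sigma^2) and the N(0, I) prior."""
    mse = F.mse_loss(x_hat, x, reduction="mean")
    kld = -0.5 * torch.mean(1 + log_var - mu.pow(2) - log_var.exp())
    return mse + beta * kld
```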


Network Architecture

I Main hyperparameters -

2 Dimensionality of latent space - determines the network's reconstruction ability

[Plot: MSE (·10^-5) vs latent space dimensionality, CVAE vs AE]

Figure: CVAE (β = 0.1) vs AE

I Steep fall initially, flatter later; we choose dimensionality = 32


Network Architecture

I Network size: [91, 91, 32, 91, 91]
I 91 is the dimension of the input CCs, 32 is the latent space dimensionality

I Linear Fully Connected Layers + Leaky ReLU activations

I ADAM [Kingma and Ba 2014] with initial lr = 10^-3

I Training for 2000 epochs with batch size 512
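A sketch of such a network in PyTorch: the layer sizes follow the [91, 91, 32, 91, 91] description above, while the exact conditioning scheme (f0 concatenated to both the encoder input and the latent code) is an assumption:

```python
import torch
import torch.nn as nn

class CVAE(nn.Module):
    """Sketch of a [91, 91, 32, 91, 91] CVAE with pitch conditioning."""
    def __init__(self, cc_dim=91, latent_dim=32, cond_dim=1):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(cc_dim + cond_dim, 91), nn.LeakyReLU())
        self.mu = nn.Linear(91, latent_dim)
        self.log_var = nn.Linear(91, latent_dim)
        self.dec = nn.Sequential(
            nn.Linear(latent_dim + cond_dim, 91), nn.LeakyReLU(),
            nn.Linear(91, cc_dim),
        )

    def forward(self, ccs, f0):
        h = self.enc(torch.cat([ccs, f0], dim=-1))
        mu, log_var = self.mu(h), self.log_var(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * log_var)   # reparameterization
        x_hat = self.dec(torch.cat([z, f0], dim=-1))
        return x_hat, mu, log_var

model = CVAE()
optim = torch.optim.Adam(model.parameters(), lr=1e-3)   # ADAM, initial lr = 10^-3
```

Training would then loop over frame batches (batch size 512, 2000 epochs) with an objective of the form MSE + β·KLD as sketched earlier.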


Experiments

I Two kinds of experiments to demonstrate the network's capabilities

1 Reconstruction - omit pitch instances during training and see how well the model reconstructs notes of the omitted target pitch

2 Generation - how well the model 'synthesizes' note instances with new, unseen pitches


Reconstruction

I Two training contexts -

1 T (×) is the target, X marks training instances:
   MIDI:  T-3  T-2  T-1   T   T+1  T+2  T+3
   Kept:   X    X    X    ×    X    X    X

2 Octave endpoints:
   MIDI:  60  61  62  63  64  65  66  67  68  69  70  71
   Kept:   X   ×   ×   ×   ×   ×   ×   ×   ×   ×   ×   X

I In each of the above cases we compute the MSE as the frame-wise spectral envelope match across all frames of all the target instances


Reconstruction

[Plot: MSE (·10^-2) vs target pitch (MIDI 60-72), AE vs CVAE]

I Both AE and CVAE reasonably reconstruct the target pitch (audio examples 6-8)

I CVAE produces better reconstruction, especially when the target pitch is far from the pitches available in the training data (plot above; audio examples 9-11)

I Conditioning helps to capture the pitch dependency of the spectral envelope more accurately


Reconstruction

I To emulate the effect of pitch conditioning with an AE, we train the AE by appending the pitch to the input CCs and reconstructing this appended input, as shown below

[Diagram: (CCs, f0) → cAE → (CCs', f0')]

Figure: 'Conditional' AE (cAE)

I [Wyse 2018] followed a similar approach of appending the conditional variables to the input of his model

I The 'cAE' is comparable to our proposed CVAE in that the network might potentially learn something from f0


Reconstruction

[Plot: MSE (·10^-2) vs target pitch (MIDI 60-72), AE vs CVAE vs cAE]

I We train the cAE on the octave endpoints and evaluate performance as shown above

I Appending f0 does not improve performance; it is in fact worse than the AE


Generation

I We are interested in 'synthesizing' audio

I We see how well the network can generate an instance of a desired pitch (which the network has not been trained on)

[Diagram: Z → Decoder (conditioned on f0)]

Figure: Sampling from the network

I We follow the procedure in [Blaauw and Bonada 2016] - a random walk to sample points coherently from the latent space
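A sketch of this sampling idea, reusing the CVAE sketch from earlier; the step size, frame count and conditioning format are illustrative assumptions rather than the settings of [Blaauw and Bonada 2016]:

```python
import torch

@torch.no_grad()
def generate_frames(model, f0_value, n_frames=200, step=0.05, latent_dim=32):
    """Sample a coherent latent trajectory with a small-step random walk
    and decode each point conditioned on the desired pitch."""
    z = torch.randn(latent_dim)
    f0 = torch.tensor([float(f0_value)])
    frames = []
    for _ in range(n_frames):
        z = z + step * torch.randn(latent_dim)     # random-walk step in latent space
        ccs = model.dec(torch.cat([z, f0]))        # decode, conditioned on f0
        frames.append(ccs)
    return torch.stack(frames)                     # frame-wise envelope CCs
```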


Generation

I We train on instances across the entire octave sans MIDI 65, and then generate MIDI 65

I On listening, we find that the generated audio still lacks the soft noisy sound of the violin bowing (audio examples 12-14)

I For a more realistic synthesis, we generate a violin note with vibrato (audio example 15)
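To illustrate how such a vibrato note can be rendered from envelope-derived harmonic amplitudes, here is a small additive-synthesis sketch (harmonics of a slowly modulated f0, MIDI 65 ≈ 349 Hz); all parameter values are assumptions, not the settings used for the example above:

```python
import numpy as np

def synth_vibrato(harm_amps_db, f0=349.23, fs=48000, dur=2.0,
                  vib_rate=5.5, vib_depth=0.01):
    """Additive synthesis of a note with vibrato: harmonics of a slowly
    modulated f0, weighted by (fixed) envelope-derived amplitudes in dB."""
    t = np.arange(int(dur * fs)) / fs
    f0_t = f0 * (1 + vib_depth * np.sin(2 * np.pi * vib_rate * t))   # vibrato f0 contour
    phase = 2 * np.pi * np.cumsum(f0_t) / fs                         # instantaneous phase
    amps = 10 ** (np.asarray(harm_amps_db) / 20)
    y = sum(a * np.sin(k * phase) for k, a in enumerate(amps, start=1))
    return y / np.max(np.abs(y))

# Example with a handful of made-up harmonic amplitudes (dB)
y = synth_vibrato([-20, -26, -30, -35, -40])
```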


Putting it all together

I We explored autoencoder frameworks in generative models for audio synthesis of instrumental tones

I Our parametric representation decouples 'timbre' and 'pitch', thus relying on the network to model the inter-dependencies

I Pitch conditioning allows us to generate the learnt spectral envelope for that pitch, thus enabling us to vary the pitch contour continuously


Putting it all together

I Our model is still far from perfect, however, and we have to improve it further:

1 An immediate next step is to model the residual of the input audio; this will help make the generated audio sound more natural

2 We have not taken loudness dynamics into account; a more complete system will involve capturing spectral envelope (timbral) dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis of larger pitch movements or melodic elements from Indian Classical music

I Our contributions

1 To the best of our knowledge, we have not come across any work using a parametric model for musical tones in the neural synthesis framework, especially exploiting the conditioning function of the CVAE

2 The TAE method has no known Python implementation, so we plan to make our code open-source for the MIR community for research


References I

[Blaauw and Bonada 2016] Blaauw, M. and Bonada, J. (2016). Modeling and transforming speech using variational autoencoders. In Interspeech, pages 1770-1774.

[Caetano and Rodet 2012] Caetano, M. and Rodet, X. (2012). A source-filter model for musical instrument sound transformation. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 137-140. IEEE.

[Doersch 2016] Doersch, C. (2016). Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.

[Engel et al 2017] Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D., and Simonyan, K. (2017). Neural audio synthesis of musical notes with WaveNet autoencoders. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1068-1077. JMLR.org.

[Esling et al 2018] Esling, P., Bitton, A., et al. (2018). Generative timbre spaces: regularizing variational auto-encoders with perceptual metrics. arXiv preprint arXiv:1805.08501.


References II

[Griffin and Lim 1984] Griffin, D. and Lim, J. (1984). Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236-243.

[Hinton and Salakhutdinov 2006] Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504-507.

[IMAI 1979] Imai, S. (1979). Spectral envelope extraction by improved cepstrum. IEICE, 62:217-228.

[Kingma and Ba 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

[Kingma and Welling 2013] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

[Oord et al 2016] Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.


References III

[Roche et al 2018] Roche, F., Hueber, T., Limier, S., and Girin, L. (2018). Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models. arXiv preprint arXiv:1806.04096.

[Roebel and Rodet 2005] Roebel, A. and Rodet, X. (2005). Efficient spectral envelope estimation and its application to pitch shifting and envelope preservation. In International Conference on Digital Audio Effects, pages 30-35, Madrid, Spain.

[Romani Picas et al 2015] Romani Picas, O., Parra Rodriguez, H., Dabiri, D., Tokuda, H., Hariya, W., Oishi, K., and Serra, X. (2015). A real-time system for measuring sound goodness in instrumental sounds. In Audio Engineering Society Convention 138. Audio Engineering Society.

[Sarroff and Casey 2014] Sarroff, A. M. and Casey, M. A. (2014). Musical audio synthesis using autoencoding neural nets. In ICMC.

[Serra et al 1997] Serra, X. et al. (1997). Musical sound modeling with sinusoids plus noise. Musical Signal Processing, pages 91-122.


References IV

[Slawson 1981] Slawson, W. (1981). The color of sound: a theoretical study in musical timbre. Music Theory Spectrum, 3:132-141.

[Sohn et al 2015] Sohn, K., Lee, H., and Yan, X. (2015). Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483-3491.

[Wyse 2018] Wyse, L. (2018). Real-valued parametric conditioning of an RNN for interactive sound synthesis. arXiv preprint arXiv:1805.10808.


Audio examples description I

1 Input MIDI 63 to Spectral Model
2 Spectral Model reconstruction (trained on MIDI 63)
3 Spectral Model reconstruction (not trained on MIDI 63)
4 Input MIDI 60 note to Parametric Model
5 Parametric reconstruction of input note
6 Input MIDI 63 note
7 Parametric AE reconstruction of input
8 Parametric CVAE reconstruction of input
9 Input MIDI 65 note (endpoint-trained model)
10 Parametric AE reconstruction of input (endpoint-trained model)
11 Parametric CVAE reconstruction of input (endpoint-trained model)
12 CVAE generated MIDI 65 violin note
13 Similar MIDI 65 violin note from dataset


Audio examples description II

14 CVAE Reconstruction of the MIDI 65 violin note

15 CVAE Generated MIDI 65 Violin note with vibrato



3236

References II

[Griffin and Lim 1984] Griffin D and Lim J (1984)Signal estimation from modified short-time fourier transformIEEE Transactions on Acoustics Speech and Signal Processing 32(2)236ndash243

[Hinton and Salakhutdinov 2006] Hinton G E and Salakhutdinov R R (2006)Reducing the dimensionality of data with neural networksscience 313(5786)504ndash507

[IMAI 1979] IMAI S (1979)Spectral envelope extraction by improved cepstrumIEICE 62217ndash228

[Kingma and Ba 2014] Kingma D P and Ba J (2014)Adam A method for stochastic optimizationarXiv preprint arXiv14126980

[Kingma and Welling 2013] Kingma D P and Welling M (2013)Auto-encoding variational bayesarXiv preprint arXiv13126114

[Oord et al 2016] Oord A v d Dieleman S Zen H Simonyan K Vinyals OGraves A Kalchbrenner N Senior A and Kavukcuoglu K (2016)Wavenet A generative model for raw audioarXiv preprint arXiv160903499

3336

References III

[Roche et al 2018] Roche F Hueber T Limier S and Girin L (2018)Autoencoders for music sound modeling a comparison of linear shallow deeprecurrent and variational modelsarXiv preprint arXiv180604096

[Roebel and Rodet 2005] Roebel A and Rodet X (2005)Efficient Spectral Envelope Estimation and its application to pitch shifting andenvelope preservationIn International Conference on Digital Audio Effects pages 30ndash35 Madrid Spaincote interne IRCAM Roebel05b

[Romani Picas et al 2015] Romani Picas O Parra Rodriguez H Dabiri DTokuda H Hariya W Oishi K and Serra X (2015)A real-time system for measuring sound goodness in instrumental soundsIn Audio Engineering Society Convention 138 Audio Engineering Society

[Sarroff and Casey 2014] Sarroff A M and Casey M A (2014)Musical audio synthesis using autoencoding neural netsIn ICMC

[Serra et al 1997] Serra X et al (1997)Musical sound modeling with sinusoids plus noiseMusical signal processing pages 91ndash122

3436

References IV

[Slawson 1981] Slawson W (1981)The color of sound a theoretical study in musical timbreMusic Theory Spectrum 3132ndash141

[Sohn et al 2015] Sohn K Lee H and Yan X (2015)Learning structured output representation using deep conditional generative models

In Advances in neural information processing systems pages 3483ndash3491

[Wyse 2018] Wyse L (2018)Real-valued parametric conditioning of an rnn for interactive sound synthesisarXiv preprint arXiv180510808

3536

Audio examples description I

1 Input MIDI 63 to Spectral Model

2 Spectral Model Reconstruction(trained on MIDI63)

3 Spectral Model Reconstruction(not trained on MIDI63)

4 Input MIDI 60 note to Parametric Model

5 Parametric Reconstruction of input note

6 Input MIDI 63 Note

7 Parametric AE reconstruction of input

8 Parametric CVAE reconstruction of input

9 Input MIDI 65 note(endpoint trained model)

10 Parametric AE reconstruction of input(endpoint trained model)

11 Parametric CVAE reconstruction of input(endpoint trained model)

12 CVAE Generated MIDI 65 Violin note

13 Similar MIDI 65 Violin note from dataset

3636

Audio examples description II

14 CVAE Reconstruction of the MIDI 65 violin note

15 CVAE Generated MIDI 65 Violin note with vibrato

  • Notes
      1. fdrm17
      2. fdrm16
      3. fdrm15
      4. fdrm14
      5. fdrm13
      6. fdrm12
      7. fdrm11
      8. fdrm10
      9. fdrm9
      10. fdrm8
      11. fdrm7
      12. fdrm6
      13. fdrm5
      14. fdrm3
      15. fdrm1
Page 20: A Variational Parametric Model for Audio Synthesiskrishnasubramani/data/ddp_stage… · analysis-synthesis based reconstruction which fails to model temporal evolution I[Engel et

536

Our Nearest Neighbours

I Major issue with previous methods was frame-wise analysis-synthesis based reconstruction, which fails to model temporal evolution

I [Engel et al 2017], inspired by Wavenet's [Oord et al 2016] autoregressive modeling capabilities for speech, extended it to musical instrument synthesis

X Generation of realistic and creative sounds
× Complex network which was difficult to train and sample from

I [Wyse 2018] also autoregressively modelled the audio, albeit by conditioning the waveform samples on additional parameters like pitch, velocity (loudness) and instrument class

X Was able to achieve better control over generation by conditioning

× Unable to generalize to untrained pitches


Why Parametric?

I Rather than generating new timbres ('interpolating' across sounds), we consider the problem of synthesis of a given instrument sound with flexible control over the pitch and loudness dynamics

I Pitch shifting without timbre modification uses a source-filter model, with the filter (spectral envelope) being kept constant [Roebel and Rodet 2005]

I A powerful parametric representation over raw waveform or spectrogram has the potential to achieve high quality with less training data

1 [Blaauw and Bonada 2016] recognized this in the context of speech synthesis and used a vocoder representation to train a generative model, achieving promising results along the way



Dataset

I Good-sounds dataset [Romani Picas et al 2015], consisting of individual note and scale recordings for 12 different instruments

I We work with the 'Violin' played in mezzo-forte loudness, and choose the 4th octave (MIDI 60-71)

Why we chose Violin: popular in Indian Music, human voice-like timbre, ability to produce continuous pitch

Figure: Instances per note in the overall dataset (x-axis: instances, y-axis: MIDI note 60-71; roughly 26-34 instances per note)

I Average note duration for the chosen octave is about 4.5 s per note


Dataset

I We split the data into train (80%) and test (20%) instances across MIDI note labels

I Our model is trained with frames (duration 21.3 ms) from the train instances, and system performance is evaluated with frames from the test instances
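
As a rough illustration of this preparation, a minimal sketch is given below; the 48 kHz sampling rate, hop size and random seed are assumptions for illustration, not values taken from the slides.

    import numpy as np

    def split_instances(instances, train_frac=0.8, seed=0):
        # 80/20 split of the note instances for one MIDI label
        idx = np.random.default_rng(seed).permutation(len(instances))
        k = int(train_frac * len(instances))
        return [instances[i] for i in idx[:k]], [instances[i] for i in idx[k:]]

    def frame_signal(y, fs=48000, frame_ms=21.3, hop_ms=10.0):
        # cut a note recording into ~21.3 ms frames for training/evaluation
        frame, hop = int(fs * frame_ms / 1000), int(fs * hop_ms / 1000)
        n = 1 + max(0, (len(y) - frame) // hop)
        return np.stack([y[i * hop:i * hop + frame] for i in range(n)])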


Non-Parametric Reconstruction

I Framewise magnitude spectra reconstruction procedure in [Roche et al 2018] cannot generalize to untrained pitches

I We consider two cases - Train including and excluding MIDI 63, along with its 3 neighbouring pitches on either side

[Block diagram: Sustain portion → FFT → Encoder → Z → Decoder (AE)]

I Repeat above across all input spectral frames and invert the obtained spectrogram using Griffin-Lim [Griffin and Lim 1984]
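
As a rough sketch of this inversion step, using librosa's Griffin-Lim (the FFT size, hop and iteration count here are assumptions for illustration):

    import numpy as np
    import librosa

    def invert_magnitudes(mag_frames, n_fft=1024, hop=256):
        # mag_frames: (n_frames, 1 + n_fft // 2) magnitude spectra from the autoencoder
        S = np.asarray(mag_frames).T          # librosa expects (freq_bins, n_frames)
        return librosa.griffinlim(S, n_iter=60, hop_length=hop, win_length=n_fft)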


Non-Parametric Reconstruction

Figure: Input MIDI 63 (audio example 1)

Figure: Including MIDI 63 in training (audio example 2)   Figure: Excluding MIDI 63 from training (audio example 3)


Parametric Model

1 Frame-wise magnitude spectrum → harmonic representation using the Harmonic plus Residual (HpR) model [Serra et al 1997] (currently we neglect the residual)

[Block diagram: Sustain portion → FFT → HpR → f0, harmonic Magnitudes]

I Output of HpR block ⇒ log-dB magnitudes + harmonics
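
A minimal sketch of the harmonic part of this analysis (not the actual HpR implementation of [Serra et al 1997]; it assumes the frame's f0 is already known and simply reads the magnitude spectrum at the harmonic frequencies, with an illustrative 48 kHz sampling rate):

    import numpy as np

    def harmonic_log_magnitudes(frame, f0, fs=48000, n_fft=2048):
        # windowed magnitude spectrum of one frame
        spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame)), n_fft))
        harmonics = np.arange(1, int((fs / 2) // f0) + 1) * f0   # f0, 2*f0, ... below Nyquist
        bins = np.round(harmonics * n_fft / fs).astype(int)      # nearest-bin lookup
        return harmonics, 20 * np.log10(spec[bins] + 1e-12)      # log-dB magnitudes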


Parametric Model

2 log-dB magnitudes + harmonics → TAE algorithm [Roebel and Rodet 2005, IMAI 1979]

I TAE ⇒ Iterative Cepstral Liftering (a Python sketch follows below):

A_0(k) = log(|X(k)|), V_0 = −∞
while (A_i − A_0 < ∆):
    A_i(k) = max(A_{i−1}(k), V_{i−1}(k))
    V_i = FFT(lifter(A_i))

[Block diagram: f0, Magnitudes → TAE → CCs, f0]
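
A condensed sketch of the liftering loop is shown below; the stopping test (envelope within delta of the updated spectrum) and the default threshold are our assumptions about a reasonable convergence criterion, not a verbatim reproduction of [Roebel and Rodet 2005].

    import numpy as np

    def true_amplitude_envelope(log_mag, kcc, delta=0.1, max_iter=100):
        # log_mag: log magnitude spectrum of one frame on the full FFT grid (length N)
        A = np.asarray(log_mag, dtype=float)          # A_0(k) = log|X(k)|
        V = np.full_like(A, -np.inf)                  # V_0 = -inf
        for _ in range(max_iter):
            A = np.maximum(A, V)                      # A_i(k) = max(A_{i-1}(k), V_{i-1}(k))
            c = np.fft.ifft(A)                        # cepstrum of the updated log spectrum
            c[kcc + 1:len(A) - kcc] = 0               # lifter: keep only the first kcc coefficients
            V = np.real(np.fft.fft(c))                # smoothed envelope V_i
            if np.max(A - V) < delta:                 # envelope now covers the spectral peaks
                break
        return V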


Parametric Model

I No open source implementation available for the TAE, thus we implemented it following the procedure highlighted in [Roebel and Rodet 2005, Caetano and Rodet 2012]

I Figure shows a TAE snap from [Caetano and Rodet 2012]

I Similar to the results we get (audio examples 4 and 5)


Parametric Model

[Figure: TAE spectral envelopes, Magnitude (dB, −80 to −20) vs Frequency (kHz, 0 to 5), for MIDI 60, 65 and 71]

I Spectral envelope shape varies across pitch

1 Dependence of envelope on pitch [Slawson 1981, Caetano and Rodet 2012]

2 Variation due to the TAE algorithm

I TAE → smooth function to estimate harmonic amplitudes


Parametric Model

I No. of CCs (Cepstral Coefficients, Kcc) depends on the sampling rate (Fs) and pitch (f0):

Kcc ≤ Fs / (2 · f0)

I For our network we choose the maximum Kcc (lowest pitch) = 91 and zero-pad the higher-pitch CC vectors to 91 dimensions
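
For example (a tiny helper consistent with the bound above; the 48 kHz sampling rate is assumed only for illustration):

    def num_ccs(f0, fs=48000.0):
        # K_cc <= Fs / (2 * f0); the lowest pitch gives the largest K_cc
        return int(fs / (2.0 * f0))

    def pad_ccs(ccs, target_dim=91):
        # zero-pad higher-pitch CC vectors up to the common 91 dimensions
        return list(ccs) + [0.0] * (target_dim - len(ccs))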


Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimize the MSE (Mean Squared Error) between input and network reconstruction

• Simple to train and performs good reconstruction (since they minimize MSE)
• Not truly a generative model, as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] - Inspired from Variational Inference, enforce a prior on the latent space

• Truly generative, as we can 'generate' new data by sampling from the prior

I Conditional Variational Autoencoders [Doersch 2016, Sohn et al 2015] - Same principle as a VAE, however learns the conditional distribution over an additional conditioning variable


Generative Models

Why VAE over AE?
Continuous latent space from which we can sample points (and synthesize the corresponding audio)

Why CVAE over VAE?
Conditioning on pitch ⇒ Network captures dependencies between the timbre and the pitch ⇒ More accurate envelope generation + Pitch control


Network Architecture

[Block diagram: CCs, f0 → Encoder → Z → Decoder → reconstructed CCs (CVAE)]

I Network input is CCs → MSE represents a perceptually relevant distance, in terms of squared error between the input and reconstructed log magnitude spectral envelopes

I We train the network on frames from the train instances

I For evaluation, the MSE values reported ahead are the average reconstruction error across all the test instance frames


Network Architecture

I Main hyperparameters -

1 β - Controls relative weighting between reconstruction and prior enforcement (sketched in code below):

L ∝ MSE + β · KLD

Figure: CVAE, varying β (MSE vs latent space dimensionality, for β = 0.01, 0.1 and 1)

I Tradeoff between both terms; choose β = 0.1
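
In code, this loss could look like the sketch below (PyTorch-style, assuming the encoder outputs a mean mu and log-variance logvar for a Gaussian posterior; the exact scaling conventions may differ from our actual implementation):

    import torch
    import torch.nn.functional as F

    def cvae_loss(x_hat, x, mu, logvar, beta=0.1):
        mse = F.mse_loss(x_hat, x)                                       # reconstruction term
        kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())   # KL to the unit Gaussian prior
        return mse + beta * kld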


Network Architecture

I Main hyperparameters -

2 Dimensionality of latent space - network's reconstruction ability

Figure: CVAE (β = 0.1) vs AE (MSE vs latent space dimensionality)

I Steep fall initially, flatter later. Choose dimensionality = 32


Network Architecture

I Network size: [91, 91, 32, 91, 91]

I 91 is the dimension of the input CCs, 32 is the latent space dimensionality

I Linear Fully Connected Layers + Leaky ReLU activations

I ADAM [Kingma and Ba 2014] with initial lr = 10^-3

I Training for 2000 epochs with batch size 512
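
A sketch of what such a network could look like in PyTorch; the way f0 is normalized and concatenated to the encoder input and to the latent code is our assumption, while the layer sizes, activations and optimizer follow the slide:

    import torch
    import torch.nn as nn

    class CVAE(nn.Module):
        def __init__(self, n_ccs=91, hidden=91, latent=32):
            super().__init__()
            self.enc = nn.Sequential(nn.Linear(n_ccs + 1, hidden), nn.LeakyReLU())
            self.mu = nn.Linear(hidden, latent)
            self.logvar = nn.Linear(hidden, latent)
            self.dec = nn.Sequential(nn.Linear(latent + 1, hidden), nn.LeakyReLU(),
                                     nn.Linear(hidden, n_ccs))

        def forward(self, ccs, f0):
            h = self.enc(torch.cat([ccs, f0], dim=1))                 # condition the encoder on pitch
            mu, logvar = self.mu(h), self.logvar(h)
            z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
            return self.dec(torch.cat([z, f0], dim=1)), mu, logvar    # condition the decoder on pitch

    model = CVAE()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    # train for 2000 epochs with batch size 512, minimizing MSE + 0.1 * KLD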


Experiments

I Two kinds of experiments to demonstrate the network's capabilities

1 Reconstruction - Omit pitch instances during training and see how well the model reconstructs notes of the omitted target pitch

2 Generation - How well the model 'synthesizes' note instances with new, unseen pitches


Reconstruction

I Two training contexts:

1 T (×) is the target, X marks training instances:
   MIDI:  T-3  T-2  T-1   T   T+1  T+2  T+3
   Kept:   X    X    X    ×    X    X    X

2 Octave endpoints (sketched in code below):
   MIDI:  60  61  62  63  64  65
   Kept:   X   ×   ×   ×   ×   ×
   MIDI:  66  67  68  69  70  71
   Kept:   ×   ×   ×   ×   ×   X

I In each of the above cases we compute the MSE as the frame-wise spectral envelope match across all frames of all the target instances
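
Under one reading of the tables above, the two held-out training configurations can be written down directly (plain Python, with target MIDI 63 in the first context; a hypothetical helper for illustration):

    def training_midis(context, target=63, octave=range(60, 72)):
        if context == 'neighbours':      # keep the 3 neighbours on either side, exclude the target
            return [m for m in range(target - 3, target + 4) if m != target]
        if context == 'endpoints':       # keep only the octave endpoints, MIDI 60 and 71
            return [min(octave), max(octave)]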


Reconstruction

[Figure: MSE vs target Pitch (MIDI 60-72) for AE and CVAE]

I Both AE and CVAE reasonably reconstruct the target pitch (audio examples 6, 7, 8)

I CVAE produces better reconstruction, especially when the target pitch is far from the pitches available in the training data (plot above) (audio examples 9, 10, 11)

I Conditioning helps to capture the pitch dependency of the spectral envelope more accurately


Reconstruction

I To emulate the effect of pitch conditioning with an AE, we train the AE by appending the pitch to the input CCs and reconstructing this appended input as shown below

[Block diagram: (f0, CCs) → cAE → (f0', CCs')]

Figure: 'Conditional' AE (cAE)

I [Wyse 2018] followed a similar approach of appending the conditional variables to the input of his model

I The 'cAE' is comparable to our proposed CVAE in that the network might potentially learn something from f0
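
The difference in how pitch enters the two models, as a small sketch (hypothetical tensors; 'model' refers to the CVAE sketched earlier):

    import torch

    ccs = torch.randn(512, 91)      # a batch of cepstral-coefficient frames
    f0 = torch.rand(512, 1)         # normalized pitch values

    # cAE: pitch is simply appended and reconstructed along with the CCs
    cae_in = torch.cat([f0, ccs], dim=1)     # 92-dim input, 92-dim reconstruction target

    # CVAE: pitch conditions encoder and decoder; only the 91-dim CCs are reconstructed
    # x_hat, mu, logvar = model(ccs, f0)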


Reconstruction

[Figure: MSE vs target Pitch (MIDI 60-72) for AE, CVAE and cAE]

I We train the cAE on the octave endpoints and evaluate performance as shown above

I Appending f0 does not improve performance; it is in fact worse than the AE


Generation

I We are interested in 'synthesizing' audio

I We see how well the network can generate an instance of a desired pitch (which the network has not been trained on)

[Block diagram: Z, f0 → Decoder]

Figure: Sampling from the Network

I We follow the procedure in [Blaauw and Bonada 2016] - a random walk to sample points coherently from the latent space
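
A sketch of this generation step (a small-step Gaussian random walk in the latent space, decoded with the desired pitch; the step size and number of frames are illustrative assumptions):

    import torch

    def generate_envelopes(decoder, f0_norm, n_frames=200, latent_dim=32, step=0.05):
        z = torch.zeros(1, latent_dim)                   # start at the prior mean
        f0 = torch.full((1, 1), float(f0_norm))
        frames = []
        for _ in range(n_frames):
            z = z + step * torch.randn_like(z)           # coherent random walk in latent space
            with torch.no_grad():
                frames.append(decoder(torch.cat([z, f0], dim=1)))   # decode conditioned on pitch
        return torch.cat(frames, dim=0)                  # (n_frames, 91) envelope CCs per frame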


Generation

I We train on instances across the entire octave sans MIDI 65, and then generate MIDI 65

I On listening, we find that the generated audio still lacks the soft noisy sound of the violin bowing (audio examples 12, 13, 14)

I For a more realistic synthesis, we generate a violin note with vibrato (audio example 15)


Putting it all together

I We explored autoencoder frameworks in generative models for audio synthesis of instrumental tones

I Our parametric representation decouples 'timbre' and 'pitch', thus relying on the network to model the inter-dependencies

I Pitch conditioning allows us to generate the learnt spectral envelope for that pitch, thus enabling us to vary the pitch contour continuously


Putting it all together

I Our model is still far from perfect, however, and we have to improve it further:

1 An immediate thing to do will be to model the residual of the input audio. This will help in making the generated audio sound more natural

2 We have not taken into account dynamics. A more complete system will involve capturing spectral envelope (timbral) dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests, involving synthesis of larger pitch movements or melodic elements from Indian Classical music

I Our contributions:

1 To the best of our knowledge, we have not come across any work using a parametric model for musical tones in the neural synthesis framework, especially exploiting the conditioning function of the CVAE

2 The TAE method has no known Python implementation, so we plan to make our code open-source to the MIR community for research


References I

[Blaauw and Bonada 2016] Blaauw, M. and Bonada, J. (2016). Modeling and transforming speech using variational autoencoders. In Interspeech, pages 1770-1774.

[Caetano and Rodet 2012] Caetano, M. and Rodet, X. (2012). A source-filter model for musical instrument sound transformation. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 137-140. IEEE.

[Doersch 2016] Doersch, C. (2016). Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.

[Engel et al 2017] Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D., and Simonyan, K. (2017). Neural audio synthesis of musical notes with wavenet autoencoders. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1068-1077. JMLR.org.

[Esling et al 2018] Esling, P., Bitton, A., et al. (2018). Generative timbre spaces: regularizing variational auto-encoders with perceptual metrics. arXiv preprint arXiv:1805.08501.

References II

[Griffin and Lim 1984] Griffin, D. and Lim, J. (1984). Signal estimation from modified short-time fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236-243.

[Hinton and Salakhutdinov 2006] Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504-507.

[IMAI 1979] Imai, S. (1979). Spectral envelope extraction by improved cepstrum. IEICE, 62:217-228.

[Kingma and Ba 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

[Kingma and Welling 2013] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.

[Oord et al 2016] Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. (2016). Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.

References III

[Roche et al 2018] Roche, F., Hueber, T., Limier, S., and Girin, L. (2018). Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models. arXiv preprint arXiv:1806.04096.

[Roebel and Rodet 2005] Roebel, A. and Rodet, X. (2005). Efficient spectral envelope estimation and its application to pitch shifting and envelope preservation. In International Conference on Digital Audio Effects, pages 30-35, Madrid, Spain.

[Romani Picas et al 2015] Romani Picas, O., Parra Rodriguez, H., Dabiri, D., Tokuda, H., Hariya, W., Oishi, K., and Serra, X. (2015). A real-time system for measuring sound goodness in instrumental sounds. In Audio Engineering Society Convention 138. Audio Engineering Society.

[Sarroff and Casey 2014] Sarroff, A. M. and Casey, M. A. (2014). Musical audio synthesis using autoencoding neural nets. In ICMC.

[Serra et al 1997] Serra, X. et al. (1997). Musical sound modeling with sinusoids plus noise. Musical Signal Processing, pages 91-122.

References IV

[Slawson 1981] Slawson, W. (1981). The color of sound: a theoretical study in musical timbre. Music Theory Spectrum, 3:132-141.

[Sohn et al 2015] Sohn, K., Lee, H., and Yan, X. (2015). Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483-3491.

[Wyse 2018] Wyse, L. (2018). Real-valued parametric conditioning of an rnn for interactive sound synthesis. arXiv preprint arXiv:1805.10808.

Audio examples description I

1 Input MIDI 63 to Spectral Model
2 Spectral Model Reconstruction (trained on MIDI 63)
3 Spectral Model Reconstruction (not trained on MIDI 63)
4 Input MIDI 60 note to Parametric Model
5 Parametric Reconstruction of input note
6 Input MIDI 63 Note
7 Parametric AE reconstruction of input
8 Parametric CVAE reconstruction of input
9 Input MIDI 65 note (endpoint-trained model)
10 Parametric AE reconstruction of input (endpoint-trained model)
11 Parametric CVAE reconstruction of input (endpoint-trained model)
12 CVAE Generated MIDI 65 Violin note
13 Similar MIDI 65 Violin note from dataset

Audio examples description II

14 CVAE Reconstruction of the MIDI 65 violin note
15 CVAE Generated MIDI 65 Violin note with vibrato

Page 21: A Variational Parametric Model for Audio Synthesiskrishnasubramani/data/ddp_stage… · analysis-synthesis based reconstruction which fails to model temporal evolution I[Engel et

536

Our Nearest Neighbours

I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution

I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis

X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from

I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class

X Was able to achieve better control over generation byconditioning

times Unable to generalize to untrained pitches

536

Our Nearest Neighbours

I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution

I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis

X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from

I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class

X Was able to achieve better control over generation byconditioning

times Unable to generalize to untrained pitches

536

Our Nearest Neighbours

I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution

I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis

X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from

I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class

X Was able to achieve better control over generation byconditioning

times Unable to generalize to untrained pitches

636

Why Parametric

I Rather than generating new timbres(lsquointerpolatingrsquo acrosssounds) we consider the problem of synthesis of a giveninstrument sound with flexible control over the pitch andloudness dynamics

I Pitch shifting without timbre modification uses a source-filtermodel with the filter(spectral envelope) being kept constant[Roebel and Rodet 2005]

I A powerful parametric representation over raw waveform orspectrogram has the potential to achieve high quality with lesstraining data

1 [Blaauw and Bonada 2016] recognized this in context ofspeech synthesis and used a vocoder representation to train agenerative model achieving promising results along the way

636

Why Parametric

I Rather than generating new timbres(lsquointerpolatingrsquo acrosssounds) we consider the problem of synthesis of a giveninstrument sound with flexible control over the pitch andloudness dynamics

I Pitch shifting without timbre modification uses a source-filtermodel with the filter(spectral envelope) being kept constant[Roebel and Rodet 2005]

I A powerful parametric representation over raw waveform orspectrogram has the potential to achieve high quality with lesstraining data

1 [Blaauw and Bonada 2016] recognized this in context ofspeech synthesis and used a vocoder representation to train agenerative model achieving promising results along the way

736

Dataset

I Good-sounds dataset [Romani Picas et al 2015] consisting ofindividual note and scale recordings for 12 different instruments

I We work with the lsquoViolinrsquo played in mezzo-forte loudnessand choose the 4th octave(MIDI 60-71)

0 5 10 15 20 25 30 35

606162636465666768697071

282626262626

343030

262626

Instances

MID

I

Figure Instances per note in the overall dataset

I Average note duration for the chosen octave is about 45s pernote

736

Dataset

I Good-sounds dataset [Romani Picas et al 2015] consisting ofindividual note and scale recordings for 12 different instruments

I We work with the lsquoViolinrsquo played in mezzo-forte loudnessand choose the 4th octave(MIDI 60-71)

Why we chose ViolinPopular in Indian Music Human voice-like timbre

Ability to produce continuous pitch

0 5 10 15 20 25 30 35

606162636465666768697071

282626262626

343030

262626

Instances

MID

I

Figure Instances per note in the overall dataset

I Average note duration for the chosen octave is about 45s pernote

836

Dataset

I We split the data to train(80) and test(20) instances acrossMIDI note labels

I Our model is trained with frames(duration 213ms) from thetrain instances and system performance is evaluated withframes from the test instances

936

Non-Parametric Reconstruction

I Framewise magnitude spectra reconstruction procedure in[Roche et al 2018] cannot generalize to untrained pitches

I We consider two cases - Train including and excluding MIDI 63along with its 3 neighbouring pitches on either side

Sustain

FFT

En

cod

er

Z

Dec

od

er

AE

I Repeat above across all input spectral frames and invertobtained spectrogram using Griffin-Lim [Griffin and Lim 1984]

936

Non-Parametric Reconstruction

I Framewise magnitude spectra reconstruction procedure in[Roche et al 2018] cannot generalize to untrained pitches

I We consider two cases - Train including and excluding MIDI 63along with its 3 neighbouring pitches on either side

Sustain

FFT

En

cod

er

Z

Dec

od

er

AE

I Repeat above across all input spectral frames and invertobtained spectrogram using Griffin-Lim [Griffin and Lim 1984]

936

Non-Parametric Reconstruction

I Framewise magnitude spectra reconstruction procedure in[Roche et al 2018] cannot generalize to untrained pitches

I We consider two cases - Train including and excluding MIDI 63along with its 3 neighbouring pitches on either side

Sustain

FFT

En

cod

er

Z

Dec

od

er

AE

I Repeat above across all input spectral frames and invertobtained spectrogram using Griffin-Lim [Griffin and Lim 1984]

1036

Non-Parametric Reconstruction

Figure Input MIDI 63 1 1

Figure Including MIDI 63 2 2 Figure Excluding MIDI 63 3 3

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton1)ocgs[i]state=false

1728

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton3)ocgs[i]state=false

1728

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton5)ocgs[i]state=false

1728

1136

Parametric Model

1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)

Sustain

FFTHpR

f0Magnitudes

I Output of HpR block =rArr log-dB magnitudes + harmonics

1136

Parametric Model

1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)

Sustain

FFTHpR

f0Magnitudes

I Output of HpR block =rArr log-dB magnitudes + harmonics

1136

Parametric Model

1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)

Sustain

FFTHpR

f0Magnitudes

I Output of HpR block =rArr log-dB magnitudes + harmonics

1236

Parametric Model

2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]

I TAE =rArr Iterative Cepstral Liftering

A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))

Vi = FFT (lifter(Ai ))

f0Magnitudes

TAE

CCs

f0

1236

Parametric Model

2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]

I TAE =rArr Iterative Cepstral Liftering

A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))

Vi = FFT (lifter(Ai ))

f0Magnitudes

TAE

CCs

f0

1236

Parametric Model

2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]

I TAE =rArr Iterative Cepstral Liftering

A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))

Vi = FFT (lifter(Ai ))

f0Magnitudes

TAE

CCs

f0

1336

Parametric Model

I No open source implementation available for the TAE thus weimplemented it following procedure highlighted in[Roebel and Rodet 2005 Caetano and Rodet 2012]

I Figure shows a TAE snap from [Caetano and Rodet 2012]

I Similar to results we get 1 4 2 5

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton6)ocgs[i]state=false

3216

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton7)ocgs[i]state=false

2256

1436

Parametric Model

0 1 2 3 4 5minus80

minus60

minus40

minus20

Frequency(kHz)Magnitude(dB)

606571

I Spectral envelope shape varies across pitch

1 Dependence of envelope on pitch[Slawson 1981 Caetano and Rodet 2012]

2 Variation due the TAE algorithm

I TAE rarr smooth function to estimate harmonic amplitudes

1536

Parametric Model

I No of CCs(Cepstral CoefficientsKcc) depends on Samplingrate(Fs)pitch(f0)

Kcc leFs2f0

I For our network we choose the maximum Kcc(lowest pitch) =91 and zero pad the high pitch Kcc to 91 dimension vectors

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

17/36

Generative Models

Why VAE over AE?

Continuous latent space from which we can sample points (and synthesize the corresponding audio).

Why CVAE over VAE?

Conditioning on pitch ⇒ the network captures dependencies between timbre and pitch ⇒ more accurate envelope generation + pitch control.
18/36

Network Architecture

[Figure: CVAE block diagram - CCs with f0 conditioning → Encoder → Z → Decoder (conditioned on f0) → reconstructed CCs]

▶ The network input is CCs → the MSE represents a perceptually relevant distance: the squared error between the input and reconstructed log-magnitude spectral envelopes

▶ We train the network on frames from the train instances

▶ For evaluation, the MSE values reported ahead are the average reconstruction error across all the test-instance frames
19/36

Network Architecture

▶ Main hyperparameters:

1. β - controls the relative weighting between the reconstruction and prior-enforcement terms (a loss sketch follows this slide):

L ∝ MSE + β · KLD

[Figure: CVAE reconstruction MSE vs. latent space dimensionality for β = 0.01, 0.1 and 1 (MSE on the order of 10⁻⁵)]

▶ There is a tradeoff between both terms; we choose β = 0.1

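To make the objective concrete, here is a minimal sketch of the β-weighted loss above, assuming a diagonal-Gaussian posterior and a standard-normal prior; the function name cvae_loss and the averaging convention are our assumptions, not the released code:

import torch
import torch.nn.functional as F

def cvae_loss(recon_cc, target_cc, mu, logvar, beta=0.1):
    # Reconstruction term: MSE between input and reconstructed CC (envelope) frames
    mse = F.mse_loss(recon_cc, target_cc, reduction='mean')
    # KL divergence between the approximate posterior N(mu, sigma^2) and the prior N(0, I)
    kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return mse + beta * kld      # L ∝ MSE + β · KLD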
20/36

Network Architecture

▶ Main hyperparameters:

2. Dimensionality of the latent space - governs the network's reconstruction ability

[Figure: reconstruction MSE vs. latent space dimensionality for the CVAE (β = 0.1) and the AE (MSE on the order of 10⁻⁵)]

▶ Steep fall initially, flatter later; we choose dimensionality = 32

21/36

Network Architecture

▶ Network size: [91, 91, 32, 91, 91]

▶ 91 is the dimension of the input CCs; 32 is the latent space dimensionality

▶ Linear fully connected layers + Leaky ReLU activations

▶ ADAM [Kingma and Ba, 2014] with initial lr = 10⁻³

▶ Training for 2000 epochs with batch size 512 (a PyTorch sketch of this setup follows below)

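A minimal PyTorch sketch of a CVAE with these layer sizes, assuming the scalar pitch conditioning is simply concatenated to both the encoder and decoder inputs; class and variable names are illustrative, not the released implementation:

import torch
import torch.nn as nn

class CVAE(nn.Module):
    """[91, 91, 32, 91, 91] CVAE over cepstral coefficients (CCs), conditioned on f0."""
    def __init__(self, cc_dim=91, hidden_dim=91, latent_dim=32, cond_dim=1):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(cc_dim + cond_dim, hidden_dim), nn.LeakyReLU())
        self.enc_mu = nn.Linear(hidden_dim, latent_dim)
        self.enc_logvar = nn.Linear(hidden_dim, latent_dim)
        self.dec = nn.Sequential(
            nn.Linear(latent_dim + cond_dim, hidden_dim), nn.LeakyReLU(),
            nn.Linear(hidden_dim, cc_dim))

    def encode(self, cc, f0):
        h = self.enc(torch.cat([cc, f0], dim=-1))
        return self.enc_mu(h), self.enc_logvar(h)

    def reparameterize(self, mu, logvar):
        return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

    def decode(self, z, f0):
        return self.dec(torch.cat([z, f0], dim=-1))

    def forward(self, cc, f0):
        mu, logvar = self.encode(cc, f0)
        z = self.reparameterize(mu, logvar)
        return self.decode(z, f0), mu, logvar

model = CVAE()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # trained ~2000 epochs, batch size 512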
22/36

Experiments

▶ Two kinds of experiments to demonstrate the network's capabilities:

1. Reconstruction - omit pitch instances during training and see how well the model reconstructs notes of the omitted target pitch

2. Generation - how well the model 'synthesizes' note instances with new, unseen pitches

23/36

Reconstruction

▶ Two training contexts:

1. T (✗) is the target, ✓ marks the training instances:

   MIDI:  T-3  T-2  T-1   T   T+1  T+2  T+3
   Kept:   ✓    ✓    ✓    ✗    ✓    ✓    ✓

2. Octave endpoints:

   MIDI:  60  61  62  63  64  65  66  67  68  69  70  71
   Kept:   ✓   ✗   ✗   ✗   ✗   ✗   ✗   ✗   ✗   ✗   ✗   ✓

▶ In each of the above cases we compute the MSE as the frame-wise spectral envelope match across all frames of all the target instances (a sketch of this evaluation follows below)

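A rough sketch of that evaluation in NumPy, assuming the held-out target instances are available as per-frame CC arrays and the trained model exposes a frame-wise reconstruction function (all names here are illustrative):

import numpy as np

def average_envelope_mse(reconstruct_frame, target_instances, target_f0):
    """Average frame-wise reconstruction MSE over all frames of all held-out target instances.

    reconstruct_frame: callable (cc_frame, f0) -> reconstructed cc_frame
    target_instances:  list of (num_frames, 91) arrays of CCs, one per target-note instance
    target_f0:         pitch of the omitted target note
    """
    errors = []
    for frames in target_instances:
        for cc in frames:
            cc_hat = reconstruct_frame(cc, target_f0)
            errors.append(np.mean((cc - cc_hat) ** 2))   # squared error for this frame
    return float(np.mean(errors))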
24/36

Reconstruction

[Figure: reconstruction MSE vs. target pitch (MIDI 60-72) for the AE and the CVAE (MSE on the order of 10⁻²)]

▶ Both the AE and the CVAE reconstruct the target pitch reasonably well (audio examples 6-8)

▶ The CVAE produces better reconstruction, especially when the target pitch is far from the pitches available in the training data (plot above; audio examples 9-11)

▶ Conditioning helps to capture the pitch dependency of the spectral envelope more accurately


25/36

Reconstruction

▶ To emulate the effect of pitch conditioning with an AE, we train the AE by appending the pitch to the input CCs and reconstructing this appended input, as shown below (a small sketch follows this slide)

[Figure: 'Conditional' AE (cAE) - [f0, CCs] → cAE → [f0′, CCs′]]

▶ [Wyse, 2018] followed a similar approach of appending the conditional variables to the input of his model

▶ The 'cAE' is comparable to our proposed CVAE in that the network might potentially learn something from f0

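A small sketch of how such a cAE input frame could be assembled and scored, assuming the pitch is normalized and prepended to the 91-dimensional CC vector; the scaling and the layout of the appended value are our assumptions:

import numpy as np

def make_cae_frame(cc, f0_hz, f0_max=1000.0):
    """Prepend a (normalized) pitch value to the CC frame; the cAE reconstructs both."""
    return np.concatenate(([f0_hz / f0_max], cc))        # shape: (1 + 91,)

def cae_frame_error(cae_reconstruct, cc, f0_hz):
    x = make_cae_frame(cc, f0_hz)
    x_hat = cae_reconstruct(x)                           # cAE maps [f0, CCs] -> [f0', CCs']
    return np.mean((x[1:] - x_hat[1:]) ** 2)             # score only the CC (envelope) part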
26/36

Reconstruction

[Figure: reconstruction MSE vs. target pitch (MIDI 60-72) for the AE, the CVAE and the cAE (MSE on the order of 10⁻²)]

▶ We train the cAE on the octave endpoints and evaluate its performance as shown above

▶ Appending f0 does not improve performance; it is in fact worse than the AE

27/36

Generation

▶ We are interested in 'synthesizing' audio

▶ We examine how well the network can generate an instance of a desired pitch (one the network has not been trained on)

[Figure: sampling from the network - Z → Decoder, conditioned on f0]

▶ We follow the procedure in [Blaauw and Bonada, 2016] - a random walk to sample points coherently from the latent space (a sketch follows below)

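A minimal sketch of that sampling idea, reusing the decoder of the CVAE sketch above, assuming a small-step Gaussian random walk in the latent space and a fixed conditioning pitch; step size and frame count are illustrative:

import torch

@torch.no_grad()
def generate_cc_frames(model, f0_value, num_frames=200, step=0.05, latent_dim=32):
    """Sample CC frames via a latent-space random walk, decoded with a fixed pitch."""
    f0 = torch.tensor([[float(f0_value)]])
    z = torch.randn(1, latent_dim)                 # start from the prior
    frames = []
    for _ in range(num_frames):
        z = z + step * torch.randn_like(z)         # small step -> coherent evolution
        frames.append(model.decode(z, f0))         # decoder conditioned on the pitch
    return torch.cat(frames, dim=0)                # (num_frames, 91) envelope CC frames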
28/36

Generation

▶ We train on instances across the entire octave sans MIDI 65 and then generate MIDI 65

▶ On listening, the generated audio still lacks the soft, noisy sound of the violin bowing (audio examples 12-14)

▶ For a more realistic synthesis, we generate a violin note with vibrato (audio example 15; a conditioning sketch follows below)

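For the vibrato case only the conditioning changes from frame to frame; a hedged sketch, assuming a sinusoidal pitch deviation around MIDI 65 and a hypothetical frame rate and vibrato depth:

import numpy as np
import torch

@torch.no_grad()
def generate_vibrato_frames(model, midi=65, depth_cents=30.0, rate_hz=6.0,
                            frame_rate=100.0, num_frames=400, latent_dim=32):
    f0_center = 440.0 * 2 ** ((midi - 69) / 12)                   # MIDI 65 ≈ 349.2 Hz
    t = np.arange(num_frames) / frame_rate
    f0_track = f0_center * 2 ** (depth_cents / 1200.0 * np.sin(2 * np.pi * rate_hz * t))
    z = torch.randn(1, latent_dim)
    frames = []
    for f0 in f0_track:
        z = z + 0.05 * torch.randn_like(z)                         # same random walk as above
        frames.append(model.decode(z, torch.tensor([[float(f0)]])))
    return torch.cat(frames, dim=0), f0_track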

29/36

Putting it all together

▶ We explored autoencoder frameworks in generative models for audio synthesis of instrumental tones

▶ Our parametric representation decouples 'timbre' and 'pitch', relying on the network to model their inter-dependencies

▶ Pitch conditioning allows us to generate the learnt spectral envelope for a given pitch, thus enabling us to vary the pitch contour continuously

30/36

Putting it all together

▶ Our model is still far from perfect, and we have to improve it further:

1. An immediate step is to model the residual of the input audio; this will help make the generated audio sound more natural

2. We have not taken dynamics into account; a more complete system would capture the spectral envelope (timbral) dependencies on both pitch and loudness dynamics

3. Conducting more formal listening tests involving synthesis of larger pitch movements or melodic elements from Indian Classical music

▶ Our contributions:

1. To the best of our knowledge, we have not come across any other work using a parametric model for musical tones in the neural synthesis framework, especially one exploiting the conditioning function of the CVAE

2. The TAE method has no known Python implementation, so we plan to make our code open-source to the MIR community for research (a sketch of the iterative procedure appears below)

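For reference, a minimal NumPy sketch of the iterative cepstral-liftering ('true amplitude envelope') idea from [Roebel and Rodet, 2005] and [IMAI, 1979], assuming a one-sided log-magnitude spectrum as input; this is an illustration of the procedure, not our released code:

import numpy as np

def true_amplitude_envelope(log_mag, n_cc, delta=0.1, max_iter=200):
    """Iteratively grow a cepstrally smoothed envelope until it covers the spectrum.

    log_mag : one-sided log-magnitude spectrum of one frame (length nfft//2 + 1)
    n_cc    : number of cepstral coefficients kept by the lifter (assumed > 1)
    Returns the envelope and the liftered cepstral coefficients.
    """
    A = log_mag.copy()
    V = np.full_like(A, -np.inf)                   # V0 = -inf
    for _ in range(max_iter):
        A = np.maximum(A, V)                       # Ai(k) = max(A_{i-1}(k), V_{i-1}(k))
        cep = np.fft.irfft(A)                      # real cepstrum of the symmetric spectrum
        lift = np.zeros_like(cep)
        lift[:n_cc] = cep[:n_cc]                   # keep the low-quefrency coefficients...
        lift[-(n_cc - 1):] = cep[-(n_cc - 1):]     # ...and their symmetric counterparts
        V = np.fft.rfft(lift).real                 # Vi = FFT(lifter(Ai)): smoothed envelope
        if np.max(A - V) < delta:                  # stop once the envelope covers the peaks
            break
    return V, lift[:n_cc]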
31/36

References I

[Blaauw and Bonada, 2016] Blaauw, M. and Bonada, J. (2016). Modeling and transforming speech using variational autoencoders. In Interspeech, pages 1770-1774.

[Caetano and Rodet, 2012] Caetano, M. and Rodet, X. (2012). A source-filter model for musical instrument sound transformation. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 137-140. IEEE.

[Doersch, 2016] Doersch, C. (2016). Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.

[Engel et al., 2017] Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D., and Simonyan, K. (2017). Neural audio synthesis of musical notes with wavenet autoencoders. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1068-1077. JMLR.org.

[Esling et al., 2018] Esling, P., Bitton, A., et al. (2018). Generative timbre spaces: regularizing variational auto-encoders with perceptual metrics. arXiv preprint arXiv:1805.08501.

32/36

References II

[Griffin and Lim, 1984] Griffin, D. and Lim, J. (1984). Signal estimation from modified short-time fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236-243.

[Hinton and Salakhutdinov, 2006] Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504-507.

[IMAI, 1979] Imai, S. (1979). Spectral envelope extraction by improved cepstrum. IEICE, 62:217-228.

[Kingma and Ba, 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

[Kingma and Welling, 2013] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.

[Oord et al., 2016] Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. (2016). Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.

33/36

References III

[Roche et al., 2018] Roche, F., Hueber, T., Limier, S., and Girin, L. (2018). Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models. arXiv preprint arXiv:1806.04096.

[Roebel and Rodet, 2005] Roebel, A. and Rodet, X. (2005). Efficient Spectral Envelope Estimation and its application to pitch shifting and envelope preservation. In International Conference on Digital Audio Effects, pages 30-35, Madrid, Spain.

[Romani Picas et al., 2015] Romani Picas, O., Parra Rodriguez, H., Dabiri, D., Tokuda, H., Hariya, W., Oishi, K., and Serra, X. (2015). A real-time system for measuring sound goodness in instrumental sounds. In Audio Engineering Society Convention 138. Audio Engineering Society.

[Sarroff and Casey, 2014] Sarroff, A. M. and Casey, M. A. (2014). Musical audio synthesis using autoencoding neural nets. In ICMC.

[Serra et al., 1997] Serra, X. et al. (1997). Musical sound modeling with sinusoids plus noise. Musical Signal Processing, pages 91-122.

34/36

References IV

[Slawson, 1981] Slawson, W. (1981). The color of sound: a theoretical study in musical timbre. Music Theory Spectrum, 3:132-141.

[Sohn et al., 2015] Sohn, K., Lee, H., and Yan, X. (2015). Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483-3491.

[Wyse, 2018] Wyse, L. (2018). Real-valued parametric conditioning of an rnn for interactive sound synthesis. arXiv preprint arXiv:1805.10808.

35/36

Audio examples description I

1. Input MIDI 63 to Spectral Model
2. Spectral Model reconstruction (trained on MIDI 63)
3. Spectral Model reconstruction (not trained on MIDI 63)
4. Input MIDI 60 note to Parametric Model
5. Parametric reconstruction of input note
6. Input MIDI 63 note
7. Parametric AE reconstruction of input
8. Parametric CVAE reconstruction of input
9. Input MIDI 65 note (endpoint-trained model)
10. Parametric AE reconstruction of input (endpoint-trained model)
11. Parametric CVAE reconstruction of input (endpoint-trained model)
12. CVAE-generated MIDI 65 violin note
13. Similar MIDI 65 violin note from dataset

36/36

Audio examples description II

14. CVAE reconstruction of the MIDI 65 violin note
15. CVAE-generated MIDI 65 violin note with vibrato

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1736

Generative Models

Why VAE over AE

Continuous latent space from which we can sample points(andsynthesize the corresponding audio)

Why CVAE over VAE

Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control

1736

Generative Models

Why VAE over AE

Continuous latent space from which we can sample points(andsynthesize the corresponding audio)

Why CVAE over VAE

Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control

1736

Generative Models

Why VAE over AE

Continuous latent space from which we can sample points(andsynthesize the corresponding audio)

Why CVAE over VAE

Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control

1736

Generative Models

Why VAE over AE

Continuous latent space from which we can sample points(andsynthesize the corresponding audio)

Why CVAE over VAE

Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control

1836

Network Architecture

CCs

f0

En

cod

er

Z

Dec

od

er

CVAE

I Network input is CCs rarr MSE represents perceptually relevantdistance in terms of squared error between the input andreconstructed log magnitude spectral envelopes

I We train the network on frames from the train instances

I For evaluation MSE values calculated ahead is the averagereconstruction error across all the test instance frames

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

L prop MSE + βKLD

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

2036

Network Architecture

I Main hyperparameters -

2 Dimensionality of latent space - networks reconstruction ability

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

CVAEAE

Figure CVAE(β = 01) vs AE

I Steep fall initially flatter later Choose dimensionality = 32

2036

Network Architecture

I Main hyperparameters -

2 Dimensionality of latent space - networks reconstruction ability

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

CVAEAE

Figure CVAE(β = 01) vs AE

I Steep fall initially flatter later Choose dimensionality = 32

2036

Network Architecture

I Main hyperparameters -

2 Dimensionality of latent space - networks reconstruction ability

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

CVAEAE

Figure CVAE(β = 01) vs AE

I Steep fall initially flatter later Choose dimensionality = 32

2136

Network Architecture

I Network size [91 91 32 91 91]I 91 is the dimension of input CCs 32 is latent space

dimensionality

I Linear Fully Connected Layers + Leaky ReLU activations

I ADAM [Kingma and Ba 2014] with initial lr = 10minus3

I Training for 2000 epochs with batch size 512

2236

Experiments

I Two kinds of experiments to demonstrate networks capabilities

1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch

2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches

2236

Experiments

I Two kinds of experiments to demonstrate networks capabilities

1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch

2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches

2236

Experiments

I Two kinds of experiments to demonstrate networks capabilities

1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch

2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2436

Reconstruction

60 62 64 66 68 70 72

1

2

3

4

5middot10minus2

Pitch

MSE

AECVAE

I Both AE and CVAE reasonably reconstruct the target pitch1 6 2 7 3 8

I CVAE produces better reconstruction especially when thetarget pitch is far from the pitches available in the trainingdata(plot above) 4 9 5 10 6 11

I Conditioning helps to capture the pitch dependency of thespectral envelope more accurately

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton8)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton9)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton10)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton11)ocgs[i]state=false

1536

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton12)ocgs[i]state=false

1536

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton13)ocgs[i]state=false

1536

2536

Reconstruction

I To emulate the effect of pitch conditioning with an AE wetrain the AE by appending the pitch to the input CCs andreconstructing this appended input as shown below

f0

CCscAE

fprime

0

CCsprime

Figure lsquoConditionalrsquo AE(cAE)

I [Wyse 2018] followed a similar approach of appending theconditional variables to the input of his model

I the lsquocAErsquo is comparable to our proposed CVAE in that thenetwork might potentially learn something from f0

2636

Reconstruction

60 62 64 66 68 70 72

1

2

3

4

5middot10minus2

Pitch

MSE

AECVAEcAE

I We train the cAE on the octave endpoints and evaluateperformance as shown above

I Appending f0 does not improve performance It is infact worsethan AE

2736

Generation

I We are interested in lsquosynthesizingrsquo audio

I We see how well the network can generate an instance of adesired pitch(which the network has not been trained on)

Z

Dec

od

er

f0

Figure Sampling from the Network

I We follow the procedure in [Blaauw and Bonada 2016] - arandom walk to sample points coherently from the latent space

2836

Generation

I We train on instances across the entire octave sans MIDI 65and then generate MIDI 65

I On listening we can see that the generated audio still lacks thesoft noisy sound of the violin bowing

1 12 2 13 3 14

I For a more realistic synthesis we generate a violin note withvibrato 4 15

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton14)ocgs[i]state=false

2184

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton15)ocgs[i]state=false

2256

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton16)ocgs[i]state=false

2184

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton17)ocgs[i]state=false

2184

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3136

References I

[Blaauw and Bonada 2016] Blaauw M and Bonada J (2016)Modeling and transforming speech using variational autoencodersIn Interspeech pages 1770ndash1774

[Caetano and Rodet 2012] Caetano M and Rodet X (2012)A source-filter model for musical instrument sound transformationIn 2012 IEEE International Conference on Acoustics Speech and Signal Processing(ICASSP) pages 137ndash140 IEEE

[Doersch 2016] Doersch C (2016)Tutorial on variational autoencodersarXiv preprint arXiv160605908

[Engel et al 2017] Engel J Resnick C Roberts A Dieleman S Norouzi MEck D and Simonyan K (2017)Neural audio synthesis of musical notes with wavenet autoencodersIn Proceedings of the 34th International Conference on Machine Learning-Volume70 pages 1068ndash1077 JMLR org

[Esling et al 2018] Esling P Bitton A et al (2018)Generative timbre spaces regularizing variational auto-encoders with perceptualmetricsarXiv preprint arXiv180508501

3236

References II

[Griffin and Lim 1984] Griffin D and Lim J (1984)Signal estimation from modified short-time fourier transformIEEE Transactions on Acoustics Speech and Signal Processing 32(2)236ndash243

[Hinton and Salakhutdinov 2006] Hinton G E and Salakhutdinov R R (2006)Reducing the dimensionality of data with neural networksscience 313(5786)504ndash507

[IMAI 1979] IMAI S (1979)Spectral envelope extraction by improved cepstrumIEICE 62217ndash228

[Kingma and Ba 2014] Kingma D P and Ba J (2014)Adam A method for stochastic optimizationarXiv preprint arXiv14126980

[Kingma and Welling 2013] Kingma D P and Welling M (2013)Auto-encoding variational bayesarXiv preprint arXiv13126114

[Oord et al 2016] Oord A v d Dieleman S Zen H Simonyan K Vinyals OGraves A Kalchbrenner N Senior A and Kavukcuoglu K (2016)Wavenet A generative model for raw audioarXiv preprint arXiv160903499

3336

References III

[Roche et al 2018] Roche F Hueber T Limier S and Girin L (2018)Autoencoders for music sound modeling a comparison of linear shallow deeprecurrent and variational modelsarXiv preprint arXiv180604096

[Roebel and Rodet 2005] Roebel A and Rodet X (2005)Efficient Spectral Envelope Estimation and its application to pitch shifting andenvelope preservationIn International Conference on Digital Audio Effects pages 30ndash35 Madrid Spaincote interne IRCAM Roebel05b

[Romani Picas et al 2015] Romani Picas O Parra Rodriguez H Dabiri DTokuda H Hariya W Oishi K and Serra X (2015)A real-time system for measuring sound goodness in instrumental soundsIn Audio Engineering Society Convention 138 Audio Engineering Society

[Sarroff and Casey 2014] Sarroff A M and Casey M A (2014)Musical audio synthesis using autoencoding neural netsIn ICMC

[Serra et al 1997] Serra X et al (1997)Musical sound modeling with sinusoids plus noiseMusical signal processing pages 91ndash122

3436

References IV

[Slawson 1981] Slawson W (1981)The color of sound a theoretical study in musical timbreMusic Theory Spectrum 3132ndash141

[Sohn et al 2015] Sohn K Lee H and Yan X (2015)Learning structured output representation using deep conditional generative models

In Advances in neural information processing systems pages 3483ndash3491

[Wyse 2018] Wyse L (2018)Real-valued parametric conditioning of an rnn for interactive sound synthesisarXiv preprint arXiv180510808

3536

Audio examples description I

1 Input MIDI 63 to Spectral Model

2 Spectral Model Reconstruction(trained on MIDI63)

3 Spectral Model Reconstruction(not trained on MIDI63)

4 Input MIDI 60 note to Parametric Model

5 Parametric Reconstruction of input note

6 Input MIDI 63 Note

7 Parametric AE reconstruction of input

8 Parametric CVAE reconstruction of input

9 Input MIDI 65 note(endpoint trained model)

10 Parametric AE reconstruction of input(endpoint trained model)

11 Parametric CVAE reconstruction of input(endpoint trained model)

12 CVAE Generated MIDI 65 Violin note

13 Similar MIDI 65 Violin note from dataset

3636

Audio examples description II

14 CVAE Reconstruction of the MIDI 65 violin note

15 CVAE Generated MIDI 65 Violin note with vibrato

  • Notes
      1. fdrm17
      2. fdrm16
      3. fdrm15
      4. fdrm14
      5. fdrm13
      6. fdrm12
      7. fdrm11
      8. fdrm10
      9. fdrm9
      10. fdrm8
      11. fdrm7
      12. fdrm6
      13. fdrm5
      14. fdrm3
      15. fdrm1
Page 23: A Variational Parametric Model for Audio Synthesiskrishnasubramani/data/ddp_stage… · analysis-synthesis based reconstruction which fails to model temporal evolution I[Engel et

536

Our Nearest Neighbours

I Major issue with previous methods was frame-wiseanalysis-synthesis based reconstruction which fails to modeltemporal evolution

I [Engel et al 2017] inspired by Wavenets [Oord et al 2016]autoregressive modeling capablities for speech extended it tomusical instrument synthesis

X Generation of realistic and creative soundstimes Complex network which was difficult to train and sample from

I [Wyse 2018] also autoregressively modelled the audio albeitby conditioning the waveform samples on additional parameterslike pitch velocity(loudness) and instrument class

X Was able to achieve better control over generation byconditioning

times Unable to generalize to untrained pitches

Why Parametric?

I Rather than generating new timbres ('interpolating' across sounds), we consider the problem of synthesis of a given instrument sound with flexible control over the pitch and loudness dynamics

I Pitch shifting without timbre modification uses a source-filter model, with the filter (spectral envelope) being kept constant [Roebel and Rodet 2005]

I A powerful parametric representation over raw waveform or spectrogram has the potential to achieve high quality with less training data

1. [Blaauw and Bonada 2016] recognized this in the context of speech synthesis and used a vocoder representation to train a generative model, achieving promising results along the way

Dataset

I Good-sounds dataset [Romani Picas et al 2015], consisting of individual note and scale recordings for 12 different instruments

I We work with the 'Violin', played in mezzo-forte loudness, and choose the 4th octave (MIDI 60-71)

Why we chose Violin: popular in Indian music, human voice-like timbre, ability to produce continuous pitch

Figure: Instances per note in the overall dataset (26-34 instances per MIDI note, 60-71)

I Average note duration for the chosen octave is about 4.5 s per note

Dataset

I We split the data into train (80%) and test (20%) instances across MIDI note labels (a small sketch of such a split follows)

I Our model is trained with frames (duration 21.3 ms) from the train instances, and system performance is evaluated with frames from the test instances
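As a concrete illustration, a minimal sketch of a per-note train/test split; the (midi_note, filepath) instance format and the fixed random seed are assumptions of this sketch, not details given in the talk.

    # Hedged sketch: per-MIDI-note 80/20 split of note instances.
    import random
    from collections import defaultdict

    def split_instances(instances, train_frac=0.8, seed=0):
        # instances: list of (midi_note, filepath) pairs (assumed format)
        by_note = defaultdict(list)
        for note, path in instances:
            by_note[note].append(path)
        rng = random.Random(seed)
        train, test = [], []
        for note, paths in by_note.items():
            rng.shuffle(paths)
            k = int(train_frac * len(paths))
            train += [(note, p) for p in paths[:k]]
            test += [(note, p) for p in paths[k:]]
        return train, test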

Non-Parametric Reconstruction

I Framewise magnitude spectra reconstruction procedure in [Roche et al 2018] cannot generalize to untrained pitches

I We consider two cases - train including and excluding MIDI 63, along with its 3 neighbouring pitches on either side

Figure: Sustain segment → FFT → Encoder → Z → Decoder (AE)

I Repeat the above across all input spectral frames and invert the obtained spectrogram using Griffin-Lim [Griffin and Lim 1984] (see the sketch below)
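For reference, a minimal sketch of this analysis-resynthesis loop, assuming librosa and a frame-wise autoencoder `ae` that maps one magnitude frame to its reconstruction; the FFT/hop sizes here are illustrative, not the talk's settings.

    import numpy as np
    import librosa

    def ae_resynthesize(y, ae, n_fft=1024, hop=256):
        S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))   # magnitude spectrogram
        S_hat = np.stack([ae(frame) for frame in S.T], axis=1)     # frame-wise AE pass
        # phase-less inversion of the modified spectrogram [Griffin and Lim 1984]
        return librosa.griffinlim(S_hat, hop_length=hop, n_iter=60)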

Non-Parametric Reconstruction

Figure: Input MIDI 63 (audio example 1)
Figure: Reconstruction, trained including MIDI 63 (audio example 2)
Figure: Reconstruction, trained excluding MIDI 63 (audio example 3)

Parametric Model

1. Frame-wise magnitude spectrum → harmonic representation using the Harmonic plus Residual (HpR) model [Serra et al 1997] (currently we neglect the residual)

Figure: Sustain segment → FFT → HpR → f0 + harmonic magnitudes

I Output of HpR block ⇒ log-dB magnitudes + harmonics
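A toy illustration of the idea: the real system uses the full sinusoidal-model analysis of [Serra et al 1997], whereas this sketch simply peak-picks near integer multiples of f0; the harmonic count and search width are assumptions.

    import numpy as np

    def harmonic_log_magnitudes(mag_frame, f0, sr, n_fft, n_harm=40, search=3):
        mags_db = []
        for k in range(1, n_harm + 1):
            centre = int(round(k * f0 * n_fft / sr))     # bin nearest to k*f0
            if centre + search >= len(mag_frame):
                break
            lo = max(1, centre - search)
            peak = lo + int(np.argmax(mag_frame[lo:centre + search + 1]))
            mags_db.append(20 * np.log10(mag_frame[peak] + 1e-12))  # log-dB magnitude
        return np.array(mags_db)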

Parametric Model

2. log-dB magnitudes + harmonics → TAE algorithm [Roebel and Rodet 2005, IMAI 1979]

I TAE ⇒ Iterative Cepstral Liftering:

    A_0(k) = log(|X(k)|), V_0 = -∞
    while (A_i - A_0 < Δ):
        A_i(k) = max(A_{i-1}(k), V_{i-1}(k))
        V_i = FFT(lifter(A_i))

Figure: (f0, magnitudes) → TAE → (CCs, f0)

Parametric Model

I No open-source implementation is available for the TAE, thus we implemented it following the procedure highlighted in [Roebel and Rodet 2005, Caetano and Rodet 2012] (a simplified sketch follows)

I Figure shows a TAE snapshot from [Caetano and Rodet 2012]

I Similar to the results we get (audio examples 4, 5)

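A compact sketch of the iterative liftering loop; the stopping rule, lifter symmetry and coefficient count here are simplifications/assumptions, see [Roebel and Rodet 2005, IMAI 1979] for the exact procedure.

    import numpy as np

    def true_amplitude_envelope(log_mag, n_ccs, n_iter=100, tol=1e-4):
        # log_mag: full (symmetric) log-magnitude spectrum of one frame
        A = np.asarray(log_mag, dtype=float).copy()   # A_0(k) = log|X(k)|
        V = np.full_like(A, -np.inf)                  # V_0 = -inf
        for _ in range(n_iter):
            A = np.maximum(A, V)                      # A_i(k) = max(A_{i-1}(k), V_{i-1}(k))
            ceps = np.fft.ifft(A).real                # real cepstrum of current estimate
            ceps[n_ccs:-n_ccs] = 0.0                  # lifter: keep low-quefrency CCs
            V = np.fft.fft(ceps).real                 # V_i = FFT(lifter(A_i))
            if np.max(A - V) < tol:                   # stop once envelope hugs the peaks
                break
        return V, ceps[:n_ccs]                        # smooth envelope + its CCs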

Parametric Model

Figure: TAE spectral envelopes for MIDI 60, 65 and 71 (Magnitude (dB) vs. Frequency (kHz))

I Spectral envelope shape varies across pitch:
  1. Dependence of the envelope on pitch [Slawson 1981, Caetano and Rodet 2012]
  2. Variation due to the TAE algorithm

I TAE → smooth function to estimate harmonic amplitudes

Parametric Model

I Number of CCs (Cepstral Coefficients, K_cc) depends on the sampling rate (F_s) and pitch (f_0):

    K_cc ≤ F_s / (2 f_0)

I For our network, we choose the maximum K_cc (lowest pitch) = 91, and zero-pad the higher-pitch CC vectors to 91-dimensional vectors (see the sketch below)
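A small sketch of this bookkeeping; the 48 kHz sampling rate in the example is an assumption used only to reproduce the stated maximum of 91 coefficients.

    import numpy as np

    def num_ccs(fs, f0):
        return int(fs // (2 * f0))            # K_cc <= F_s / (2 * f_0)

    def pad_ccs(ccs, dim=91):
        out = np.zeros(dim)
        out[:min(len(ccs), dim)] = ccs[:dim]  # higher pitches -> fewer CCs -> zero-pad
        return out

    # Assuming a 48 kHz sampling rate, the lowest pitch (MIDI 60, ~261.6 Hz)
    # gives K_cc = 91, which fixes the network's input dimensionality.
    print(num_ccs(48000, 261.63))             # -> 91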

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - minimize the MSE (Mean Squared Error) between the input and the network reconstruction
  • Simple to train, and performs good reconstruction (since they minimize MSE)
  • Not truly a generative model, as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] - inspired by Variational Inference; enforce a prior on the latent space
  • Truly generative, as we can 'generate' new data by sampling from the prior

I Conditional Variational Autoencoders [Doersch 2016, Sohn et al 2015] - same principle as a VAE, however learns the conditional distribution over an additional conditioning variable

Generative Models

Why VAE over AE?
Continuous latent space from which we can sample points (and synthesize the corresponding audio)

Why CVAE over VAE?
Conditioning on pitch ⇒ network captures dependencies between the timbre and the pitch ⇒ more accurate envelope generation + pitch control

Network Architecture

Figure: (CCs, f0) → Encoder → Z → Decoder (CVAE)

I Network input is CCs → MSE represents a perceptually relevant distance, in terms of the squared error between the input and reconstructed log-magnitude spectral envelopes

I We train the network on frames from the train instances

I For evaluation, the MSE values reported ahead are the average reconstruction error across all the test instance frames

Network Architecture

I Main hyperparameters -

1. β - controls the relative weighting between reconstruction and prior enforcement:

    L ∝ MSE + β · KLD

Figure: CVAE, varying β (MSE vs. latent space dimensionality, for β = 0.01, 0.1, 1)

I Tradeoff between both terms; we choose β = 0.1

Network Architecture

I Main hyperparameters -

2. Dimensionality of the latent space - the network's reconstruction ability

Figure: CVAE (β = 0.1) vs. AE (MSE vs. latent space dimensionality)

I Steep fall initially, flatter later; we choose dimensionality = 32

Network Architecture

I Network size: [91, 91, 32, 91, 91]
I 91 is the dimension of the input CCs, 32 is the latent space dimensionality
I Linear fully connected layers + Leaky ReLU activations
I ADAM [Kingma and Ba 2014] with initial lr = 10^-3
I Training for 2000 epochs with batch size 512 (a hedged code sketch follows)
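A hedged PyTorch sketch consistent with the sizes above; how f0 is normalized and injected into the encoder/decoder, and the exact weighting of the KLD term, are assumptions of this sketch rather than confirmed details.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CVAE(nn.Module):
        def __init__(self, cc_dim=91, hidden=91, z_dim=32):
            super().__init__()
            self.enc = nn.Sequential(nn.Linear(cc_dim + 1, hidden), nn.LeakyReLU())
            self.mu = nn.Linear(hidden, z_dim)
            self.logvar = nn.Linear(hidden, z_dim)
            self.dec = nn.Sequential(nn.Linear(z_dim + 1, hidden), nn.LeakyReLU(),
                                     nn.Linear(hidden, cc_dim))

        def forward(self, ccs, f0):                    # f0: shape (batch, 1), normalized
            h = self.enc(torch.cat([ccs, f0], dim=-1))
            mu, logvar = self.mu(h), self.logvar(h)
            z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization
            return self.dec(torch.cat([z, f0], dim=-1)), mu, logvar

    def cvae_loss(recon, ccs, mu, logvar, beta=0.1):
        mse = F.mse_loss(recon, ccs)                                   # reconstruction term
        kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp()) # prior term
        return mse + beta * kld                                        # L ∝ MSE + β·KLD

    model = CVAE()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # ADAM, initial lr = 1e-3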

Experiments

I Two kinds of experiments to demonstrate the network's capabilities:

1. Reconstruction - omit pitch instances during training and see how well the model reconstructs notes of the omitted target pitch

2. Generation - how well the model 'synthesizes' note instances with new, unseen pitches

Reconstruction

I Two training contexts -

1. T (✗) is the target, ✓ marks training instances:

   MIDI: T-3  T-2  T-1  T   T+1  T+2  T+3
   Kept: ✓    ✓    ✓    ✗   ✓    ✓    ✓

2. Octave endpoints:

   MIDI: 60  61  62  63  64  65  66  67  68  69  70  71
   Kept: ✓   ✗   ✗   ✗   ✗   ✗   ✗   ✗   ✗   ✗   ✗   ✓

I In each of the above cases, we compute the MSE as the frame-wise spectral envelope match across all frames of all the target instances

Reconstruction

Figure: MSE vs. target pitch for AE and CVAE

I Both AE and CVAE reasonably reconstruct the target pitch (audio examples 6-8)

I CVAE produces better reconstruction, especially when the target pitch is far from the pitches available in the training data (plot above; audio examples 9-11)

I Conditioning helps to capture the pitch dependency of the spectral envelope more accurately

Reconstruction

I To emulate the effect of pitch conditioning with an AE, we train the AE by appending the pitch to the input CCs and reconstructing this appended input, as sketched below

Figure: 'Conditional' AE (cAE): (CCs, f0) → cAE → (CCs', f0')

I [Wyse 2018] followed a similar approach of appending the conditional variables to the input of his model

I The 'cAE' is comparable to our proposed CVAE in that the network might potentially learn something from f0
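A minimal sketch of this baseline, mirroring the CVAE sketch earlier; the layer sizes are assumed to match, and the original implementation may differ.

    import torch.nn as nn

    class ConditionalAE(nn.Module):
        # 'cAE' baseline: pitch is simply appended to the CC vector and the
        # whole concatenation [CCs, f0] is reconstructed.
        def __init__(self, cc_dim=91, hidden=91, z_dim=32):
            super().__init__()
            self.enc = nn.Sequential(nn.Linear(cc_dim + 1, hidden), nn.LeakyReLU(),
                                     nn.Linear(hidden, z_dim))
            self.dec = nn.Sequential(nn.Linear(z_dim, hidden), nn.LeakyReLU(),
                                     nn.Linear(hidden, cc_dim + 1))

        def forward(self, ccs_f0):              # input: concatenated [CCs, f0]
            return self.dec(self.enc(ccs_f0))   # output: reconstructed [CCs', f0']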

Reconstruction

Figure: MSE vs. target pitch for AE, CVAE and cAE (octave-endpoint training)

I We train the cAE on the octave endpoints and evaluate performance as shown above

I Appending f0 does not improve performance; it is in fact worse than the AE

Generation

I We are interested in 'synthesizing' audio

I We see how well the network can generate an instance of a desired pitch (which the network has not been trained on)

Figure: Sampling from the network: (Z, f0) → Decoder

I We follow the procedure in [Blaauw and Bonada 2016] - a random walk to sample points coherently from the latent space (sketched below)
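In the spirit of that procedure, a hedged sketch that decodes a slow latent random walk at a desired pitch. It reuses `dec` from the CVAE sketch above; the step size, frame count and f0 scaling are assumptions of this sketch.

    import torch

    @torch.no_grad()
    def generate_cc_frames(model, f0_value, n_frames=200, z_dim=32, step=0.05):
        z = torch.randn(1, z_dim)                      # start from the prior
        f0 = torch.full((1, 1), float(f0_value))       # desired (possibly unseen) pitch
        frames = []
        for _ in range(n_frames):
            z = z + step * torch.randn_like(z)         # slow random walk -> coherent frames
            frames.append(model.dec(torch.cat([z, f0], dim=-1)))
        return torch.cat(frames, dim=0)                # (n_frames, 91) CC envelope frames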

Generation

I We train on instances across the entire octave sans MIDI 65, and then generate MIDI 65

I On listening, we can see that the generated audio still lacks the soft, noisy sound of the violin bowing (audio examples 12-14)

I For a more realistic synthesis, we generate a violin note with vibrato (audio example 15)

Putting it all together

I We explored autoencoder frameworks in generative models for audio synthesis of instrumental tones

I Our parametric representation decouples 'timbre' and 'pitch', thus relying on the network to model the inter-dependencies

I Pitch conditioning allows us to generate the learnt spectral envelope for that pitch, thus enabling us to vary the pitch contour continuously

Putting it all together

I Our model is still far from perfect, however, and we have to improve it further:

1. An immediate thing to do will be to model the residual of the input audio. This will help in making the generated audio sound more natural
2. We have not taken dynamics into account. A more complete system will involve capturing spectral envelope (timbral) dependencies on both pitch and loudness dynamics
3. Conducting more formal listening tests involving synthesis of larger pitch movements, or melodic elements from Indian Classical music

I Our contributions:

1. To the best of our knowledge, we have not come across any work using a parametric model for musical tones in the neural synthesis framework, especially one exploiting the conditioning function of the CVAE
2. The TAE method has no known Python implementation, so we plan to make our code open-source to the MIR community for research


3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3136

References I

[Blaauw and Bonada 2016] Blaauw M and Bonada J (2016)Modeling and transforming speech using variational autoencodersIn Interspeech pages 1770ndash1774

[Caetano and Rodet 2012] Caetano M and Rodet X (2012)A source-filter model for musical instrument sound transformationIn 2012 IEEE International Conference on Acoustics Speech and Signal Processing(ICASSP) pages 137ndash140 IEEE

[Doersch 2016] Doersch C (2016)Tutorial on variational autoencodersarXiv preprint arXiv160605908

[Engel et al 2017] Engel J Resnick C Roberts A Dieleman S Norouzi MEck D and Simonyan K (2017)Neural audio synthesis of musical notes with wavenet autoencodersIn Proceedings of the 34th International Conference on Machine Learning-Volume70 pages 1068ndash1077 JMLR org

[Esling et al 2018] Esling P Bitton A et al (2018)Generative timbre spaces regularizing variational auto-encoders with perceptualmetricsarXiv preprint arXiv180508501

3236

References II

[Griffin and Lim 1984] Griffin D and Lim J (1984)Signal estimation from modified short-time fourier transformIEEE Transactions on Acoustics Speech and Signal Processing 32(2)236ndash243

[Hinton and Salakhutdinov 2006] Hinton G E and Salakhutdinov R R (2006)Reducing the dimensionality of data with neural networksscience 313(5786)504ndash507

[IMAI 1979] IMAI S (1979)Spectral envelope extraction by improved cepstrumIEICE 62217ndash228

[Kingma and Ba 2014] Kingma D P and Ba J (2014)Adam A method for stochastic optimizationarXiv preprint arXiv14126980

[Kingma and Welling 2013] Kingma D P and Welling M (2013)Auto-encoding variational bayesarXiv preprint arXiv13126114

[Oord et al 2016] Oord A v d Dieleman S Zen H Simonyan K Vinyals OGraves A Kalchbrenner N Senior A and Kavukcuoglu K (2016)Wavenet A generative model for raw audioarXiv preprint arXiv160903499

3336

References III

[Roche et al 2018] Roche F Hueber T Limier S and Girin L (2018)Autoencoders for music sound modeling a comparison of linear shallow deeprecurrent and variational modelsarXiv preprint arXiv180604096

[Roebel and Rodet 2005] Roebel A and Rodet X (2005)Efficient Spectral Envelope Estimation and its application to pitch shifting andenvelope preservationIn International Conference on Digital Audio Effects pages 30ndash35 Madrid Spaincote interne IRCAM Roebel05b

[Romani Picas et al 2015] Romani Picas O Parra Rodriguez H Dabiri DTokuda H Hariya W Oishi K and Serra X (2015)A real-time system for measuring sound goodness in instrumental soundsIn Audio Engineering Society Convention 138 Audio Engineering Society

[Sarroff and Casey 2014] Sarroff A M and Casey M A (2014)Musical audio synthesis using autoencoding neural netsIn ICMC

[Serra et al 1997] Serra X et al (1997)Musical sound modeling with sinusoids plus noiseMusical signal processing pages 91ndash122

3436

References IV

[Slawson 1981] Slawson W (1981)The color of sound a theoretical study in musical timbreMusic Theory Spectrum 3132ndash141

[Sohn et al 2015] Sohn K Lee H and Yan X (2015)Learning structured output representation using deep conditional generative models

In Advances in neural information processing systems pages 3483ndash3491

[Wyse 2018] Wyse L (2018)Real-valued parametric conditioning of an rnn for interactive sound synthesisarXiv preprint arXiv180510808

3536

Audio examples description I

1 Input MIDI 63 to Spectral Model

2 Spectral Model Reconstruction(trained on MIDI63)

3 Spectral Model Reconstruction(not trained on MIDI63)

4 Input MIDI 60 note to Parametric Model

5 Parametric Reconstruction of input note

6 Input MIDI 63 Note

7 Parametric AE reconstruction of input

8 Parametric CVAE reconstruction of input

9 Input MIDI 65 note(endpoint trained model)

10 Parametric AE reconstruction of input(endpoint trained model)

11 Parametric CVAE reconstruction of input(endpoint trained model)

12 CVAE Generated MIDI 65 Violin note

13 Similar MIDI 65 Violin note from dataset

3636

Audio examples description II

14 CVAE Reconstruction of the MIDI 65 violin note

15 CVAE Generated MIDI 65 Violin note with vibrato

  • Notes
      1. fdrm17
      2. fdrm16
      3. fdrm15
      4. fdrm14
      5. fdrm13
      6. fdrm12
      7. fdrm11
      8. fdrm10
      9. fdrm9
      10. fdrm8
      11. fdrm7
      12. fdrm6
      13. fdrm5
      14. fdrm3
      15. fdrm1
Page 25: A Variational Parametric Model for Audio Synthesiskrishnasubramani/data/ddp_stage… · analysis-synthesis based reconstruction which fails to model temporal evolution I[Engel et

636

Why Parametric?

I Rather than generating new timbres ('interpolating' across sounds), we consider the problem of synthesizing a given instrument's sound with flexible control over the pitch and loudness dynamics

I Pitch shifting without timbre modification uses a source-filter model, with the filter (spectral envelope) kept constant [Roebel and Rodet 2005]

I A powerful parametric representation, over a raw waveform or spectrogram, has the potential to achieve high quality with less training data

1 [Blaauw and Bonada 2016] recognized this in the context of speech synthesis and used a vocoder representation to train a generative model, achieving promising results along the way

736

Dataset

I Good-sounds dataset [Romani Picas et al 2015], consisting of individual note and scale recordings for 12 different instruments

I We work with the 'Violin' played in mezzo-forte loudness and choose the 4th octave (MIDI 60-71)

Figure: Instances per note (MIDI 60-71) in the overall dataset; roughly 26-34 instances per note

I Average note duration for the chosen octave is about 4.5 s per note


Why we chose the Violin: popular in Indian music, human voice-like timbre, and the ability to produce continuous pitch


836

Dataset

I We split the data into train (80%) and test (20%) instances across MIDI note labels

I Our model is trained with frames (duration 21.3 ms) from the train instances, and system performance is evaluated with frames from the test instances

936

Non-Parametric Reconstruction

I The framewise magnitude spectra reconstruction procedure in [Roche et al 2018] cannot generalize to untrained pitches

I We consider two cases - training including and excluding MIDI 63, along with its 3 neighbouring pitches on either side

Diagram: Sustain → FFT → Encoder → Z → Decoder → reconstructed magnitude spectrum (AE)

I Repeat the above across all input spectral frames and invert the obtained spectrogram using Griffin-Lim [Griffin and Lim 1984], as sketched below
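A minimal sketch of this analysis-synthesis loop, assuming librosa for the STFT and Griffin-Lim steps; 'autoencoder', 'n_fft' and 'hop' are placeholder names and values, not necessarily the settings used in this work:

import numpy as np
import librosa

def reconstruct_with_ae(y, sr, autoencoder, n_fft=1024, hop=256):
    # 'autoencoder' is any callable mapping one magnitude-spectrum frame to
    # its reconstruction (e.g. the forward pass of a trained model)
    # Magnitude spectrogram of the (sustain portion of the) note
    S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))
    # Pass each frame (column) through the autoencoder
    S_hat = np.stack([autoencoder(frame) for frame in S.T], axis=1)
    # Invert the magnitude-only spectrogram with Griffin-Lim
    return librosa.griffinlim(S_hat, hop_length=hop, n_fft=n_fft)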

1036

Non-Parametric Reconstruction

Figure: Input MIDI 63 (audio example 1)

Figure: Reconstruction, trained including MIDI 63 (audio example 2); Figure: Reconstruction, trained excluding MIDI 63 (audio example 3)


1136

Parametric Model

1 Frame-wise magnitude spectrum → harmonic representation using the Harmonic plus Residual (HpR) model [Serra et al 1997] (currently we neglect the residual)

Diagram: Sustain → FFT → HpR → f0, harmonic magnitudes

I Output of the HpR block ⇒ log-dB magnitudes + harmonics; a sketch of this per-frame step follows below
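A rough per-frame illustration of what the HpR block outputs (not the actual HpR implementation, which also estimates f0 and separates a residual), assuming the frame's f0 is already known:

import numpy as np

def harmonic_magnitudes_db(mag_frame, f0, sr, n_fft):
    # Harmonics of f0 below the Nyquist frequency
    n_harm = int((sr / 2) // f0)
    harm_freqs = f0 * np.arange(1, n_harm + 1)
    # Nearest FFT bin to each harmonic (a crude stand-in for peak picking)
    bins = np.round(harm_freqs * n_fft / sr).astype(int)
    # Log-dB magnitudes at the harmonic locations
    mags_db = 20 * np.log10(np.maximum(mag_frame[bins], 1e-10))
    return harm_freqs, mags_db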

1236

Parametric Model

2 log-dB magnitudes + harmonics → TAE algorithm [Roebel and Rodet 2005, IMAI 1979]

I TAE ⇒ iterative cepstral liftering:

A_0(k) = log(|X(k)|), V_0 = −∞
while (A_i − A_0 < ∆):
    A_i(k) = max(A_{i−1}(k), V_{i−1}(k))
    V_i = FFT(lifter(A_i))

Diagram: f0, harmonic magnitudes → TAE → CCs, f0
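A Python sketch of this liftering loop; the lifter cutoff kcc, the threshold delta and the exact stopping test follow the usual true-envelope formulation and are assumptions, not necessarily the settings of our implementation:

import numpy as np

def true_amplitude_envelope(log_mag, kcc, delta=0.1, max_iter=100):
    # log_mag: one log-magnitude frame sampled on the rFFT grid
    A = log_mag.copy()
    V = np.full_like(A, -np.inf)
    for _ in range(max_iter):
        A = np.maximum(A, V)              # A_i(k) = max(A_{i-1}(k), V_{i-1}(k))
        ceps = np.fft.irfft(A)            # real cepstrum of the updated spectrum
        ceps[kcc:-kcc] = 0.0              # lifter: keep only low-quefrency coefficients
        V = np.fft.rfft(ceps).real        # V_i = FFT(lifter(A_i))
        if np.max(A - V) < delta:         # stop once the envelope hugs the spectrum
            break
    return V, ceps[:kcc]                  # smoothed envelope and the CCs fed to the network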

1336

Parametric Model

I No open-source implementation is available for the TAE, thus we implemented it following the procedure highlighted in [Roebel and Rodet 2005, Caetano and Rodet 2012]

I The figure shows a TAE snapshot from [Caetano and Rodet 2012]

I Similar to the results we get (audio examples 4, 5)


1436

Parametric Model

Figure: TAE spectral envelopes (Magnitude in dB vs Frequency in kHz) for MIDI 60, 65 and 71

I Spectral envelope shape varies across pitch

1 Dependence of the envelope on pitch [Slawson 1981, Caetano and Rodet 2012]

2 Variation due to the TAE algorithm

I TAE → smooth function to estimate harmonic amplitudes

1536

Parametric Model

I The number of CCs (cepstral coefficients, K_cc) depends on the sampling rate (F_s) and the pitch (f_0):

K_cc ≤ F_s / (2 f_0)

I For our network we choose the maximum K_cc (lowest pitch) = 91 and zero-pad the higher-pitch CC vectors to 91 dimensions; a worked example follows below
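A small worked example of this bound, assuming a 48 kHz sampling rate (an assumption on our part; the bound scales linearly with F_s):

def kcc_for_midi(midi, fs=48000):
    # Equal-tempered frequency of the MIDI note
    f0 = 440.0 * 2 ** ((midi - 69) / 12)
    return int(fs / (2 * f0))

print(kcc_for_midi(60))   # lowest note of the chosen octave -> 91
print(kcc_for_midi(71))   # highest note -> 48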

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - minimize the MSE (mean squared error) between the input and the network reconstruction

• Simple to train and perform good reconstruction (since they minimize MSE)

• Not truly generative models, as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] - inspired by variational inference, enforce a prior on the latent space

• Truly generative, as we can 'generate' new data by sampling from the prior

I Conditional Variational Autoencoders [Doersch 2016, Sohn et al 2015] - same principle as a VAE, but learn the distribution conditioned on an additional conditioning variable

1736

Generative Models

Why VAE over AE?

Continuous latent space from which we can sample points (and synthesize the corresponding audio)

Why CVAE over VAE?

Conditioning on pitch ⇒ the network captures dependencies between timbre and pitch ⇒ more accurate envelope generation + pitch control

1836

Network Architecture

Diagram: CCs, f0 → Encoder → Z → Decoder → reconstructed CCs (CVAE)

I The network input is CCs → the MSE represents a perceptually relevant distance, in terms of squared error between the input and reconstructed log-magnitude spectral envelopes

I We train the network on frames from the train instances

I For evaluation, the MSE values reported ahead are the average reconstruction error across all the test instance frames

1936

Network Architecture

I Main hyperparameters -

1 β - controls the relative weighting between reconstruction and prior enforcement:

L ∝ MSE + β · KLD

Figure: CVAE reconstruction MSE vs latent space dimensionality for β = 0.01, 0.1 and 1

I Tradeoff between both terms; we choose β = 0.1 (a sketch of this objective follows below)
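A sketch of this objective in PyTorch, assuming a Gaussian approximate posterior with diagonal covariance; the averaging over batch and dimensions is an assumption:

import torch

def cvae_loss(x, x_hat, mu, logvar, beta=0.1):
    # Reconstruction term: MSE between input and reconstructed CC vectors
    mse = torch.mean((x - x_hat) ** 2)
    # KL divergence of N(mu, diag(exp(logvar))) from the standard normal prior
    kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return mse + beta * kld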

2036

Network Architecture

I Main hyperparameters -

2 Dimensionality of the latent space - governs the network's reconstruction ability

Figure: Reconstruction MSE vs latent space dimensionality, CVAE (β = 0.1) vs AE

I Steep fall initially, flatter later; we choose dimensionality = 32

2136

Network Architecture

I Network size [91, 91, 32, 91, 91]

I 91 is the dimension of the input CCs, 32 is the latent space dimensionality

I Linear fully connected layers + Leaky ReLU activations

I ADAM [Kingma and Ba 2014] with initial lr = 10^-3

I Training for 2000 epochs with batch size 512; a sketch of this architecture is given below
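A PyTorch sketch consistent with the sizes listed above; the way f0 enters the network (here a scalar appended to both the encoder and decoder inputs) is an assumption about the conditioning scheme:

import torch
import torch.nn as nn

class CVAE(nn.Module):
    def __init__(self, cc_dim=91, cond_dim=1, hidden=91, latent=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(cc_dim + cond_dim, hidden), nn.LeakyReLU())
        self.mu = nn.Linear(hidden, latent)
        self.logvar = nn.Linear(hidden, latent)
        self.dec = nn.Sequential(
            nn.Linear(latent + cond_dim, hidden), nn.LeakyReLU(),
            nn.Linear(hidden, cc_dim))

    def forward(self, cc, f0):
        h = self.enc(torch.cat([cc, f0], dim=-1))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
        return self.dec(torch.cat([z, f0], dim=-1)), mu, logvar

model = CVAE()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # initial learning rate as on the slide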

2236

Experiments

I Two kinds of experiments to demonstrate the network's capabilities

1 Reconstruction - omit pitch instances during training and see how well the model reconstructs notes of the omitted target pitch

2 Generation - how well the model 'synthesizes' note instances with new, unseen pitches

2336

Reconstruction

I Two training contexts -

1 T (×) is the target, X marks training instances:
MIDI: T-3  T-2  T-1   T   T+1  T+2  T+3
Kept:  X    X    X    ×    X    X    X

2 Octave endpoints:
MIDI: 60 61 62 63 64 65 66 67 68 69 70 71
Kept:  X  ×  ×  ×  ×  ×  ×  ×  ×  ×  ×  X

I In each of the above cases we compute the MSE as the frame-wise spectral envelope match across all frames of all the target instances

2436

Reconstruction

Figure: Reconstruction MSE vs target pitch (MIDI 60-72) for AE and CVAE

I Both the AE and the CVAE reasonably reconstruct the target pitch (audio examples 6, 7, 8)

I The CVAE produces better reconstruction, especially when the target pitch is far from the pitches available in the training data (plot above; audio examples 9, 10, 11)

I Conditioning helps to capture the pitch dependency of the spectral envelope more accurately


2536

Reconstruction

I To emulate the effect of pitch conditioning with an AE, we train the AE by appending the pitch to the input CCs and reconstructing this appended input, as shown below

Diagram: [CCs, f0] → cAE → [CCs′, f0′]

Figure: 'Conditional' AE (cAE)

I [Wyse 2018] followed a similar approach of appending the conditional variables to the input of his model

I The 'cAE' is comparable to our proposed CVAE in that the network might potentially learn something from f0

2636

Reconstruction

Figure: Reconstruction MSE vs target pitch (MIDI 60-72) for AE, CVAE and cAE

I We train the cAE on the octave endpoints and evaluate its performance as shown above

I Appending f0 does not improve performance; it is in fact worse than the AE

2736

Generation

I We are interested in 'synthesizing' audio

I We see how well the network can generate an instance of a desired pitch (which the network has not been trained on)

Diagram: Z, f0 → Decoder

Figure: Sampling from the network

I We follow the procedure in [Blaauw and Bonada 2016] - a random walk to sample points coherently from the latent space (a sketch follows below)
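A sketch of this sampling procedure, reusing the hypothetical CVAE decoder from the architecture sketch above; the step size and number of frames are illustrative choices:

import torch

@torch.no_grad()
def generate_note(model, f0, n_frames=200, latent=32, step=0.05):
    z = torch.zeros(latent)                      # start from the prior mean
    cond = torch.full((1,), float(f0))
    frames = []
    for _ in range(n_frames):
        z = z + step * torch.randn(latent)       # small coherent step: a random walk in Z
        frames.append(model.dec(torch.cat([z, cond], dim=-1)))
    return torch.stack(frames)                   # CC frames -> envelopes -> additive synthesis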

2836

Generation

I We train on instances across the entire octave sans MIDI 65 and then generate MIDI 65

I On listening, we can hear that the generated audio still lacks the soft, noisy sound of the violin bowing (audio examples 12, 13, 14)

I For a more realistic synthesis, we generate a violin note with vibrato (audio example 15)


2936

Putting it all together

I We explored autoencoder frameworks as generative models for audio synthesis of instrumental tones

I Our parametric representation decouples 'timbre' and 'pitch', relying on the network to model their inter-dependencies

I Pitch conditioning allows the network to generate the learnt spectral envelope for that pitch, enabling us to vary the pitch contour continuously

3036

Putting it all together

I Our model is still far from perfect, however, and we have to improve it further

1 An immediate next step will be to model the residual of the input audio; this will help make the generated audio sound more natural

2 We have not taken dynamics into account; a more complete system will involve capturing spectral envelope (timbral) dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis of larger pitch movements or melodic elements from Indian Classical music

I Our contributions

1 To the best of our knowledge, we have not come across any work using a parametric model for musical tones in the neural synthesis framework, especially exploiting the conditioning function of the CVAE

2 The TAE method has no known Python implementation, so we plan to make our code open-source to the MIR community for research

3136

References I

[Blaauw and Bonada, 2016] Blaauw, M. and Bonada, J. (2016). Modeling and transforming speech using variational autoencoders. In Interspeech, pages 1770–1774.

[Caetano and Rodet, 2012] Caetano, M. and Rodet, X. (2012). A source-filter model for musical instrument sound transformation. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 137–140. IEEE.

[Doersch, 2016] Doersch, C. (2016). Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.

[Engel et al., 2017] Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D., and Simonyan, K. (2017). Neural audio synthesis of musical notes with WaveNet autoencoders. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1068–1077. JMLR.org.

[Esling et al., 2018] Esling, P., Bitton, A., et al. (2018). Generative timbre spaces: regularizing variational auto-encoders with perceptual metrics. arXiv preprint arXiv:1805.08501.

3236

References II

[Griffin and Lim, 1984] Griffin, D. and Lim, J. (1984). Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236–243.

[Hinton and Salakhutdinov, 2006] Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507.

[Imai, 1979] Imai, S. (1979). Spectral envelope extraction by improved cepstrum. IEICE, 62:217–228.

[Kingma and Ba, 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

[Kingma and Welling, 2013] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

[Oord et al., 2016] Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.

3336

References III

[Roche et al., 2018] Roche, F., Hueber, T., Limier, S., and Girin, L. (2018). Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models. arXiv preprint arXiv:1806.04096.

[Roebel and Rodet, 2005] Roebel, A. and Rodet, X. (2005). Efficient spectral envelope estimation and its application to pitch shifting and envelope preservation. In International Conference on Digital Audio Effects, pages 30–35, Madrid, Spain.

[Romani Picas et al., 2015] Romani Picas, O., Parra Rodriguez, H., Dabiri, D., Tokuda, H., Hariya, W., Oishi, K., and Serra, X. (2015). A real-time system for measuring sound goodness in instrumental sounds. In Audio Engineering Society Convention 138. Audio Engineering Society.

[Sarroff and Casey, 2014] Sarroff, A. M. and Casey, M. A. (2014). Musical audio synthesis using autoencoding neural nets. In ICMC.

[Serra et al., 1997] Serra, X. et al. (1997). Musical sound modeling with sinusoids plus noise. Musical Signal Processing, pages 91–122.

3436

References IV

[Slawson, 1981] Slawson, W. (1981). The color of sound: a theoretical study in musical timbre. Music Theory Spectrum, 3:132–141.

[Sohn et al., 2015] Sohn, K., Lee, H., and Yan, X. (2015). Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483–3491.

[Wyse, 2018] Wyse, L. (2018). Real-valued parametric conditioning of an RNN for interactive sound synthesis. arXiv preprint arXiv:1805.10808.

3536

Audio examples description I

1 Input MIDI 63 to the Spectral Model

2 Spectral Model reconstruction (trained on MIDI 63)

3 Spectral Model reconstruction (not trained on MIDI 63)

4 Input MIDI 60 note to the Parametric Model

5 Parametric reconstruction of the input note

6 Input MIDI 63 note

7 Parametric AE reconstruction of the input

8 Parametric CVAE reconstruction of the input

9 Input MIDI 65 note (endpoint-trained model)

10 Parametric AE reconstruction of the input (endpoint-trained model)

11 Parametric CVAE reconstruction of the input (endpoint-trained model)

12 CVAE-generated MIDI 65 Violin note

13 Similar MIDI 65 Violin note from the dataset

3636

Audio examples description II

14 CVAE reconstruction of the MIDI 65 Violin note

15 CVAE-generated MIDI 65 Violin note with vibrato


736

Dataset

I Good-sounds dataset [Romani Picas et al 2015] consisting ofindividual note and scale recordings for 12 different instruments

I We work with the lsquoViolinrsquo played in mezzo-forte loudnessand choose the 4th octave(MIDI 60-71)

0 5 10 15 20 25 30 35

606162636465666768697071

282626262626

343030

262626

Instances

MID

I

Figure Instances per note in the overall dataset

I Average note duration for the chosen octave is about 45s pernote

736

Dataset

I Good-sounds dataset [Romani Picas et al 2015] consisting ofindividual note and scale recordings for 12 different instruments

I We work with the lsquoViolinrsquo played in mezzo-forte loudnessand choose the 4th octave(MIDI 60-71)

Why we chose ViolinPopular in Indian Music Human voice-like timbre

Ability to produce continuous pitch

0 5 10 15 20 25 30 35

606162636465666768697071

282626262626

343030

262626

Instances

MID

I

Figure Instances per note in the overall dataset

I Average note duration for the chosen octave is about 45s pernote

836

Dataset

I We split the data to train(80) and test(20) instances acrossMIDI note labels

I Our model is trained with frames(duration 213ms) from thetrain instances and system performance is evaluated withframes from the test instances

936

Non-Parametric Reconstruction

I Framewise magnitude spectra reconstruction procedure in[Roche et al 2018] cannot generalize to untrained pitches

I We consider two cases - Train including and excluding MIDI 63along with its 3 neighbouring pitches on either side

Sustain

FFT

En

cod

er

Z

Dec

od

er

AE

I Repeat above across all input spectral frames and invertobtained spectrogram using Griffin-Lim [Griffin and Lim 1984]

936

Non-Parametric Reconstruction

I Framewise magnitude spectra reconstruction procedure in[Roche et al 2018] cannot generalize to untrained pitches

I We consider two cases - Train including and excluding MIDI 63along with its 3 neighbouring pitches on either side

Sustain

FFT

En

cod

er

Z

Dec

od

er

AE

I Repeat above across all input spectral frames and invertobtained spectrogram using Griffin-Lim [Griffin and Lim 1984]

936

Non-Parametric Reconstruction

I Framewise magnitude spectra reconstruction procedure in[Roche et al 2018] cannot generalize to untrained pitches

I We consider two cases - Train including and excluding MIDI 63along with its 3 neighbouring pitches on either side

Sustain

FFT

En

cod

er

Z

Dec

od

er

AE

I Repeat above across all input spectral frames and invertobtained spectrogram using Griffin-Lim [Griffin and Lim 1984]

1036

Non-Parametric Reconstruction

Figure Input MIDI 63 1 1

Figure Including MIDI 63 2 2 Figure Excluding MIDI 63 3 3

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton1)ocgs[i]state=false

1728

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton3)ocgs[i]state=false

1728

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton5)ocgs[i]state=false

1728

1136

Parametric Model

1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)

Sustain

FFTHpR

f0Magnitudes

I Output of HpR block =rArr log-dB magnitudes + harmonics

1136

Parametric Model

1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)

Sustain

FFTHpR

f0Magnitudes

I Output of HpR block =rArr log-dB magnitudes + harmonics

1136

Parametric Model

1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)

Sustain

FFTHpR

f0Magnitudes

I Output of HpR block =rArr log-dB magnitudes + harmonics

1236

Parametric Model

2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]

I TAE =rArr Iterative Cepstral Liftering

A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))

Vi = FFT (lifter(Ai ))

f0Magnitudes

TAE

CCs

f0

1236

Parametric Model

2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]

I TAE =rArr Iterative Cepstral Liftering

A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))

Vi = FFT (lifter(Ai ))

f0Magnitudes

TAE

CCs

f0

1236

Parametric Model

2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]

I TAE =rArr Iterative Cepstral Liftering

A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))

Vi = FFT (lifter(Ai ))

f0Magnitudes

TAE

CCs

f0

1336

Parametric Model

I No open source implementation available for the TAE thus weimplemented it following procedure highlighted in[Roebel and Rodet 2005 Caetano and Rodet 2012]

I Figure shows a TAE snap from [Caetano and Rodet 2012]

I Similar to results we get 1 4 2 5

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton6)ocgs[i]state=false

3216

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton7)ocgs[i]state=false

2256

1436

Parametric Model

0 1 2 3 4 5minus80

minus60

minus40

minus20

Frequency(kHz)Magnitude(dB)

606571

I Spectral envelope shape varies across pitch

1 Dependence of envelope on pitch[Slawson 1981 Caetano and Rodet 2012]

2 Variation due the TAE algorithm

I TAE rarr smooth function to estimate harmonic amplitudes

1536

Parametric Model

I No of CCs(Cepstral CoefficientsKcc) depends on Samplingrate(Fs)pitch(f0)

Kcc leFs2f0

I For our network we choose the maximum Kcc(lowest pitch) =91 and zero pad the high pitch Kcc to 91 dimension vectors

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1736

Generative Models

Why VAE over AE

Continuous latent space from which we can sample points(andsynthesize the corresponding audio)

Why CVAE over VAE

Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control

1736

Generative Models

Why VAE over AE

Continuous latent space from which we can sample points(andsynthesize the corresponding audio)

Why CVAE over VAE

Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control

1736

Generative Models

Why VAE over AE

Continuous latent space from which we can sample points(andsynthesize the corresponding audio)

Why CVAE over VAE

Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control

1736

Generative Models

Why VAE over AE

Continuous latent space from which we can sample points(andsynthesize the corresponding audio)

Why CVAE over VAE

Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control

1836

Network Architecture

CCs

f0

En

cod

er

Z

Dec

od

er

CVAE

I Network input is CCs rarr MSE represents perceptually relevantdistance in terms of squared error between the input andreconstructed log magnitude spectral envelopes

I We train the network on frames from the train instances

I For evaluation MSE values calculated ahead is the averagereconstruction error across all the test instance frames

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

L prop MSE + βKLD

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

2036

Network Architecture

I Main hyperparameters -

2 Dimensionality of latent space - networks reconstruction ability

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

CVAEAE

Figure CVAE(β = 01) vs AE

I Steep fall initially flatter later Choose dimensionality = 32

2036

Network Architecture

I Main hyperparameters -

2 Dimensionality of latent space - networks reconstruction ability

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

CVAEAE

Figure CVAE(β = 01) vs AE

I Steep fall initially flatter later Choose dimensionality = 32

2036

Network Architecture

I Main hyperparameters -

2 Dimensionality of latent space - networks reconstruction ability

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

CVAEAE

Figure CVAE(β = 01) vs AE

I Steep fall initially flatter later Choose dimensionality = 32

2136

Network Architecture

I Network size [91 91 32 91 91]I 91 is the dimension of input CCs 32 is latent space

dimensionality

I Linear Fully Connected Layers + Leaky ReLU activations

I ADAM [Kingma and Ba 2014] with initial lr = 10minus3

I Training for 2000 epochs with batch size 512

2236

Experiments

I Two kinds of experiments to demonstrate networks capabilities

1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch

2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches

2236

Experiments

I Two kinds of experiments to demonstrate networks capabilities

1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch

2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches

2236

Experiments

I Two kinds of experiments to demonstrate networks capabilities

1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch

2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances


24/36

Reconstruction

Figure: Reconstruction MSE (of order 10^-2) vs. target pitch (MIDI 60-72) for AE and CVAE

• Both AE and CVAE reasonably reconstruct the target pitch (audio examples 6, 7, 8)
• The CVAE produces better reconstruction, especially when the target pitch is far from the pitches available in the training data (plot above; audio examples 9, 10, 11)
• Conditioning helps to capture the pitch dependency of the spectral envelope more accurately


25/36

Reconstruction
• To emulate the effect of pitch conditioning with an AE, we train the AE by appending the pitch to the input CCs and reconstructing this appended input, as shown below:

   f0, CCs → cAE → f0′, CCs′

Figure: 'Conditional' AE (cAE)

• [Wyse 2018] followed a similar approach of appending the conditional variables to the input of his model
• The 'cAE' is comparable to our proposed CVAE in that the network might potentially learn something from f0

26/36

Reconstruction

Figure: Reconstruction MSE (of order 10^-2) vs. target pitch (MIDI 60-72) for AE, CVAE and cAE

• We train the cAE on the octave endpoints and evaluate performance as shown above
• Appending f0 does not improve performance; it is in fact worse than the AE

27/36

Generation
• We are interested in 'synthesizing' audio
• We see how well the network can generate an instance of a desired pitch (which the network has not been trained on)

   Z, f0 → Decoder

Figure: Sampling from the Network

• We follow the procedure in [Blaauw and Bonada 2016] - a random walk to sample points coherently from the latent space (sketch below)
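A small sketch of this sampling procedure, reusing the CVAE sketch from earlier; the step size, number of frames, and the scalar way f0 is fed in are illustrative assumptions rather than the authors' settings.

import torch

z_dim, n_frames, step = 32, 200, 0.05
f0 = torch.full((1, 1), 65.0)                # desired (untrained) MIDI pitch as the conditioning input
z = torch.randn(1, z_dim)                    # start from a draw of the unit-Gaussian prior
frames = []
model.eval()
with torch.no_grad():
    for _ in range(n_frames):
        z = z + step * torch.randn_like(z)                 # small steps keep successive frames coherent
        cc_frame = model.dec(torch.cat([z, f0], dim=-1))   # decode one spectral-envelope frame
        frames.append(cc_frame)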

28/36

Generation
• We train on instances across the entire octave sans MIDI 65, and then generate MIDI 65
• On listening, we can see that the generated audio still lacks the soft noisy sound of the violin bowing (audio examples 12, 13, 14)
• For a more realistic synthesis, we generate a violin note with vibrato (audio example 15)


29/36

Putting it all together
• We explored autoencoder frameworks in generative models for audio synthesis of instrumental tones
• Our parametric representation decouples 'timbre' and 'pitch', thus relying on the network to model the inter-dependencies
• Pitch conditioning allows us to generate the learnt spectral envelope for that pitch, thus enabling us to vary the pitch contour continuously


30/36

Putting it all together
• Our model is still far from perfect, however, and we have to improve it further:

1. An immediate thing to do will be to model the residual of the input audio. This will help in making the generated audio sound more natural
2. We have not taken dynamics into account. A more complete system will involve capturing spectral envelope (timbral) dependencies on both pitch and loudness dynamics
3. Conducting more formal listening tests, involving synthesis of larger pitch movements or melodic elements from Indian Classical music

• Our contributions:

1. To the best of our knowledge, we have not come across any work using a parametric model for musical tones in the neural synthesis framework, especially exploiting the conditioning function of the CVAE
2. The TAE method has no known Python implementation, so we plan to make our code open-source to the MIR community for research


31/36

References I

[Blaauw and Bonada 2016] Blaauw, M. and Bonada, J. (2016). Modeling and transforming speech using variational autoencoders. In Interspeech, pages 1770-1774.

[Caetano and Rodet 2012] Caetano, M. and Rodet, X. (2012). A source-filter model for musical instrument sound transformation. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 137-140. IEEE.

[Doersch 2016] Doersch, C. (2016). Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.

[Engel et al 2017] Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D., and Simonyan, K. (2017). Neural audio synthesis of musical notes with WaveNet autoencoders. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1068-1077. JMLR.org.

[Esling et al 2018] Esling, P., Bitton, A., et al. (2018). Generative timbre spaces: regularizing variational auto-encoders with perceptual metrics. arXiv preprint arXiv:1805.08501.

32/36

References II

[Griffin and Lim 1984] Griffin, D. and Lim, J. (1984). Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236-243.

[Hinton and Salakhutdinov 2006] Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504-507.

[IMAI 1979] Imai, S. (1979). Spectral envelope extraction by improved cepstrum. IEICE, 62:217-228.

[Kingma and Ba 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

[Kingma and Welling 2013] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

[Oord et al 2016] Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.

33/36

References III

[Roche et al 2018] Roche, F., Hueber, T., Limier, S., and Girin, L. (2018). Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models. arXiv preprint arXiv:1806.04096.

[Roebel and Rodet 2005] Roebel, A. and Rodet, X. (2005). Efficient spectral envelope estimation and its application to pitch shifting and envelope preservation. In International Conference on Digital Audio Effects, pages 30-35, Madrid, Spain.

[Romani Picas et al 2015] Romani Picas, O., Parra Rodriguez, H., Dabiri, D., Tokuda, H., Hariya, W., Oishi, K., and Serra, X. (2015). A real-time system for measuring sound goodness in instrumental sounds. In Audio Engineering Society Convention 138. Audio Engineering Society.

[Sarroff and Casey 2014] Sarroff, A. M. and Casey, M. A. (2014). Musical audio synthesis using autoencoding neural nets. In ICMC.

[Serra et al 1997] Serra, X. et al. (1997). Musical sound modeling with sinusoids plus noise. Musical Signal Processing, pages 91-122.

34/36

References IV

[Slawson 1981] Slawson, W. (1981). The color of sound: a theoretical study in musical timbre. Music Theory Spectrum, 3:132-141.

[Sohn et al 2015] Sohn, K., Lee, H., and Yan, X. (2015). Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483-3491.

[Wyse 2018] Wyse, L. (2018). Real-valued parametric conditioning of an RNN for interactive sound synthesis. arXiv preprint arXiv:1805.10808.

35/36

Audio examples description I

1. Input MIDI 63 to Spectral Model
2. Spectral Model Reconstruction (trained on MIDI 63)
3. Spectral Model Reconstruction (not trained on MIDI 63)
4. Input MIDI 60 note to Parametric Model
5. Parametric Reconstruction of input note
6. Input MIDI 63 Note
7. Parametric AE reconstruction of input
8. Parametric CVAE reconstruction of input
9. Input MIDI 65 note (endpoint-trained model)
10. Parametric AE reconstruction of input (endpoint-trained model)
11. Parametric CVAE reconstruction of input (endpoint-trained model)
12. CVAE Generated MIDI 65 Violin note
13. Similar MIDI 65 Violin note from dataset

36/36

Audio examples description II

14. CVAE Reconstruction of the MIDI 65 violin note
15. CVAE Generated MIDI 65 Violin note with vibrato


7/36

Dataset
• Good-sounds dataset [Romani Picas et al 2015], consisting of individual note and scale recordings for 12 different instruments
• We work with the 'Violin' played in mezzo-forte loudness, and choose the 4th octave (MIDI 60-71)

Why we chose the Violin: popular in Indian music, human voice-like timbre, and the ability to produce continuous pitch

Figure: Instances per note (MIDI 60-71) in the overall dataset (roughly 26-34 instances per note)

• The average note duration for the chosen octave is about 4.5 s per note

8/36

Dataset
• We split the data into train (80%) and test (20%) instances across MIDI note labels
• Our model is trained with frames (duration 21.3 ms) from the train instances, and system performance is evaluated with frames from the test instances (a sketch of this split follows below)
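A minimal sketch of such a per-pitch split and framing; the 1024-sample frame length (≈ 21.3 ms at an assumed 48 kHz rate), the hop size, and the dictionary layout are assumptions for illustration.

import numpy as np

def split_and_frame(notes_by_midi, frame_len=1024, hop=512, train_frac=0.8):
    """notes_by_midi: dict mapping a MIDI label to a list of mono note recordings (hypothetical layout)."""
    train_frames, test_frames = [], []
    for midi, instances in notes_by_midi.items():
        n_train = int(train_frac * len(instances))        # 80/20 split within each pitch label
        for i, audio in enumerate(instances):
            frames = [audio[s:s + frame_len]
                      for s in range(0, len(audio) - frame_len + 1, hop)]
            (train_frames if i < n_train else test_frames).extend(frames)
    return np.array(train_frames), np.array(test_frames)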

9/36

Non-Parametric Reconstruction
• The frame-wise magnitude spectra reconstruction procedure in [Roche et al 2018] cannot generalize to untrained pitches
• We consider two cases - train including and excluding MIDI 63, along with its 3 neighbouring pitches on either side

   Sustain → FFT → Encoder → Z → Decoder (AE)

• Repeat the above across all input spectral frames and invert the obtained spectrogram using Griffin-Lim [Griffin and Lim 1984] (sketch below)
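A short sketch of the inversion step using librosa's Griffin-Lim; the variable decoded_frames (the AE's reconstructed magnitude spectra) and the STFT parameters are illustrative assumptions.

import numpy as np
import librosa

# decoded_frames: list of reconstructed magnitude spectra, one per analysis frame (hypothetical)
S_hat = np.stack(decoded_frames, axis=1)    # magnitude spectrogram, shape (n_fft//2 + 1, n_frames)
audio = librosa.griffinlim(S_hat, n_iter=60, hop_length=512, win_length=1024)  # iterative phase estimation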


10/36

Non-Parametric Reconstruction

Figure: Input MIDI 63 (audio example 1)
Figure: Reconstruction, trained including MIDI 63 (audio example 2)
Figure: Reconstruction, trained excluding MIDI 63 (audio example 3)


11/36

Parametric Model
1. Frame-wise magnitude spectrum → harmonic representation using the Harmonic plus Residual (HpR) model [Serra et al 1997] (currently we neglect the residual)

   Sustain → FFT → HpR → f0, Magnitudes

• Output of the HpR block ⇒ log-dB magnitudes + harmonics


12/36

Parametric Model
2. log-dB magnitudes + harmonics → TAE algorithm [Roebel and Rodet 2005, IMAI 1979]

• TAE ⇒ Iterative Cepstral Liftering:

   A_0(k) = log(|X(k)|),  V_0 = −∞
   while (A_i − A_0 < ∆):
       A_i(k) = max(A_{i−1}(k), V_{i−1}(k))
       V_i = FFT(lifter(A_i))

   f0, Magnitudes → TAE → CCs, f0

(a Python sketch of this iteration follows below)
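A minimal NumPy sketch of this iteration, assuming a single frame's linear magnitude spectrum and a chosen number of cepstral coefficients k_cc; the stopping test and iteration cap are illustrative choices, not the authors' exact settings.

import numpy as np

def true_amplitude_envelope(mag_spectrum, k_cc, n_iter=100, tol=1e-4):
    A0 = np.log(np.maximum(mag_spectrum, 1e-10))   # A_0(k) = log|X(k)|
    A = A0.copy()
    V = np.full_like(A0, -np.inf)                  # V_0 = -inf
    for _ in range(n_iter):
        A = np.maximum(A, V)                       # A_i(k) = max(A_{i-1}(k), V_{i-1}(k))
        cep = np.fft.irfft(A)                      # real cepstrum of the current estimate
        cep[k_cc:-k_cc] = 0.0                      # lifter: keep only the first k_cc coefficients
        V = np.fft.rfft(cep).real                  # V_i = FFT(lifter(A_i)), the smooth envelope
        if np.max(A0 - V) < tol:                   # stop once the envelope covers every spectral peak
            break
    return V                                       # smooth log-magnitude envelope (the TAE)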


13/36

Parametric Model
• No open-source implementation is available for the TAE, thus we implemented it following the procedure highlighted in [Roebel and Rodet 2005, Caetano and Rodet 2012]
• The figure shows a TAE snapshot from [Caetano and Rodet 2012]
• Similar to the results we get (audio examples 4, 5)


14/36

Parametric Model

Figure: TAE spectral envelopes, magnitude (dB) vs. frequency (kHz), for MIDI 60, 65 and 71

• Spectral envelope shape varies across pitch:
1. Dependence of the envelope on pitch [Slawson 1981, Caetano and Rodet 2012]
2. Variation due to the TAE algorithm
• TAE → smooth function to estimate harmonic amplitudes

15/36

Parametric Model
• The number of CCs (Cepstral Coefficients, K_cc) depends on the sampling rate (Fs) and the pitch (f0):

   K_cc ≤ Fs / (2 f0)

• For our network we choose the maximum K_cc (lowest pitch) = 91, and zero-pad the higher-pitch CC vectors to 91-dimensional vectors (sketch below)
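A small sketch of this bound and the zero-padding; the 48 kHz sampling rate and the use of the MIDI-60 fundamental are assumptions chosen so that the lowest pitch of the octave gives K_cc = 91.

import numpy as np

FS = 48000                                   # assumed sampling rate
f0 = 440.0 * 2 ** ((60 - 69) / 12)           # MIDI 60 ≈ 261.6 Hz, the lowest pitch of the octave
k_cc_max = int(FS / (2 * f0))                # K_cc ≤ Fs / (2 f0)  →  91 for MIDI 60

def pad_ccs(cc, target_dim=91):
    """Zero-pad a higher-pitch CC vector (fewer coefficients) to the fixed network input size."""
    out = np.zeros(target_dim)
    out[:len(cc)] = cc
    return out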

16/36

Generative Models
• Autoencoders [Hinton and Salakhutdinov 2006] - minimize the MSE (Mean Squared Error) between the input and the network reconstruction
  - Simple to train, and perform good reconstruction (since they minimize the MSE)
  - Not truly a generative model, as you cannot generate new data
• Variational Autoencoders [Kingma and Welling 2013] - inspired by Variational Inference; enforce a prior on the latent space
  - Truly generative, as we can 'generate' new data by sampling from the prior
• Conditional Variational Autoencoders [Doersch 2016, Sohn et al 2015] - same principle as a VAE; however, they learn the conditional distribution over an additional conditioning variable


17/36

Generative Models

Why VAE over AE?
A continuous latent space from which we can sample points (and synthesize the corresponding audio)

Why CVAE over VAE?
Conditioning on pitch ⇒ the network captures dependencies between the timbre and the pitch ⇒ more accurate envelope generation + pitch control


18/36

Network Architecture

   CCs, f0 → Encoder → Z → Decoder → CCs′ (CVAE)

• The network input is CCs → the MSE represents a perceptually relevant distance, in terms of the squared error between the input and reconstructed log magnitude spectral envelopes
• We train the network on frames from the train instances
• For evaluation, the MSE values reported ahead are the average reconstruction error across all the test instance frames

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

L prop MSE + βKLD

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

2036

Network Architecture

I Main hyperparameters -

2 Dimensionality of latent space - networks reconstruction ability

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

CVAEAE

Figure CVAE(β = 01) vs AE

I Steep fall initially flatter later Choose dimensionality = 32

2036

Network Architecture

I Main hyperparameters -

2 Dimensionality of latent space - networks reconstruction ability

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

CVAEAE

Figure CVAE(β = 01) vs AE

I Steep fall initially flatter later Choose dimensionality = 32

2036

Network Architecture

I Main hyperparameters -

2 Dimensionality of latent space - networks reconstruction ability

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

CVAEAE

Figure CVAE(β = 01) vs AE

I Steep fall initially flatter later Choose dimensionality = 32

2136

Network Architecture

I Network size [91 91 32 91 91]I 91 is the dimension of input CCs 32 is latent space

dimensionality

I Linear Fully Connected Layers + Leaky ReLU activations

I ADAM [Kingma and Ba 2014] with initial lr = 10minus3

I Training for 2000 epochs with batch size 512

2236

Experiments

I Two kinds of experiments to demonstrate networks capabilities

1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch

2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches

2236

Experiments

I Two kinds of experiments to demonstrate networks capabilities

1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch

2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches

2236

Experiments

I Two kinds of experiments to demonstrate networks capabilities

1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch

2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2436

Reconstruction

60 62 64 66 68 70 72

1

2

3

4

5middot10minus2

Pitch

MSE

AECVAE

I Both AE and CVAE reasonably reconstruct the target pitch1 6 2 7 3 8

I CVAE produces better reconstruction especially when thetarget pitch is far from the pitches available in the trainingdata(plot above) 4 9 5 10 6 11

I Conditioning helps to capture the pitch dependency of thespectral envelope more accurately

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton8)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton9)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton10)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton11)ocgs[i]state=false

1536

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton12)ocgs[i]state=false

1536

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton13)ocgs[i]state=false

1536

2536

Reconstruction

I To emulate the effect of pitch conditioning with an AE wetrain the AE by appending the pitch to the input CCs andreconstructing this appended input as shown below

f0

CCscAE

fprime

0

CCsprime

Figure lsquoConditionalrsquo AE(cAE)

I [Wyse 2018] followed a similar approach of appending theconditional variables to the input of his model

I the lsquocAErsquo is comparable to our proposed CVAE in that thenetwork might potentially learn something from f0

2636

Reconstruction

60 62 64 66 68 70 72

1

2

3

4

5middot10minus2

Pitch

MSE

AECVAEcAE

I We train the cAE on the octave endpoints and evaluateperformance as shown above

I Appending f0 does not improve performance It is infact worsethan AE

2736

Generation

I We are interested in lsquosynthesizingrsquo audio

I We see how well the network can generate an instance of adesired pitch(which the network has not been trained on)

Z

Dec

od

er

f0

Figure Sampling from the Network

I We follow the procedure in [Blaauw and Bonada 2016] - arandom walk to sample points coherently from the latent space

2836

Generation

I We train on instances across the entire octave sans MIDI 65and then generate MIDI 65

I On listening we can see that the generated audio still lacks thesoft noisy sound of the violin bowing

1 12 2 13 3 14

I For a more realistic synthesis we generate a violin note withvibrato 4 15

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton14)ocgs[i]state=false

2184

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton15)ocgs[i]state=false

2256

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton16)ocgs[i]state=false

2184

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton17)ocgs[i]state=false

2184

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3136

References I

[Blaauw and Bonada 2016] Blaauw M and Bonada J (2016)Modeling and transforming speech using variational autoencodersIn Interspeech pages 1770ndash1774

[Caetano and Rodet 2012] Caetano M and Rodet X (2012)A source-filter model for musical instrument sound transformationIn 2012 IEEE International Conference on Acoustics Speech and Signal Processing(ICASSP) pages 137ndash140 IEEE

[Doersch 2016] Doersch C (2016)Tutorial on variational autoencodersarXiv preprint arXiv160605908

[Engel et al 2017] Engel J Resnick C Roberts A Dieleman S Norouzi MEck D and Simonyan K (2017)Neural audio synthesis of musical notes with wavenet autoencodersIn Proceedings of the 34th International Conference on Machine Learning-Volume70 pages 1068ndash1077 JMLR org

[Esling et al 2018] Esling P Bitton A et al (2018)Generative timbre spaces regularizing variational auto-encoders with perceptualmetricsarXiv preprint arXiv180508501

3236

References II

[Griffin and Lim 1984] Griffin D and Lim J (1984)Signal estimation from modified short-time fourier transformIEEE Transactions on Acoustics Speech and Signal Processing 32(2)236ndash243

[Hinton and Salakhutdinov 2006] Hinton G E and Salakhutdinov R R (2006)Reducing the dimensionality of data with neural networksscience 313(5786)504ndash507

[IMAI 1979] IMAI S (1979)Spectral envelope extraction by improved cepstrumIEICE 62217ndash228

[Kingma and Ba 2014] Kingma D P and Ba J (2014)Adam A method for stochastic optimizationarXiv preprint arXiv14126980

[Kingma and Welling 2013] Kingma D P and Welling M (2013)Auto-encoding variational bayesarXiv preprint arXiv13126114

[Oord et al 2016] Oord A v d Dieleman S Zen H Simonyan K Vinyals OGraves A Kalchbrenner N Senior A and Kavukcuoglu K (2016)Wavenet A generative model for raw audioarXiv preprint arXiv160903499

3336

References III

[Roche et al 2018] Roche F Hueber T Limier S and Girin L (2018)Autoencoders for music sound modeling a comparison of linear shallow deeprecurrent and variational modelsarXiv preprint arXiv180604096

[Roebel and Rodet 2005] Roebel A and Rodet X (2005)Efficient Spectral Envelope Estimation and its application to pitch shifting andenvelope preservationIn International Conference on Digital Audio Effects pages 30ndash35 Madrid Spaincote interne IRCAM Roebel05b

[Romani Picas et al 2015] Romani Picas O Parra Rodriguez H Dabiri DTokuda H Hariya W Oishi K and Serra X (2015)A real-time system for measuring sound goodness in instrumental soundsIn Audio Engineering Society Convention 138 Audio Engineering Society

[Sarroff and Casey 2014] Sarroff A M and Casey M A (2014)Musical audio synthesis using autoencoding neural netsIn ICMC

[Serra et al 1997] Serra X et al (1997)Musical sound modeling with sinusoids plus noiseMusical signal processing pages 91ndash122

3436

References IV

[Slawson 1981] Slawson W (1981)The color of sound a theoretical study in musical timbreMusic Theory Spectrum 3132ndash141

[Sohn et al 2015] Sohn K Lee H and Yan X (2015)Learning structured output representation using deep conditional generative models

In Advances in neural information processing systems pages 3483ndash3491

[Wyse 2018] Wyse L (2018)Real-valued parametric conditioning of an rnn for interactive sound synthesisarXiv preprint arXiv180510808

3536

Audio examples description I

1 Input MIDI 63 to Spectral Model

2 Spectral Model Reconstruction(trained on MIDI63)

3 Spectral Model Reconstruction(not trained on MIDI63)

4 Input MIDI 60 note to Parametric Model

5 Parametric Reconstruction of input note

6 Input MIDI 63 Note

7 Parametric AE reconstruction of input

8 Parametric CVAE reconstruction of input

9 Input MIDI 65 note(endpoint trained model)

10 Parametric AE reconstruction of input(endpoint trained model)

11 Parametric CVAE reconstruction of input(endpoint trained model)

12 CVAE Generated MIDI 65 Violin note

13 Similar MIDI 65 Violin note from dataset

3636

Audio examples description II

14 CVAE Reconstruction of the MIDI 65 violin note

15 CVAE Generated MIDI 65 Violin note with vibrato

  • Notes
      1. fdrm17
      2. fdrm16
      3. fdrm15
      4. fdrm14
      5. fdrm13
      6. fdrm12
      7. fdrm11
      8. fdrm10
      9. fdrm9
      10. fdrm8
      11. fdrm7
      12. fdrm6
      13. fdrm5
      14. fdrm3
      15. fdrm1
Page 28: A Variational Parametric Model for Audio Synthesiskrishnasubramani/data/ddp_stage… · analysis-synthesis based reconstruction which fails to model temporal evolution I[Engel et

836

Dataset

I We split the data to train(80) and test(20) instances acrossMIDI note labels

I Our model is trained with frames(duration 213ms) from thetrain instances and system performance is evaluated withframes from the test instances

936

Non-Parametric Reconstruction

I Framewise magnitude spectra reconstruction procedure in[Roche et al 2018] cannot generalize to untrained pitches

I We consider two cases - Train including and excluding MIDI 63along with its 3 neighbouring pitches on either side

Sustain

FFT

En

cod

er

Z

Dec

od

er

AE

I Repeat above across all input spectral frames and invertobtained spectrogram using Griffin-Lim [Griffin and Lim 1984]

936

Non-Parametric Reconstruction

I Framewise magnitude spectra reconstruction procedure in[Roche et al 2018] cannot generalize to untrained pitches

I We consider two cases - Train including and excluding MIDI 63along with its 3 neighbouring pitches on either side

Sustain

FFT

En

cod

er

Z

Dec

od

er

AE

I Repeat above across all input spectral frames and invertobtained spectrogram using Griffin-Lim [Griffin and Lim 1984]

936

Non-Parametric Reconstruction

I Framewise magnitude spectra reconstruction procedure in[Roche et al 2018] cannot generalize to untrained pitches

I We consider two cases - Train including and excluding MIDI 63along with its 3 neighbouring pitches on either side

Sustain

FFT

En

cod

er

Z

Dec

od

er

AE

I Repeat above across all input spectral frames and invertobtained spectrogram using Griffin-Lim [Griffin and Lim 1984]

1036

Non-Parametric Reconstruction

Figure Input MIDI 63 1 1

Figure Including MIDI 63 2 2 Figure Excluding MIDI 63 3 3

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton1)ocgs[i]state=false

1728

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton3)ocgs[i]state=false

1728

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton5)ocgs[i]state=false

1728

1136

Parametric Model

1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)

Sustain

FFTHpR

f0Magnitudes

I Output of HpR block =rArr log-dB magnitudes + harmonics

1136

Parametric Model

1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)

Sustain

FFTHpR

f0Magnitudes

I Output of HpR block =rArr log-dB magnitudes + harmonics

1136

Parametric Model

1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)

Sustain

FFTHpR

f0Magnitudes

I Output of HpR block =rArr log-dB magnitudes + harmonics

1236

Parametric Model

2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]

I TAE =rArr Iterative Cepstral Liftering

A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))

Vi = FFT (lifter(Ai ))

f0Magnitudes

TAE

CCs

f0

1236

Parametric Model

2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]

I TAE =rArr Iterative Cepstral Liftering

A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))

Vi = FFT (lifter(Ai ))

f0Magnitudes

TAE

CCs

f0

1236

Parametric Model

2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]

I TAE =rArr Iterative Cepstral Liftering

A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))

Vi = FFT (lifter(Ai ))

f0Magnitudes

TAE

CCs

f0

1336

Parametric Model

I No open source implementation available for the TAE thus weimplemented it following procedure highlighted in[Roebel and Rodet 2005 Caetano and Rodet 2012]

I Figure shows a TAE snap from [Caetano and Rodet 2012]

I Similar to results we get 1 4 2 5

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton6)ocgs[i]state=false

3216

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton7)ocgs[i]state=false

2256

1436

Parametric Model

0 1 2 3 4 5minus80

minus60

minus40

minus20

Frequency(kHz)Magnitude(dB)

606571

I Spectral envelope shape varies across pitch

1 Dependence of envelope on pitch[Slawson 1981 Caetano and Rodet 2012]

2 Variation due the TAE algorithm

I TAE rarr smooth function to estimate harmonic amplitudes

1536

Parametric Model

I No of CCs(Cepstral CoefficientsKcc) depends on Samplingrate(Fs)pitch(f0)

Kcc leFs2f0

I For our network we choose the maximum Kcc(lowest pitch) =91 and zero pad the high pitch Kcc to 91 dimension vectors

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1736

Generative Models

Why VAE over AE

Continuous latent space from which we can sample points(andsynthesize the corresponding audio)

Why CVAE over VAE

Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control

1736

Generative Models

Why VAE over AE

Continuous latent space from which we can sample points(andsynthesize the corresponding audio)

Why CVAE over VAE

Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control

1736

Generative Models

Why VAE over AE

Continuous latent space from which we can sample points(andsynthesize the corresponding audio)

Why CVAE over VAE

Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control

1736

Generative Models

Why VAE over AE

Continuous latent space from which we can sample points(andsynthesize the corresponding audio)

Why CVAE over VAE

Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control

18/36

Network Architecture

Figure: CVAE architecture - CCs and f0 → Encoder → Z → Decoder → CCs

I Network input is CCs → MSE represents a perceptually relevant distance, in terms of squared error, between the input and reconstructed log magnitude spectral envelopes

I We train the network on frames from the train instances

I For evaluation, the MSE values reported ahead are the average reconstruction error across all the test instance frames

19/36

Network Architecture

I Main hyperparameters -

1 β - Controls the relative weighting between reconstruction and prior enforcement

L ∝ MSE + β · KLD

Figure: CVAE reconstruction MSE vs. latent space dimensionality for β = 0.01, 0.1 and 1

I Tradeoff between both terms; choose β = 0.1
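For illustration, a minimal PyTorch sketch of this β-weighted objective, assuming a diagonal Gaussian posterior whose encoder outputs mu and logvar (function and variable names are ours, not from the talk):

import torch
import torch.nn.functional as F

def cvae_loss(x_hat, x, mu, logvar, beta=0.1):
    # Reconstruction term: MSE between input and reconstructed CC vectors
    mse = F.mse_loss(x_hat, x, reduction="mean")
    # Prior enforcement term: KL( N(mu, sigma^2) || N(0, I) ), averaged over the batch
    kld = -0.5 * torch.mean(torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1))
    return mse + beta * kld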


20/36

Network Architecture

I Main hyperparameters -

2 Dimensionality of latent space - affects the network's reconstruction ability

Figure: Reconstruction MSE vs. latent space dimensionality, CVAE (β = 0.1) vs. AE

I Steep fall initially, flatter later. Choose dimensionality = 32


21/36

Network Architecture

I Network size: [91, 91, 32, 91, 91]
I 91 is the dimension of the input CCs, 32 is the latent space dimensionality

I Linear Fully Connected Layers + Leaky ReLU activations

I ADAM [Kingma and Ba, 2014] with initial lr = 10^-3

I Training for 2000 epochs with batch size 512
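A minimal PyTorch sketch of such a network, under the assumption that f0 is appended to both the encoder input and the latent code (the exact conditioning scheme is not specified on the slide, so treat it as illustrative):

import torch
import torch.nn as nn

class CVAE(nn.Module):
    # 91-dim CC input, 91-dim hidden layers, 32-dim latent space,
    # Linear + Leaky ReLU layers, pitch (f0) appended as conditioning.
    def __init__(self, cc_dim=91, hidden=91, latent=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(cc_dim + 1, hidden), nn.LeakyReLU())
        self.mu = nn.Linear(hidden, latent)
        self.logvar = nn.Linear(hidden, latent)
        self.dec = nn.Sequential(
            nn.Linear(latent + 1, hidden), nn.LeakyReLU(),
            nn.Linear(hidden, cc_dim),
        )

    def forward(self, cc, f0):
        h = self.enc(torch.cat([cc, f0], dim=1))
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: sample z from N(mu, sigma^2)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        x_hat = self.dec(torch.cat([z, f0], dim=1))
        return x_hat, mu, logvar

# Example optimizer setup matching the slide:
# model = CVAE(); opt = torch.optim.Adam(model.parameters(), lr=1e-3)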

22/36

Experiments

I Two kinds of experiments to demonstrate the network's capabilities

1 Reconstruction - Omit pitch instances during training and see how well the model reconstructs notes of the omitted target pitch

2 Generation - How well the model 'synthesizes' note instances with new, unseen pitches


23/36

Reconstruction

I Two training contexts -

1 T (×) is the target, X marks training instances
  MIDI: T-3  T-2  T-1   T   T+1  T+2  T+3
  Kept:  X    X    X    ×    X    X    X

2 Octave endpoints
  MIDI: 60  61  62  63  64  65
  Kept:  X   ×   ×   ×   ×   ×
  MIDI: 66  67  68  69  70  71
  Kept:  ×   ×   ×   ×   ×   X

I In each of the above cases, we compute the MSE as the frame-wise spectral envelope match across all frames of all the target instances


24/36

Reconstruction

Figure: Reconstruction MSE vs. pitch (MIDI 60-72) for AE and CVAE

I Both AE and CVAE reasonably reconstruct the target pitch (audio examples 6, 7, 8)

I CVAE produces better reconstruction, especially when the target pitch is far from the pitches available in the training data (plot above; audio examples 9, 10, 11)

I Conditioning helps to capture the pitch dependency of the spectral envelope more accurately


25/36

Reconstruction

I To emulate the effect of pitch conditioning with an AE, we train the AE by appending the pitch to the input CCs and reconstructing this appended input, as shown below

Figure: 'Conditional' AE (cAE) - [f0, CCs] → cAE → [f0', CCs']

I [Wyse, 2018] followed a similar approach of appending the conditional variables to the input of his model

I The 'cAE' is comparable to our proposed CVAE in that the network might potentially learn something from f0
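A one-line sketch of this baseline's input construction (names are ours): the pitch is simply concatenated to the CC vector and a plain autoencoder reconstructs the concatenation:

import torch

def cae_input(ccs, f0):
    # Append (normalized) f0 to the 91-dim CC vector; a plain AE then
    # autoencodes this 92-dim vector, reconstructing both CCs' and f0'.
    return torch.cat([ccs, f0], dim=1)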

26/36

Reconstruction

Figure: Reconstruction MSE vs. pitch (MIDI 60-72) for AE, CVAE and cAE

I We train the cAE on the octave endpoints and evaluate performance as shown above

I Appending f0 does not improve performance; it is in fact worse than the AE

27/36

Generation

I We are interested in 'synthesizing' audio

I We see how well the network can generate an instance of a desired pitch (which the network has not been trained on)

Figure: Sampling from the network - Z → Decoder (conditioned on f0)

I We follow the procedure in [Blaauw and Bonada, 2016] - a random walk to sample points coherently from the latent space
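A sketch of this sampling idea, reusing the CVAE sketched earlier; the step size and frame count are assumptions for illustration:

import torch

@torch.no_grad()
def random_walk_generate(model, f0, n_frames=200, step=0.05, latent=32):
    # Coherent latent sampling via a Gaussian random walk:
    # small correlated steps in Z, decoded with the desired f0 conditioning.
    z = torch.zeros(1, latent)
    frames = []
    for _ in range(n_frames):
        z = z + step * torch.randn_like(z)          # small random step
        cc = model.dec(torch.cat([z, f0], dim=1))   # decode envelope CCs
        frames.append(cc)
    return torch.cat(frames, dim=0)                 # (n_frames, 91)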

28/36

Generation

I We train on instances across the entire octave sans MIDI 65, and then generate MIDI 65

I On listening, we can see that the generated audio still lacks the soft noisy sound of the violin bowing (audio examples 12, 13, 14)

I For a more realistic synthesis, we generate a violin note with vibrato (audio example 15)


29/36

Putting it all together

I We explored autoencoder frameworks in generative models for audio synthesis of instrumental tones

I Our parametric representation decouples 'timbre' and 'pitch', thus relying on the network to model the inter-dependencies

I Pitch conditioning allows us to generate the learnt spectral envelope for that pitch, thus enabling us to vary the pitch contour continuously


30/36

Putting it all together

I Our model is still far from perfect, however, and we have to improve it further

1 An immediate thing to do will be to model the residual of the input audio. This will help in making the generated audio sound more natural

2 We have not taken into account dynamics. A more complete system will involve capturing spectral envelope (timbral) dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis of larger pitch movements or melodic elements from Indian Classical music

I Our contributions

1 To the best of our knowledge, we have not come across any work using a parametric model for musical tones in the neural synthesis framework, especially exploiting the conditioning function of the CVAE

2 The TAE method has no known Python implementation, so we plan to make our code open-source to the MIR community for research


31/36

References I

[Blaauw and Bonada, 2016] Blaauw, M. and Bonada, J. (2016). Modeling and transforming speech using variational autoencoders. In Interspeech, pages 1770-1774.

[Caetano and Rodet, 2012] Caetano, M. and Rodet, X. (2012). A source-filter model for musical instrument sound transformation. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 137-140. IEEE.

[Doersch, 2016] Doersch, C. (2016). Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.

[Engel et al., 2017] Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D., and Simonyan, K. (2017). Neural audio synthesis of musical notes with WaveNet autoencoders. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1068-1077. JMLR.org.

[Esling et al., 2018] Esling, P., Bitton, A., et al. (2018). Generative timbre spaces: regularizing variational auto-encoders with perceptual metrics. arXiv preprint arXiv:1805.08501.

32/36

References II

[Griffin and Lim, 1984] Griffin, D. and Lim, J. (1984). Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236-243.

[Hinton and Salakhutdinov, 2006] Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504-507.

[Imai, 1979] Imai, S. (1979). Spectral envelope extraction by improved cepstrum. IEICE, 62:217-228.

[Kingma and Ba, 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

[Kingma and Welling, 2013] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

[Oord et al., 2016] Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.

33/36

References III

[Roche et al., 2018] Roche, F., Hueber, T., Limier, S., and Girin, L. (2018). Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models. arXiv preprint arXiv:1806.04096.

[Roebel and Rodet, 2005] Roebel, A. and Rodet, X. (2005). Efficient spectral envelope estimation and its application to pitch shifting and envelope preservation. In International Conference on Digital Audio Effects, pages 30-35, Madrid, Spain.

[Romani Picas et al., 2015] Romani Picas, O., Parra Rodriguez, H., Dabiri, D., Tokuda, H., Hariya, W., Oishi, K., and Serra, X. (2015). A real-time system for measuring sound goodness in instrumental sounds. In Audio Engineering Society Convention 138. Audio Engineering Society.

[Sarroff and Casey, 2014] Sarroff, A. M. and Casey, M. A. (2014). Musical audio synthesis using autoencoding neural nets. In ICMC.

[Serra et al., 1997] Serra, X. et al. (1997). Musical sound modeling with sinusoids plus noise. Musical Signal Processing, pages 91-122.

34/36

References IV

[Slawson, 1981] Slawson, W. (1981). The color of sound: a theoretical study in musical timbre. Music Theory Spectrum, 3:132-141.

[Sohn et al., 2015] Sohn, K., Lee, H., and Yan, X. (2015). Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483-3491.

[Wyse, 2018] Wyse, L. (2018). Real-valued parametric conditioning of an RNN for interactive sound synthesis. arXiv preprint arXiv:1805.10808.

35/36

Audio examples description I

1 Input MIDI 63 to Spectral Model

2 Spectral Model Reconstruction (trained on MIDI 63)

3 Spectral Model Reconstruction (not trained on MIDI 63)

4 Input MIDI 60 note to Parametric Model

5 Parametric Reconstruction of input note

6 Input MIDI 63 Note

7 Parametric AE reconstruction of input

8 Parametric CVAE reconstruction of input

9 Input MIDI 65 note (endpoint trained model)

10 Parametric AE reconstruction of input (endpoint trained model)

11 Parametric CVAE reconstruction of input (endpoint trained model)

12 CVAE Generated MIDI 65 Violin note

13 Similar MIDI 65 Violin note from dataset

36/36

Audio examples description II

14 CVAE Reconstruction of the MIDI 65 violin note

15 CVAE Generated MIDI 65 Violin note with vibrato


9/36

Non-Parametric Reconstruction

I The framewise magnitude spectra reconstruction procedure in [Roche et al., 2018] cannot generalize to untrained pitches

I We consider two cases - Train including and excluding MIDI 63, along with its 3 neighbouring pitches on either side

Figure: Non-parametric pipeline - Sustain segment → FFT → Encoder → Z → Decoder (AE)

I Repeat the above across all input spectral frames and invert the obtained spectrogram using Griffin-Lim [Griffin and Lim, 1984]
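For reference, a hedged sketch of this inversion step using librosa's Griffin-Lim implementation; the parameter values are illustrative assumptions, not the ones used in the talk:

import numpy as np
import librosa

def invert_spectrogram(S_hat, n_fft=1024, hop_length=256, n_iter=60):
    # S_hat: reconstructed magnitude spectrogram, shape (1 + n_fft//2, n_frames),
    # e.g. stacked from the per-frame decoder outputs.
    # Phase reconstruction with Griffin-Lim [Griffin and Lim, 1984].
    return librosa.griffinlim(S_hat, n_iter=n_iter,
                              hop_length=hop_length, n_fft=n_fft)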


10/36

Non-Parametric Reconstruction

Figure: Input MIDI 63 (audio example 1)

Figure: Including MIDI 63 (audio example 2)   Figure: Excluding MIDI 63 (audio example 3)


11/36

Parametric Model

1 Frame-wise magnitude spectrum → harmonic representation using the Harmonic plus Residual (HpR) model [Serra et al., 1997] (currently we neglect the residual)

Figure: Sustain segment → FFT → HpR → f0, harmonic Magnitudes

I Output of the HpR block ⇒ log-dB magnitudes + harmonics
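As a rough illustration of the harmonic part of such an analysis (a full HpR model with peak tracking and a residual is more involved), one could read the log magnitudes near integer multiples of f0 from a windowed frame; all parameter values below are assumptions:

import numpy as np

def harmonic_magnitudes(frame, fs, f0, n_harmonics=40, n_fft=2048):
    # Magnitude spectrum of one windowed frame
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame)), n_fft))
    mags_db = []
    for k in range(1, n_harmonics + 1):
        bin_k = int(round(k * f0 * n_fft / fs))
        if bin_k >= len(spec):
            break
        # take the strongest bin in a small neighbourhood around k*f0
        lo, hi = max(bin_k - 2, 0), min(bin_k + 3, len(spec))
        mags_db.append(20 * np.log10(np.max(spec[lo:hi]) + 1e-12))
    return np.array(mags_db)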


12/36

Parametric Model

2 log-dB magnitudes + harmonics → TAE algorithm [Roebel and Rodet, 2005, Imai, 1979]

I TAE ⇒ Iterative Cepstral Liftering

    A_0(k) = log(|X(k)|), V_0 = -∞
    while (A_i - A_0 < ∆):
        A_i(k) = max(A_{i-1}(k), V_{i-1}(k))
        V_i = FFT(lifter(A_i))

Figure: f0, Magnitudes → TAE → CCs, f0
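A minimal NumPy sketch of this iterative liftering; the stopping criterion and parameter values are assumptions for illustration:

import numpy as np

def true_amplitude_envelope(log_mag, n_ccs=91, n_iter=100, delta=1e-4):
    # log_mag: one frame's log magnitude spectrum (real, length M)
    A = np.copy(log_mag)
    V = np.full_like(A, -np.inf)
    for _ in range(n_iter):
        A = np.maximum(A, V)                      # clip the spectrum from below
        ceps = np.fft.irfft(A)                    # real cepstrum of current A_i
        ceps[n_ccs:len(ceps) - n_ccs + 1] = 0.0   # lifter: keep low quefrencies
        V = np.fft.rfft(ceps).real                # smooth envelope V_i
        if np.max(log_mag - V) < delta:           # envelope covers the peaks
            break
    return V, ceps[:n_ccs]                        # envelope and its CCs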


13/36

Parametric Model

I No open source implementation is available for the TAE, thus we implemented it following the procedure highlighted in [Roebel and Rodet, 2005, Caetano and Rodet, 2012]

I Figure shows a TAE snapshot from [Caetano and Rodet, 2012]

I Similar to the results we get (audio examples 4, 5)


14/36

Parametric Model

Figure: TAE spectral envelopes (Magnitude in dB vs. Frequency in kHz) for MIDI pitches 60, 65 and 71

I Spectral envelope shape varies across pitch

1 Dependence of the envelope on pitch [Slawson, 1981, Caetano and Rodet, 2012]

2 Variation due to the TAE algorithm

I TAE → smooth function to estimate harmonic amplitudes

15/36

Parametric Model

I Number of CCs (Cepstral Coefficients, Kcc) depends on the sampling rate (Fs) and pitch (f0):

K_cc ≤ Fs / (2 f0)

I For our network, we choose the maximum Kcc (lowest pitch) = 91 and zero pad the higher pitch Kcc to 91-dimensional vectors
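A small sketch of this truncation and zero padding step (function name is ours):

import numpy as np

def cc_vector(ccs, fs, f0, max_ccs=91):
    # Keep at most Kcc = Fs / (2 * f0) cepstral coefficients and zero pad
    # to a fixed max_ccs-dimensional vector (91, set by the lowest pitch).
    k_cc = int(fs // (2 * f0))
    trimmed = ccs[:min(k_cc, max_ccs)]
    return np.pad(trimmed, (0, max_ccs - len(trimmed)))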


936

Non-Parametric Reconstruction

I Framewise magnitude spectra reconstruction procedure in[Roche et al 2018] cannot generalize to untrained pitches

I We consider two cases - Train including and excluding MIDI 63along with its 3 neighbouring pitches on either side

Sustain

FFT

En

cod

er

Z

Dec

od

er

AE

I Repeat above across all input spectral frames and invertobtained spectrogram using Griffin-Lim [Griffin and Lim 1984]

936

Non-Parametric Reconstruction

I Framewise magnitude spectra reconstruction procedure in[Roche et al 2018] cannot generalize to untrained pitches

I We consider two cases - Train including and excluding MIDI 63along with its 3 neighbouring pitches on either side

Sustain

FFT

En

cod

er

Z

Dec

od

er

AE

I Repeat above across all input spectral frames and invertobtained spectrogram using Griffin-Lim [Griffin and Lim 1984]

1036

Non-Parametric Reconstruction

Figure Input MIDI 63 1 1

Figure Including MIDI 63 2 2 Figure Excluding MIDI 63 3 3

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton1)ocgs[i]state=false

1728

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton3)ocgs[i]state=false

1728

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton5)ocgs[i]state=false

1728

1136

Parametric Model

1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)

Sustain

FFTHpR

f0Magnitudes

I Output of HpR block =rArr log-dB magnitudes + harmonics

1136

Parametric Model

1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)

Sustain

FFTHpR

f0Magnitudes

I Output of HpR block =rArr log-dB magnitudes + harmonics

1136

Parametric Model

1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)

Sustain

FFTHpR

f0Magnitudes

I Output of HpR block =rArr log-dB magnitudes + harmonics

1236

Parametric Model

2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]

I TAE =rArr Iterative Cepstral Liftering

A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))

Vi = FFT (lifter(Ai ))

f0Magnitudes

TAE

CCs

f0

1236

Parametric Model

2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]

I TAE =rArr Iterative Cepstral Liftering

A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))

Vi = FFT (lifter(Ai ))

f0Magnitudes

TAE

CCs

f0

1236

Parametric Model

2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]

I TAE =rArr Iterative Cepstral Liftering

A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))

Vi = FFT (lifter(Ai ))

f0Magnitudes

TAE

CCs

f0

1336

Parametric Model

I No open source implementation available for the TAE thus weimplemented it following procedure highlighted in[Roebel and Rodet 2005 Caetano and Rodet 2012]

I Figure shows a TAE snap from [Caetano and Rodet 2012]

I Similar to results we get 1 4 2 5

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton6)ocgs[i]state=false

3216

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton7)ocgs[i]state=false

2256

1436

Parametric Model

0 1 2 3 4 5minus80

minus60

minus40

minus20

Frequency(kHz)Magnitude(dB)

606571

I Spectral envelope shape varies across pitch

1 Dependence of envelope on pitch[Slawson 1981 Caetano and Rodet 2012]

2 Variation due the TAE algorithm

I TAE rarr smooth function to estimate harmonic amplitudes

1536

Parametric Model

I No of CCs(Cepstral CoefficientsKcc) depends on Samplingrate(Fs)pitch(f0)

Kcc leFs2f0

I For our network we choose the maximum Kcc(lowest pitch) =91 and zero pad the high pitch Kcc to 91 dimension vectors

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable


Generative Models

Why VAE over AE

Continuous latent space from which we can sample points (and synthesize the corresponding audio)

Why CVAE over VAE

Conditioning on pitch ⇒ the network captures dependencies between the timbre and the pitch ⇒ more accurate envelope generation + pitch control


Network Architecture

Figure: CVAE block diagram: (CCs, f0) → Encoder → Z → Decoder

I Network input is CCs → the MSE represents a perceptually relevant distance, in terms of squared error between the input and reconstructed log magnitude spectral envelopes

I We train the network on frames from the train instances

I For evaluation, the MSE values reported ahead are the average reconstruction error across all the test instance frames


Network Architecture

I Main hyperparameters -

1 β - Controls the relative weighting between reconstruction and prior enforcement:

L ∝ MSE + β · KLD

Figure: CVAE reconstruction MSE vs latent space dimensionality, for β = 0.01, 0.1 and 1

I Tradeoff between both terms; choose β = 0.1
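A minimal PyTorch sketch of this loss, assuming a diagonal-Gaussian posterior with parameters mu and logvar (the function name cvae_loss is hypothetical):

import torch
import torch.nn.functional as F

def cvae_loss(x_hat, x, mu, logvar, beta=0.1):
    # L ∝ MSE + β·KLD: reconstruction error on the CC frames plus the β-weighted
    # KL divergence of N(mu, exp(logvar)) from the unit-Gaussian prior
    mse = F.mse_loss(x_hat, x)
    kld = -0.5 * torch.mean(1.0 + logvar - mu.pow(2) - logvar.exp())
    return mse + beta * kld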


Network Architecture

I Main hyperparameters -

2 Dimensionality of the latent space - governs the network's reconstruction ability

Figure: Reconstruction MSE vs latent space dimensionality, CVAE (β = 0.1) vs AE

I Steep fall initially, flatter later; choose dimensionality = 32


Network Architecture

I Network size: [91, 91, 32, 91, 91]

I 91 is the dimension of the input CCs, 32 is the latent space dimensionality

I Linear Fully Connected layers + Leaky ReLU activations

I ADAM [Kingma and Ba 2014] with initial lr = 10^-3

I Training for 2000 epochs with batch size 512
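A minimal PyTorch sketch of a CVAE with this layer sizing is given below; the way the f0 condition is injected (concatenated to both the encoder and decoder inputs), the f0 normalization and the dummy data are assumptions for illustration, not the exact training code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CVAE(nn.Module):
    # Encoder 91 -> 91 -> (mu, logvar) of size 32, decoder 32 -> 91 -> 91,
    # with the (assumed 1-dim, normalized) f0 condition appended to both inputs.
    def __init__(self, x_dim=91, h_dim=91, z_dim=32, c_dim=1):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim + c_dim, h_dim), nn.LeakyReLU())
        self.mu = nn.Linear(h_dim, z_dim)
        self.logvar = nn.Linear(h_dim, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim + c_dim, h_dim), nn.LeakyReLU(),
                                 nn.Linear(h_dim, x_dim))

    def forward(self, x, c):
        h = self.enc(torch.cat([x, c], dim=-1))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
        return self.dec(torch.cat([z, c], dim=-1)), mu, logvar

model = CVAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(512, 91)      # one batch of 91-dim CC frames (dummy data)
c = torch.rand(512, 1)        # normalized f0 condition per frame (dummy data)
opt.zero_grad()
x_hat, mu, logvar = model(x, c)
kld = -0.5 * torch.mean(1.0 + logvar - mu.pow(2) - logvar.exp())
loss = F.mse_loss(x_hat, x) + 0.1 * kld   # same form as the cvae_loss sketched earlier
loss.backward()
opt.step()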


Experiments

I Two kinds of experiments to demonstrate the network's capabilities:

1 Reconstruction - Omit pitch instances during training and see how well the model reconstructs notes of the omitted target pitch

2 Generation - How well the model 'synthesizes' note instances with new, unseen pitches


Reconstruction

I Two training contexts -

1 T (×) is the target, X marks training instances:

   MIDI:  T−3  T−2  T−1   T   T+1  T+2  T+3
   Kept:   X    X    X    ×    X    X    X

2 Octave endpoints:

   MIDI:  60  61  62  63  64  65  66  67  68  69  70  71
   Kept:   X   ×   ×   ×   ×   ×   ×   ×   ×   ×   ×   X

I In each of the above cases, we compute the MSE as the frame-wise spectral envelope match across all frames of all the target instances
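A small sketch of this evaluation metric (names and shapes are assumptions; the frames are the 91-dim CC vectors of the held-out target notes):

import numpy as np

def eval_mse(target_frames, recon_frames):
    # Average frame-wise squared error between original and reconstructed
    # spectral-envelope (CC) frames, pooled over all target-instance frames.
    t = np.asarray(target_frames, dtype=float)   # shape (n_frames, 91)
    r = np.asarray(recon_frames, dtype=float)
    return float(np.mean((t - r) ** 2))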


Reconstruction

Figure: Reconstruction MSE vs target pitch (MIDI 60-72) for AE and CVAE

I Both AE and CVAE reasonably reconstruct the target pitch (audio examples 6, 7, 8)

I CVAE produces better reconstruction, especially when the target pitch is far from the pitches available in the training data (plot above; audio examples 9, 10, 11)

I Conditioning helps to capture the pitch dependency of the spectral envelope more accurately


Reconstruction

I To emulate the effect of pitch conditioning with an AE, we train the AE by appending the pitch to the input CCs and reconstructing this appended input, as shown below

Figure: 'Conditional' AE (cAE): (CCs, f0) → cAE → (CCs', f0')

I [Wyse 2018] followed a similar approach of appending the conditional variables to the input of his model

I The 'cAE' is comparable to our proposed CVAE in that the network might potentially learn something from f0
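A tiny sketch of this input/target construction (the names are hypothetical, and the pitch value is assumed to be normalized before appending):

import numpy as np

def cae_pair(ccs, f0):
    # Append the pitch to the 91-dim CC vector; the cAE is trained to reconstruct
    # this 92-dim appended vector, yielding (CCs', f0') at the output.
    x = np.concatenate([np.asarray(ccs, dtype=float), [float(f0)]])
    return x, x   # the autoencoder target is the input itself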


Reconstruction

Figure: Reconstruction MSE vs target pitch (MIDI 60-72) for AE, CVAE and cAE

I We train the cAE on the octave endpoints and evaluate performance as shown above

I Appending f0 does not improve performance; it is in fact worse than the AE


Generation

I We are interested in 'synthesizing' audio

I We see how well the network can generate an instance of a desired pitch (which the network has not been trained on)

Figure: Sampling from the network: Z → Decoder, conditioned on f0

I We follow the procedure in [Blaauw and Bonada 2016] - a random walk to sample points coherently from the latent space
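A minimal sketch of such a latent random walk, assuming a trained decoder like the CVAE sketched earlier; the step size, frame count and f0 normalization are illustrative assumptions.

import torch

def random_walk_generate(decode, f0, n_frames=200, z_dim=32, step=0.05):
    # Take small Gaussian steps in the latent space and decode each point,
    # conditioned on the desired pitch, to get a coherent sequence of CC frames.
    z = torch.zeros(1, z_dim)                      # start at the prior mean
    c = torch.full((1, 1), float(f0))
    frames = []
    with torch.no_grad():
        for _ in range(n_frames):
            z = z + step * torch.randn(1, z_dim)
            frames.append(decode(z, c).squeeze(0))
    return torch.stack(frames)                     # (n_frames, 91) envelope frames to synthesize from

# e.g., with the CVAE sketch above:
# frames = random_walk_generate(lambda z, c: model.dec(torch.cat([z, c], dim=-1)), f0=0.5)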


Generation

I We train on instances across the entire octave sans MIDI 65, and then generate MIDI 65

I On listening, we find that the generated audio still lacks the soft, noisy sound of the violin bowing (audio examples 12, 13, 14)

I For a more realistic synthesis, we generate a violin note with vibrato (audio example 15)


Putting it all together

I We explored autoencoder frameworks in generative models for audio synthesis of instrumental tones

I Our parametric representation decouples 'timbre' and 'pitch', thus relying on the network to model their inter-dependencies

I Pitch conditioning allows us to generate the learnt spectral envelope for that pitch, thus enabling us to vary the pitch contour continuously


Putting it all together

I Our model is still far from perfect, however, and we have to improve it further:

1 An immediate step will be to model the residual of the input audio. This will help in making the generated audio sound more natural

2 We have not taken dynamics into account. A more complete system will involve capturing spectral envelope (timbral) dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis of larger pitch movements or melodic elements from Indian Classical music

I Our contributions:

1 To the best of our knowledge, we have not come across any work using a parametric model for musical tones in the neural synthesis framework, especially exploiting the conditioning function of the CVAE

2 The TAE method has no known Python implementation, so we plan to make our code open-source to the MIR community for research


References

[Blaauw and Bonada 2016] Blaauw, M. and Bonada, J. (2016). Modeling and transforming speech using variational autoencoders. In Interspeech, pages 1770-1774.

[Caetano and Rodet 2012] Caetano, M. and Rodet, X. (2012). A source-filter model for musical instrument sound transformation. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 137-140. IEEE.

[Doersch 2016] Doersch, C. (2016). Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.

[Engel et al 2017] Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D., and Simonyan, K. (2017). Neural audio synthesis of musical notes with WaveNet autoencoders. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1068-1077. JMLR.org.

[Esling et al 2018] Esling, P., Bitton, A., et al. (2018). Generative timbre spaces: regularizing variational auto-encoders with perceptual metrics. arXiv preprint arXiv:1805.08501.

[Griffin and Lim 1984] Griffin, D. and Lim, J. (1984). Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236-243.

[Hinton and Salakhutdinov 2006] Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504-507.

[IMAI 1979] Imai, S. (1979). Spectral envelope extraction by improved cepstrum. IEICE, 62:217-228.

[Kingma and Ba 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

[Kingma and Welling 2013] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

[Oord et al 2016] Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.

[Roche et al 2018] Roche, F., Hueber, T., Limier, S., and Girin, L. (2018). Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models. arXiv preprint arXiv:1806.04096.

[Roebel and Rodet 2005] Roebel, A. and Rodet, X. (2005). Efficient spectral envelope estimation and its application to pitch shifting and envelope preservation. In International Conference on Digital Audio Effects, pages 30-35, Madrid, Spain.

[Romani Picas et al 2015] Romani Picas, O., Parra Rodriguez, H., Dabiri, D., Tokuda, H., Hariya, W., Oishi, K., and Serra, X. (2015). A real-time system for measuring sound goodness in instrumental sounds. In Audio Engineering Society Convention 138. Audio Engineering Society.

[Sarroff and Casey 2014] Sarroff, A. M. and Casey, M. A. (2014). Musical audio synthesis using autoencoding neural nets. In ICMC.

[Serra et al 1997] Serra, X. et al. (1997). Musical sound modeling with sinusoids plus noise. Musical Signal Processing, pages 91-122.

[Slawson 1981] Slawson, W. (1981). The color of sound: a theoretical study in musical timbre. Music Theory Spectrum, 3:132-141.

[Sohn et al 2015] Sohn, K., Lee, H., and Yan, X. (2015). Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483-3491.

[Wyse 2018] Wyse, L. (2018). Real-valued parametric conditioning of an RNN for interactive sound synthesis. arXiv preprint arXiv:1805.10808.

Audio examples description I

1 Input MIDI 63 to Spectral Model

2 Spectral Model Reconstruction (trained on MIDI 63)

3 Spectral Model Reconstruction (not trained on MIDI 63)

4 Input MIDI 60 note to Parametric Model

5 Parametric Reconstruction of input note

6 Input MIDI 63 Note

7 Parametric AE reconstruction of input

8 Parametric CVAE reconstruction of input

9 Input MIDI 65 note (endpoint-trained model)

10 Parametric AE reconstruction of input (endpoint-trained model)

11 Parametric CVAE reconstruction of input (endpoint-trained model)

12 CVAE Generated MIDI 65 Violin note

13 Similar MIDI 65 Violin note from dataset

Audio examples description II

14 CVAE Reconstruction of the MIDI 65 violin note

15 CVAE Generated MIDI 65 Violin note with vibrato


936

Non-Parametric Reconstruction

I Framewise magnitude spectra reconstruction procedure in[Roche et al 2018] cannot generalize to untrained pitches

I We consider two cases - Train including and excluding MIDI 63along with its 3 neighbouring pitches on either side

Sustain

FFT

En

cod

er

Z

Dec

od

er

AE

I Repeat above across all input spectral frames and invertobtained spectrogram using Griffin-Lim [Griffin and Lim 1984]

1036

Non-Parametric Reconstruction

Figure Input MIDI 63 1 1

Figure Including MIDI 63 2 2 Figure Excluding MIDI 63 3 3

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton1)ocgs[i]state=false

1728

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton3)ocgs[i]state=false

1728

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton5)ocgs[i]state=false

1728

1136

Parametric Model

1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)

Sustain

FFTHpR

f0Magnitudes

I Output of HpR block =rArr log-dB magnitudes + harmonics

1136

Parametric Model

1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)

Sustain

FFTHpR

f0Magnitudes

I Output of HpR block =rArr log-dB magnitudes + harmonics

1136

Parametric Model

1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)

Sustain

FFTHpR

f0Magnitudes

I Output of HpR block =rArr log-dB magnitudes + harmonics

1236

Parametric Model

2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]

I TAE =rArr Iterative Cepstral Liftering

A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))

Vi = FFT (lifter(Ai ))

f0Magnitudes

TAE

CCs

f0

1236

Parametric Model

2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]

I TAE =rArr Iterative Cepstral Liftering

A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))

Vi = FFT (lifter(Ai ))

f0Magnitudes

TAE

CCs

f0

1236

Parametric Model

2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]

I TAE =rArr Iterative Cepstral Liftering

A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))

Vi = FFT (lifter(Ai ))

f0Magnitudes

TAE

CCs

f0

1336

Parametric Model

I No open source implementation available for the TAE thus weimplemented it following procedure highlighted in[Roebel and Rodet 2005 Caetano and Rodet 2012]

I Figure shows a TAE snap from [Caetano and Rodet 2012]

I Similar to results we get 1 4 2 5

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton6)ocgs[i]state=false

3216

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton7)ocgs[i]state=false

2256

1436

Parametric Model

0 1 2 3 4 5minus80

minus60

minus40

minus20

Frequency(kHz)Magnitude(dB)

606571

I Spectral envelope shape varies across pitch

1 Dependence of envelope on pitch[Slawson 1981 Caetano and Rodet 2012]

2 Variation due the TAE algorithm

I TAE rarr smooth function to estimate harmonic amplitudes

1536

Parametric Model

I No of CCs(Cepstral CoefficientsKcc) depends on Samplingrate(Fs)pitch(f0)

Kcc leFs2f0

I For our network we choose the maximum Kcc(lowest pitch) =91 and zero pad the high pitch Kcc to 91 dimension vectors

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1736

Generative Models

Why VAE over AE

Continuous latent space from which we can sample points(andsynthesize the corresponding audio)

Why CVAE over VAE

Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control

1736

Generative Models

Why VAE over AE

Continuous latent space from which we can sample points(andsynthesize the corresponding audio)

Why CVAE over VAE

Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control

1736

Generative Models

Why VAE over AE

Continuous latent space from which we can sample points(andsynthesize the corresponding audio)

Why CVAE over VAE

Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control

1736

Generative Models

Why VAE over AE

Continuous latent space from which we can sample points(andsynthesize the corresponding audio)

Why CVAE over VAE

Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control

1836

Network Architecture

CCs

f0

En

cod

er

Z

Dec

od

er

CVAE

I Network input is CCs rarr MSE represents perceptually relevantdistance in terms of squared error between the input andreconstructed log magnitude spectral envelopes

I We train the network on frames from the train instances

I For evaluation MSE values calculated ahead is the averagereconstruction error across all the test instance frames

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

L prop MSE + βKLD

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

2036

Network Architecture

I Main hyperparameters -

2 Dimensionality of latent space - networks reconstruction ability

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

CVAEAE

Figure CVAE(β = 01) vs AE

I Steep fall initially flatter later Choose dimensionality = 32

2036

Network Architecture

I Main hyperparameters -

2 Dimensionality of latent space - networks reconstruction ability

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

CVAEAE

Figure CVAE(β = 01) vs AE

I Steep fall initially flatter later Choose dimensionality = 32

2036

Network Architecture

I Main hyperparameters -

2 Dimensionality of latent space - networks reconstruction ability

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

CVAEAE

Figure CVAE(β = 01) vs AE

I Steep fall initially flatter later Choose dimensionality = 32

2136

Network Architecture

I Network size [91 91 32 91 91]I 91 is the dimension of input CCs 32 is latent space

dimensionality

I Linear Fully Connected Layers + Leaky ReLU activations

I ADAM [Kingma and Ba 2014] with initial lr = 10minus3

I Training for 2000 epochs with batch size 512

2236

Experiments

I Two kinds of experiments to demonstrate networks capabilities

1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch

2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches

2236

Experiments

I Two kinds of experiments to demonstrate networks capabilities

1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch

2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches

2236

Experiments

I Two kinds of experiments to demonstrate networks capabilities

1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch

2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2436

Reconstruction

60 62 64 66 68 70 72

1

2

3

4

5middot10minus2

Pitch

MSE

AECVAE

I Both AE and CVAE reasonably reconstruct the target pitch1 6 2 7 3 8

I CVAE produces better reconstruction especially when thetarget pitch is far from the pitches available in the trainingdata(plot above) 4 9 5 10 6 11

I Conditioning helps to capture the pitch dependency of thespectral envelope more accurately

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton8)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton9)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton10)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton11)ocgs[i]state=false

1536

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton12)ocgs[i]state=false

1536

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton13)ocgs[i]state=false

1536

2536

Reconstruction

I To emulate the effect of pitch conditioning with an AE wetrain the AE by appending the pitch to the input CCs andreconstructing this appended input as shown below

f0

CCscAE

fprime

0

CCsprime

Figure lsquoConditionalrsquo AE(cAE)

I [Wyse 2018] followed a similar approach of appending theconditional variables to the input of his model

I the lsquocAErsquo is comparable to our proposed CVAE in that thenetwork might potentially learn something from f0

2636

Reconstruction

60 62 64 66 68 70 72

1

2

3

4

5middot10minus2

Pitch

MSE

AECVAEcAE

I We train the cAE on the octave endpoints and evaluateperformance as shown above

I Appending f0 does not improve performance It is infact worsethan AE

2736

Generation

I We are interested in lsquosynthesizingrsquo audio

I We see how well the network can generate an instance of adesired pitch(which the network has not been trained on)

Z

Dec

od

er

f0

Figure Sampling from the Network

I We follow the procedure in [Blaauw and Bonada 2016] - arandom walk to sample points coherently from the latent space

2836

Generation

I We train on instances across the entire octave sans MIDI 65and then generate MIDI 65

I On listening we can see that the generated audio still lacks thesoft noisy sound of the violin bowing

1 12 2 13 3 14

I For a more realistic synthesis we generate a violin note withvibrato 4 15

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton14)ocgs[i]state=false

2184

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton15)ocgs[i]state=false

2256

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton16)ocgs[i]state=false

2184

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton17)ocgs[i]state=false

2184

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3136

References I

[Blaauw and Bonada 2016] Blaauw M and Bonada J (2016)Modeling and transforming speech using variational autoencodersIn Interspeech pages 1770ndash1774

[Caetano and Rodet 2012] Caetano M and Rodet X (2012)A source-filter model for musical instrument sound transformationIn 2012 IEEE International Conference on Acoustics Speech and Signal Processing(ICASSP) pages 137ndash140 IEEE

[Doersch 2016] Doersch C (2016)Tutorial on variational autoencodersarXiv preprint arXiv160605908

[Engel et al 2017] Engel J Resnick C Roberts A Dieleman S Norouzi MEck D and Simonyan K (2017)Neural audio synthesis of musical notes with wavenet autoencodersIn Proceedings of the 34th International Conference on Machine Learning-Volume70 pages 1068ndash1077 JMLR org

[Esling et al 2018] Esling P Bitton A et al (2018)Generative timbre spaces regularizing variational auto-encoders with perceptualmetricsarXiv preprint arXiv180508501

3236

References II

[Griffin and Lim 1984] Griffin D and Lim J (1984)Signal estimation from modified short-time fourier transformIEEE Transactions on Acoustics Speech and Signal Processing 32(2)236ndash243

[Hinton and Salakhutdinov 2006] Hinton G E and Salakhutdinov R R (2006)Reducing the dimensionality of data with neural networksscience 313(5786)504ndash507

[IMAI 1979] IMAI S (1979)Spectral envelope extraction by improved cepstrumIEICE 62217ndash228

[Kingma and Ba 2014] Kingma D P and Ba J (2014)Adam A method for stochastic optimizationarXiv preprint arXiv14126980

[Kingma and Welling 2013] Kingma D P and Welling M (2013)Auto-encoding variational bayesarXiv preprint arXiv13126114

[Oord et al 2016] Oord A v d Dieleman S Zen H Simonyan K Vinyals OGraves A Kalchbrenner N Senior A and Kavukcuoglu K (2016)Wavenet A generative model for raw audioarXiv preprint arXiv160903499

3336

References III

[Roche et al 2018] Roche F Hueber T Limier S and Girin L (2018)Autoencoders for music sound modeling a comparison of linear shallow deeprecurrent and variational modelsarXiv preprint arXiv180604096

[Roebel and Rodet 2005] Roebel A and Rodet X (2005)Efficient Spectral Envelope Estimation and its application to pitch shifting andenvelope preservationIn International Conference on Digital Audio Effects pages 30ndash35 Madrid Spaincote interne IRCAM Roebel05b

[Romani Picas et al 2015] Romani Picas O Parra Rodriguez H Dabiri DTokuda H Hariya W Oishi K and Serra X (2015)A real-time system for measuring sound goodness in instrumental soundsIn Audio Engineering Society Convention 138 Audio Engineering Society

[Sarroff and Casey 2014] Sarroff A M and Casey M A (2014)Musical audio synthesis using autoencoding neural netsIn ICMC

[Serra et al 1997] Serra X et al (1997)Musical sound modeling with sinusoids plus noiseMusical signal processing pages 91ndash122

3436

References IV

[Slawson 1981] Slawson W (1981)The color of sound a theoretical study in musical timbreMusic Theory Spectrum 3132ndash141

[Sohn et al 2015] Sohn K Lee H and Yan X (2015)Learning structured output representation using deep conditional generative models

In Advances in neural information processing systems pages 3483ndash3491

[Wyse 2018] Wyse L (2018)Real-valued parametric conditioning of an rnn for interactive sound synthesisarXiv preprint arXiv180510808

3536

Audio examples description I

1 Input MIDI 63 to Spectral Model

2 Spectral Model Reconstruction(trained on MIDI63)

3 Spectral Model Reconstruction(not trained on MIDI63)

4 Input MIDI 60 note to Parametric Model

5 Parametric Reconstruction of input note

6 Input MIDI 63 Note

7 Parametric AE reconstruction of input

8 Parametric CVAE reconstruction of input

9 Input MIDI 65 note(endpoint trained model)

10 Parametric AE reconstruction of input(endpoint trained model)

11 Parametric CVAE reconstruction of input(endpoint trained model)

12 CVAE Generated MIDI 65 Violin note

13 Similar MIDI 65 Violin note from dataset

3636

Audio examples description II

14 CVAE Reconstruction of the MIDI 65 violin note

15 CVAE Generated MIDI 65 Violin note with vibrato

  • Notes
      1. fdrm17
      2. fdrm16
      3. fdrm15
      4. fdrm14
      5. fdrm13
      6. fdrm12
      7. fdrm11
      8. fdrm10
      9. fdrm9
      10. fdrm8
      11. fdrm7
      12. fdrm6
      13. fdrm5
      14. fdrm3
      15. fdrm1
Page 32: A Variational Parametric Model for Audio Synthesiskrishnasubramani/data/ddp_stage… · analysis-synthesis based reconstruction which fails to model temporal evolution I[Engel et

1036

Non-Parametric Reconstruction

Figure Input MIDI 63 1 1

Figure Including MIDI 63 2 2 Figure Excluding MIDI 63 3 3

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton1)ocgs[i]state=false

1728

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton3)ocgs[i]state=false

1728

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton5)ocgs[i]state=false

1728

1136

Parametric Model

1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)

Sustain

FFTHpR

f0Magnitudes

I Output of HpR block =rArr log-dB magnitudes + harmonics

1136

Parametric Model

1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)

Sustain

FFTHpR

f0Magnitudes

I Output of HpR block =rArr log-dB magnitudes + harmonics

1136

Parametric Model

1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)

Sustain

FFTHpR

f0Magnitudes

I Output of HpR block =rArr log-dB magnitudes + harmonics

1236

Parametric Model

2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]

I TAE =rArr Iterative Cepstral Liftering

A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))

Vi = FFT (lifter(Ai ))

f0Magnitudes

TAE

CCs

f0

1236

Parametric Model

2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]

I TAE =rArr Iterative Cepstral Liftering

A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))

Vi = FFT (lifter(Ai ))

f0Magnitudes

TAE

CCs

f0

1236

Parametric Model

2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]

I TAE =rArr Iterative Cepstral Liftering

A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))

Vi = FFT (lifter(Ai ))

f0Magnitudes

TAE

CCs

f0

1336

Parametric Model

I No open source implementation available for the TAE thus weimplemented it following procedure highlighted in[Roebel and Rodet 2005 Caetano and Rodet 2012]

I Figure shows a TAE snap from [Caetano and Rodet 2012]

I Similar to results we get 1 4 2 5

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton6)ocgs[i]state=false

3216

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton7)ocgs[i]state=false

2256

1436

Parametric Model

0 1 2 3 4 5minus80

minus60

minus40

minus20

Frequency(kHz)Magnitude(dB)

606571

I Spectral envelope shape varies across pitch

1 Dependence of envelope on pitch[Slawson 1981 Caetano and Rodet 2012]

2 Variation due the TAE algorithm

I TAE rarr smooth function to estimate harmonic amplitudes

1536

Parametric Model

I No of CCs(Cepstral CoefficientsKcc) depends on Samplingrate(Fs)pitch(f0)

Kcc leFs2f0

I For our network we choose the maximum Kcc(lowest pitch) =91 and zero pad the high pitch Kcc to 91 dimension vectors

16/36

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimize the MSE (Mean Squared Error) between the input and the network reconstruction

• Simple to train and gives good reconstruction (since they minimize MSE)

• Not truly a generative model, as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] - Inspired by Variational Inference; enforce a prior on the latent space

• Truly generative, as we can 'generate' new data by sampling from the prior

I Conditional Variational Autoencoders [Doersch 2016, Sohn et al 2015] - Same principle as a VAE, but learns the conditional distribution over an additional conditioning variable

17/36

Generative Models

Why VAE over AE?

Continuous latent space from which we can sample points (and synthesize the corresponding audio)

Why CVAE over VAE?

Conditioning on pitch ⇒ the network captures dependencies between timbre and pitch ⇒ more accurate envelope generation + pitch control

18/36

Network Architecture

[Figure: CVAE block diagram - CCs and f0 → Encoder → Z → Decoder (with f0) → reconstructed CCs]

I Network input is CCs → the MSE represents a perceptually relevant distance: the squared error between the input and reconstructed log-magnitude spectral envelopes

I We train the network on frames from the training instances

I For evaluation, the MSE values reported ahead are the average reconstruction error across all test-instance frames

19/36

Network Architecture

I Main hyperparameters -

1. β - Controls the relative weighting between reconstruction and prior enforcement:

    L ∝ MSE + β · KLD

[Figure: CVAE, varying β - MSE (2-8 ·10⁻⁵) vs. latent space dimensionality (0-70) for β = 0.01, 0.1, 1]

I Tradeoff between both terms; we choose β = 0.1
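A minimal PyTorch-style sketch of this loss, assuming a standard-normal prior on the latent code and an encoder that outputs the mean and log-variance of q(z | x, f0); the function and argument names are illustrative:

    import torch
    import torch.nn.functional as F

    def cvae_loss(x, x_hat, mu, logvar, beta=0.1):
        """L ∝ MSE + β·KLD for a CVAE with prior N(0, I) on z."""
        mse = F.mse_loss(x_hat, x, reduction="mean")                     # reconstruction term
        kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())   # KL(q(z|x,f0) || N(0,I))
        return mse + beta * kld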

20/36

Network Architecture

I Main hyperparameters -

2. Dimensionality of the latent space - the network's reconstruction ability

[Figure: CVAE (β = 0.1) vs. AE - MSE (2-8 ·10⁻⁵) vs. latent space dimensionality (0-70)]

I Steep fall initially, flatter later; we choose dimensionality = 32

21/36

Network Architecture

I Network size: [91, 91, 32, 91, 91]

I 91 is the dimension of the input CCs; 32 is the latent space dimensionality

I Linear fully connected layers + Leaky ReLU activations

I ADAM [Kingma and Ba 2014] with initial lr = 10⁻³

I Training for 2000 epochs with batch size 512
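A minimal PyTorch sketch consistent with these sizes is shown below. How exactly f0 is injected (here it is appended, as a single value, to both the encoder and decoder inputs) is an assumption for illustration, not a statement of the exact implementation.

    import torch
    import torch.nn as nn

    class CVAE(nn.Module):
        """Sketch of a [91, 91, 32, 91, 91] CVAE over CC vectors, conditioned on f0."""

        def __init__(self, in_dim=91, hid_dim=91, z_dim=32, cond_dim=1):
            super().__init__()
            self.enc = nn.Sequential(nn.Linear(in_dim + cond_dim, hid_dim), nn.LeakyReLU())
            self.mu = nn.Linear(hid_dim, z_dim)
            self.logvar = nn.Linear(hid_dim, z_dim)
            self.dec = nn.Sequential(
                nn.Linear(z_dim + cond_dim, hid_dim), nn.LeakyReLU(),
                nn.Linear(hid_dim, in_dim),
            )

        def forward(self, x, f0):
            h = self.enc(torch.cat([x, f0], dim=-1))
            mu, logvar = self.mu(h), self.logvar(h)
            z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterization trick
            x_hat = self.dec(torch.cat([z, f0], dim=-1))
            return x_hat, mu, logvar

    model = CVAE()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)   # ADAM with initial lr = 10^-3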

22/36

Experiments

I Two kinds of experiments to demonstrate the network's capabilities:

1. Reconstruction - Omit pitch instances during training and see how well the model reconstructs notes of the omitted target pitch

2. Generation - How well the model 'synthesizes' note instances with new, unseen pitches

23/36

Reconstruction

I Two training contexts (✓ = kept in training, ✗ = omitted) -

1. T is the target; its neighbours are training instances:

    MIDI | T−3 | T−2 | T−1 |  T  | T+1 | T+2 | T+3
    Kept |  ✓  |  ✓  |  ✓  |  ✗  |  ✓  |  ✓  |  ✓

2. Octave endpoints:

    MIDI | 60 | 61 | 62 | 63 | 64 | 65 | 66 | 67 | 68 | 69 | 70 | 71
    Kept |  ✓ |  ✗ |  ✗ |  ✗ |  ✗ |  ✗ |  ✗ |  ✗ |  ✗ |  ✗ |  ✗ |  ✓

I In each of the above cases, we compute the MSE as the frame-wise spectral envelope match across all frames of all the target instances

24/36

Reconstruction

[Figure: MSE (1-5 ·10⁻²) vs. target pitch (MIDI 60-72) for AE and CVAE]

I Both AE and CVAE reasonably reconstruct the target pitch (audio examples 6, 7, 8)

I The CVAE produces better reconstruction, especially when the target pitch is far from the pitches available in the training data (plot above; audio examples 9, 10, 11)

I Conditioning helps to capture the pitch dependency of the spectral envelope more accurately


25/36

Reconstruction

I To emulate the effect of pitch conditioning with an AE, we train the AE by appending the pitch to the input CCs and reconstructing this appended input, as shown below

[Figure: 'Conditional' AE (cAE) - (f0, CCs) → cAE → (f0′, CCs′)]

I [Wyse 2018] followed a similar approach of appending the conditional variables to the input of his model

I The 'cAE' is comparable to our proposed CVAE in that the network might potentially learn something from f0
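A minimal sketch of this baseline, reusing the layer sizes from the CVAE sketch above; appending the pitch simply grows the autoencoder's input and output by one dimension, and training minimizes the MSE over the appended vector. Names and sizes are illustrative.

    import torch
    import torch.nn as nn

    class ConditionalAE(nn.Module):
        """'cAE' baseline: a plain autoencoder that reconstructs [CCs, f0]."""

        def __init__(self, cc_dim=91, hid_dim=91, z_dim=32):
            super().__init__()
            in_dim = cc_dim + 1                      # CC vector with the pitch appended
            self.enc = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.LeakyReLU(),
                                     nn.Linear(hid_dim, z_dim))
            self.dec = nn.Sequential(nn.Linear(z_dim, hid_dim), nn.LeakyReLU(),
                                     nn.Linear(hid_dim, in_dim))

        def forward(self, ccs, f0):
            x = torch.cat([ccs, f0], dim=-1)         # append pitch to the input
            x_hat = self.dec(self.enc(x))            # reconstruct the appended input
            return x_hat, x                          # train with MSE(x_hat, x)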

26/36

Reconstruction

[Figure: MSE (1-5 ·10⁻²) vs. target pitch (MIDI 60-72) for AE, CVAE and cAE]

I We train the cAE on the octave endpoints and evaluate performance as shown above

I Appending f0 does not improve performance; it is in fact worse than the AE

27/36

Generation

I We are interested in 'synthesizing' audio

I We see how well the network can generate an instance of a desired pitch (which the network has not been trained on)

[Figure: Sampling from the network - Z → Decoder, conditioned on f0]

I We follow the procedure in [Blaauw and Bonada 2016] - a random walk to sample points coherently from the latent space
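One way such a random walk can be realized is sketched below, assuming the trained decoder from the CVAE sketch above; the step size controls how quickly the sampled envelope drifts from frame to frame and is an illustrative parameter, not a value taken from [Blaauw and Bonada 2016].

    import torch

    @torch.no_grad()
    def generate_note(decoder, f0, n_frames, z_dim=32, step=0.05):
        """Generate frame-wise CC vectors by walking coherently through latent space.

        decoder : trained CVAE decoder mapping [z, f0] -> CC vector
        f0      : conditioning pitch, tensor of shape (1, 1)
        """
        z = torch.randn(1, z_dim)                  # start from a draw of the prior
        frames = []
        for _ in range(n_frames):
            z = z + step * torch.randn(1, z_dim)   # small steps keep consecutive frames coherent
            frames.append(decoder(torch.cat([z, f0], dim=-1)))
        return torch.cat(frames, dim=0)            # (n_frames, 91) cepstral coefficients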

28/36

Generation

I We train on instances across the entire octave sans MIDI 65, and then generate MIDI 65

I On listening, we can hear that the generated audio still lacks the soft, noisy sound of the violin bowing (audio examples 12, 13, 14)

I For a more realistic synthesis, we generate a violin note with vibrato (audio example 15)


29/36

Putting it all together

I We explored autoencoder frameworks as generative models for audio synthesis of instrumental tones

I Our parametric representation decouples 'timbre' and 'pitch', relying on the network to model their inter-dependencies

I Pitch conditioning allows us to generate the learnt spectral envelope for a given pitch, thus enabling us to vary the pitch contour continuously

30/36

Putting it all together

I Our model is still far from perfect, however, and we have to improve it further:

1. An immediate next step is to model the residual of the input audio; this will help make the generated audio sound more natural

2. We have not taken dynamics into account; a more complete system will capture spectral envelope (timbral) dependencies on both pitch and loudness dynamics

3. Conducting more formal listening tests involving synthesis of larger pitch movements or melodic elements from Indian Classical music

I Our contributions:

1. To the best of our knowledge, we have not come across any work using a parametric model for musical tones in the neural synthesis framework, especially one exploiting the conditioning function of the CVAE

2. The TAE method has no known Python implementation, so we plan to make our code open-source to the MIR community for research

31/36

References I

[Blaauw and Bonada 2016] Blaauw, M. and Bonada, J. (2016). Modeling and transforming speech using variational autoencoders. In Interspeech, pages 1770–1774.

[Caetano and Rodet 2012] Caetano, M. and Rodet, X. (2012). A source-filter model for musical instrument sound transformation. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 137–140. IEEE.

[Doersch 2016] Doersch, C. (2016). Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.

[Engel et al 2017] Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D., and Simonyan, K. (2017). Neural audio synthesis of musical notes with wavenet autoencoders. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1068–1077. JMLR.org.

[Esling et al 2018] Esling, P., Bitton, A., et al. (2018). Generative timbre spaces: regularizing variational auto-encoders with perceptual metrics. arXiv preprint arXiv:1805.08501.

32/36

References II

[Griffin and Lim 1984] Griffin, D. and Lim, J. (1984). Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236–243.

[Hinton and Salakhutdinov 2006] Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507.

[IMAI 1979] Imai, S. (1979). Spectral envelope extraction by improved cepstrum. IEICE, 62:217–228.

[Kingma and Ba 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

[Kingma and Welling 2013] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

[Oord et al 2016] Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. (2016). Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.

33/36

References III

[Roche et al 2018] Roche, F., Hueber, T., Limier, S., and Girin, L. (2018). Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models. arXiv preprint arXiv:1806.04096.

[Roebel and Rodet 2005] Roebel, A. and Rodet, X. (2005). Efficient spectral envelope estimation and its application to pitch shifting and envelope preservation. In International Conference on Digital Audio Effects, pages 30–35, Madrid, Spain.

[Romani Picas et al 2015] Romani Picas, O., Parra Rodriguez, H., Dabiri, D., Tokuda, H., Hariya, W., Oishi, K., and Serra, X. (2015). A real-time system for measuring sound goodness in instrumental sounds. In Audio Engineering Society Convention 138. Audio Engineering Society.

[Sarroff and Casey 2014] Sarroff, A. M. and Casey, M. A. (2014). Musical audio synthesis using autoencoding neural nets. In ICMC.

[Serra et al 1997] Serra, X. et al. (1997). Musical sound modeling with sinusoids plus noise. Musical Signal Processing, pages 91–122.

34/36

References IV

[Slawson 1981] Slawson, W. (1981). The color of sound: a theoretical study in musical timbre. Music Theory Spectrum, 3:132–141.

[Sohn et al 2015] Sohn, K., Lee, H., and Yan, X. (2015). Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483–3491.

[Wyse 2018] Wyse, L. (2018). Real-valued parametric conditioning of an RNN for interactive sound synthesis. arXiv preprint arXiv:1805.10808.

35/36

Audio examples description I

1. Input MIDI 63 to Spectral Model
2. Spectral Model Reconstruction (trained on MIDI 63)
3. Spectral Model Reconstruction (not trained on MIDI 63)
4. Input MIDI 60 note to Parametric Model
5. Parametric Reconstruction of input note
6. Input MIDI 63 Note
7. Parametric AE reconstruction of input
8. Parametric CVAE reconstruction of input
9. Input MIDI 65 note (endpoint-trained model)
10. Parametric AE reconstruction of input (endpoint-trained model)
11. Parametric CVAE reconstruction of input (endpoint-trained model)
12. CVAE Generated MIDI 65 Violin note
13. Similar MIDI 65 Violin note from dataset

36/36

Audio examples description II

14. CVAE Reconstruction of the MIDI 65 violin note
15. CVAE Generated MIDI 65 Violin note with vibrato

Page 33: A Variational Parametric Model for Audio Synthesiskrishnasubramani/data/ddp_stage… · analysis-synthesis based reconstruction which fails to model temporal evolution I[Engel et

1136

Parametric Model

1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)

Sustain

FFTHpR

f0Magnitudes

I Output of HpR block =rArr log-dB magnitudes + harmonics

1136

Parametric Model

1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)

Sustain

FFTHpR

f0Magnitudes

I Output of HpR block =rArr log-dB magnitudes + harmonics

1136

Parametric Model

1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)

Sustain

FFTHpR

f0Magnitudes

I Output of HpR block =rArr log-dB magnitudes + harmonics

1236

Parametric Model

2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]

I TAE =rArr Iterative Cepstral Liftering

A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))

Vi = FFT (lifter(Ai ))

f0Magnitudes

TAE

CCs

f0

1236

Parametric Model

2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]

I TAE =rArr Iterative Cepstral Liftering

A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))

Vi = FFT (lifter(Ai ))

f0Magnitudes

TAE

CCs

f0

1236

Parametric Model

2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]

I TAE =rArr Iterative Cepstral Liftering

A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))

Vi = FFT (lifter(Ai ))

f0Magnitudes

TAE

CCs

f0

1336

Parametric Model

I No open source implementation available for the TAE thus weimplemented it following procedure highlighted in[Roebel and Rodet 2005 Caetano and Rodet 2012]

I Figure shows a TAE snap from [Caetano and Rodet 2012]

I Similar to results we get 1 4 2 5

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton6)ocgs[i]state=false

3216

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton7)ocgs[i]state=false

2256

1436

Parametric Model

0 1 2 3 4 5minus80

minus60

minus40

minus20

Frequency(kHz)Magnitude(dB)

606571

I Spectral envelope shape varies across pitch

1 Dependence of envelope on pitch[Slawson 1981 Caetano and Rodet 2012]

2 Variation due the TAE algorithm

I TAE rarr smooth function to estimate harmonic amplitudes

1536

Parametric Model

I No of CCs(Cepstral CoefficientsKcc) depends on Samplingrate(Fs)pitch(f0)

Kcc leFs2f0

I For our network we choose the maximum Kcc(lowest pitch) =91 and zero pad the high pitch Kcc to 91 dimension vectors

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1736

Generative Models

Why VAE over AE

Continuous latent space from which we can sample points(andsynthesize the corresponding audio)

Why CVAE over VAE

Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control

1736

Generative Models

Why VAE over AE

Continuous latent space from which we can sample points(andsynthesize the corresponding audio)

Why CVAE over VAE

Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control

1736

Generative Models

Why VAE over AE

Continuous latent space from which we can sample points(andsynthesize the corresponding audio)

Why CVAE over VAE

Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control

1736

Generative Models

Why VAE over AE

Continuous latent space from which we can sample points(andsynthesize the corresponding audio)

Why CVAE over VAE

Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control

1836

Network Architecture

CCs

f0

En

cod

er

Z

Dec

od

er

CVAE

I Network input is CCs rarr MSE represents perceptually relevantdistance in terms of squared error between the input andreconstructed log magnitude spectral envelopes

I We train the network on frames from the train instances

I For evaluation MSE values calculated ahead is the averagereconstruction error across all the test instance frames

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

L prop MSE + βKLD

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

2036

Network Architecture

I Main hyperparameters -

2 Dimensionality of latent space - networks reconstruction ability

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

CVAEAE

Figure CVAE(β = 01) vs AE

I Steep fall initially flatter later Choose dimensionality = 32

2036

Network Architecture

I Main hyperparameters -

2 Dimensionality of latent space - networks reconstruction ability

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

CVAEAE

Figure CVAE(β = 01) vs AE

I Steep fall initially flatter later Choose dimensionality = 32

2036

Network Architecture

I Main hyperparameters -

2 Dimensionality of latent space - networks reconstruction ability

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

CVAEAE

Figure CVAE(β = 01) vs AE

I Steep fall initially flatter later Choose dimensionality = 32

2136

Network Architecture

I Network size [91 91 32 91 91]I 91 is the dimension of input CCs 32 is latent space

dimensionality

I Linear Fully Connected Layers + Leaky ReLU activations

I ADAM [Kingma and Ba 2014] with initial lr = 10minus3

I Training for 2000 epochs with batch size 512

2236

Experiments

I Two kinds of experiments to demonstrate networks capabilities

1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch

2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches

2236

Experiments

I Two kinds of experiments to demonstrate networks capabilities

1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch

2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches

2236

Experiments

I Two kinds of experiments to demonstrate networks capabilities

1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch

2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2436

Reconstruction

60 62 64 66 68 70 72

1

2

3

4

5middot10minus2

Pitch

MSE

AECVAE

I Both AE and CVAE reasonably reconstruct the target pitch1 6 2 7 3 8

I CVAE produces better reconstruction especially when thetarget pitch is far from the pitches available in the trainingdata(plot above) 4 9 5 10 6 11

I Conditioning helps to capture the pitch dependency of thespectral envelope more accurately

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton8)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton9)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton10)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton11)ocgs[i]state=false

1536

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton12)ocgs[i]state=false

1536

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton13)ocgs[i]state=false

1536

2536

Reconstruction

I To emulate the effect of pitch conditioning with an AE wetrain the AE by appending the pitch to the input CCs andreconstructing this appended input as shown below

f0

CCscAE

fprime

0

CCsprime

Figure lsquoConditionalrsquo AE(cAE)

I [Wyse 2018] followed a similar approach of appending theconditional variables to the input of his model

I the lsquocAErsquo is comparable to our proposed CVAE in that thenetwork might potentially learn something from f0

2636

Reconstruction

60 62 64 66 68 70 72

1

2

3

4

5middot10minus2

Pitch

MSE

AECVAEcAE

I We train the cAE on the octave endpoints and evaluateperformance as shown above

I Appending f0 does not improve performance It is infact worsethan AE

2736

Generation

I We are interested in lsquosynthesizingrsquo audio

I We see how well the network can generate an instance of adesired pitch(which the network has not been trained on)

Z

Dec

od

er

f0

Figure Sampling from the Network

I We follow the procedure in [Blaauw and Bonada 2016] - arandom walk to sample points coherently from the latent space

2836

Generation

I We train on instances across the entire octave sans MIDI 65and then generate MIDI 65

I On listening we can see that the generated audio still lacks thesoft noisy sound of the violin bowing

1 12 2 13 3 14

I For a more realistic synthesis we generate a violin note withvibrato 4 15

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton14)ocgs[i]state=false

2184

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton15)ocgs[i]state=false

2256

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton16)ocgs[i]state=false

2184

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton17)ocgs[i]state=false

2184

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3136

References I

[Blaauw and Bonada 2016] Blaauw M and Bonada J (2016)Modeling and transforming speech using variational autoencodersIn Interspeech pages 1770ndash1774

[Caetano and Rodet 2012] Caetano M and Rodet X (2012)A source-filter model for musical instrument sound transformationIn 2012 IEEE International Conference on Acoustics Speech and Signal Processing(ICASSP) pages 137ndash140 IEEE

[Doersch 2016] Doersch C (2016)Tutorial on variational autoencodersarXiv preprint arXiv160605908

[Engel et al 2017] Engel J Resnick C Roberts A Dieleman S Norouzi MEck D and Simonyan K (2017)Neural audio synthesis of musical notes with wavenet autoencodersIn Proceedings of the 34th International Conference on Machine Learning-Volume70 pages 1068ndash1077 JMLR org

[Esling et al 2018] Esling P Bitton A et al (2018)Generative timbre spaces regularizing variational auto-encoders with perceptualmetricsarXiv preprint arXiv180508501

3236

References II

[Griffin and Lim 1984] Griffin D and Lim J (1984)Signal estimation from modified short-time fourier transformIEEE Transactions on Acoustics Speech and Signal Processing 32(2)236ndash243

[Hinton and Salakhutdinov 2006] Hinton G E and Salakhutdinov R R (2006)Reducing the dimensionality of data with neural networksscience 313(5786)504ndash507

[IMAI 1979] IMAI S (1979)Spectral envelope extraction by improved cepstrumIEICE 62217ndash228

[Kingma and Ba 2014] Kingma D P and Ba J (2014)Adam A method for stochastic optimizationarXiv preprint arXiv14126980

[Kingma and Welling 2013] Kingma D P and Welling M (2013)Auto-encoding variational bayesarXiv preprint arXiv13126114

[Oord et al 2016] Oord A v d Dieleman S Zen H Simonyan K Vinyals OGraves A Kalchbrenner N Senior A and Kavukcuoglu K (2016)Wavenet A generative model for raw audioarXiv preprint arXiv160903499

3336

References III

[Roche et al 2018] Roche F Hueber T Limier S and Girin L (2018)Autoencoders for music sound modeling a comparison of linear shallow deeprecurrent and variational modelsarXiv preprint arXiv180604096

[Roebel and Rodet 2005] Roebel A and Rodet X (2005)Efficient Spectral Envelope Estimation and its application to pitch shifting andenvelope preservationIn International Conference on Digital Audio Effects pages 30ndash35 Madrid Spaincote interne IRCAM Roebel05b

[Romani Picas et al 2015] Romani Picas O Parra Rodriguez H Dabiri DTokuda H Hariya W Oishi K and Serra X (2015)A real-time system for measuring sound goodness in instrumental soundsIn Audio Engineering Society Convention 138 Audio Engineering Society

[Sarroff and Casey 2014] Sarroff A M and Casey M A (2014)Musical audio synthesis using autoencoding neural netsIn ICMC

[Serra et al 1997] Serra X et al (1997)Musical sound modeling with sinusoids plus noiseMusical signal processing pages 91ndash122

3436

References IV

[Slawson 1981] Slawson W (1981)The color of sound a theoretical study in musical timbreMusic Theory Spectrum 3132ndash141

[Sohn et al 2015] Sohn K Lee H and Yan X (2015)Learning structured output representation using deep conditional generative models

In Advances in neural information processing systems pages 3483ndash3491

[Wyse 2018] Wyse L (2018)Real-valued parametric conditioning of an rnn for interactive sound synthesisarXiv preprint arXiv180510808

3536

Audio examples description I

1 Input MIDI 63 to Spectral Model

2 Spectral Model Reconstruction(trained on MIDI63)

3 Spectral Model Reconstruction(not trained on MIDI63)

4 Input MIDI 60 note to Parametric Model

5 Parametric Reconstruction of input note

6 Input MIDI 63 Note

7 Parametric AE reconstruction of input

8 Parametric CVAE reconstruction of input

9 Input MIDI 65 note(endpoint trained model)

10 Parametric AE reconstruction of input(endpoint trained model)

11 Parametric CVAE reconstruction of input(endpoint trained model)

12 CVAE Generated MIDI 65 Violin note

13 Similar MIDI 65 Violin note from dataset

3636

Audio examples description II

14 CVAE Reconstruction of the MIDI 65 violin note

15 CVAE Generated MIDI 65 Violin note with vibrato

  • Notes
      1. fdrm17
      2. fdrm16
      3. fdrm15
      4. fdrm14
      5. fdrm13
      6. fdrm12
      7. fdrm11
      8. fdrm10
      9. fdrm9
      10. fdrm8
      11. fdrm7
      12. fdrm6
      13. fdrm5
      14. fdrm3
      15. fdrm1
Page 34: A Variational Parametric Model for Audio Synthesiskrishnasubramani/data/ddp_stage… · analysis-synthesis based reconstruction which fails to model temporal evolution I[Engel et

1136

Parametric Model

1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)

Sustain

FFTHpR

f0Magnitudes

I Output of HpR block =rArr log-dB magnitudes + harmonics

1136

Parametric Model

1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)

Sustain

FFTHpR

f0Magnitudes

I Output of HpR block =rArr log-dB magnitudes + harmonics

1236

Parametric Model

2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]

I TAE =rArr Iterative Cepstral Liftering

A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))

Vi = FFT (lifter(Ai ))

f0Magnitudes

TAE

CCs

f0

1236

Parametric Model

2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]

I TAE =rArr Iterative Cepstral Liftering

A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))

Vi = FFT (lifter(Ai ))

f0Magnitudes

TAE

CCs

f0

1236

Parametric Model

2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]

I TAE =rArr Iterative Cepstral Liftering

A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))

Vi = FFT (lifter(Ai ))

f0Magnitudes

TAE

CCs

f0

1336

Parametric Model

I No open source implementation available for the TAE thus weimplemented it following procedure highlighted in[Roebel and Rodet 2005 Caetano and Rodet 2012]

I Figure shows a TAE snap from [Caetano and Rodet 2012]

I Similar to results we get 1 4 2 5

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton6)ocgs[i]state=false

3216

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton7)ocgs[i]state=false

2256

1436

Parametric Model

0 1 2 3 4 5minus80

minus60

minus40

minus20

Frequency(kHz)Magnitude(dB)

606571

I Spectral envelope shape varies across pitch

1 Dependence of envelope on pitch[Slawson 1981 Caetano and Rodet 2012]

2 Variation due the TAE algorithm

I TAE rarr smooth function to estimate harmonic amplitudes

1536

Parametric Model

I No of CCs(Cepstral CoefficientsKcc) depends on Samplingrate(Fs)pitch(f0)

Kcc leFs2f0

I For our network we choose the maximum Kcc(lowest pitch) =91 and zero pad the high pitch Kcc to 91 dimension vectors

16/36

Generative Models

• Autoencoders [Hinton and Salakhutdinov, 2006] - minimize the MSE (mean squared error) between the input and the network reconstruction
  - Simple to train and gives good reconstruction (since they minimize MSE)
  - Not truly a generative model, as you cannot generate new data
• Variational Autoencoders [Kingma and Welling, 2013] - inspired by variational inference; enforce a prior on the latent space
  - Truly generative, as we can 'generate' new data by sampling from the prior
• Conditional Variational Autoencoders [Doersch, 2016; Sohn et al., 2015] - same principle as a VAE, but learn the conditional distribution given an additional conditioning variable

17/36

Generative Models

Why VAE over AE?
• Continuous latent space from which we can sample points (and synthesize the corresponding audio)

Why CVAE over VAE?
• Conditioning on pitch ⇒ the network captures dependencies between the timbre and the pitch ⇒ more accurate envelope generation + pitch control

18/36

Network Architecture

[Block diagram: CCs + f0 → Encoder → Z; Z + f0 → Decoder → CCs (CVAE)]

• The network input is CCs → the MSE represents a perceptually relevant distance: the squared error between the input and reconstructed log-magnitude spectral envelopes
• We train the network on frames from the training instances
• For evaluation, the MSE values reported ahead are the average reconstruction error across all the test instance frames
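
A minimal PyTorch sketch of a pitch-conditioned VAE over 91-dimensional CC frames; the layer sizes roughly follow the slide, but details such as how f0 is injected (here a single normalized scalar concatenated to the encoder input and to the latent code) are assumptions:

    # Sketch of the CVAE over CC frames, with f0 as the conditioning variable.
    import torch
    import torch.nn as nn

    class CVAE(nn.Module):
        def __init__(self, cc_dim=91, latent_dim=32, cond_dim=1):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(cc_dim + cond_dim, 91), nn.LeakyReLU())
            self.to_mu = nn.Linear(91, latent_dim)
            self.to_logvar = nn.Linear(91, latent_dim)
            self.decoder = nn.Sequential(
                nn.Linear(latent_dim + cond_dim, 91), nn.LeakyReLU(),
                nn.Linear(91, cc_dim),
            )

        def reparameterize(self, mu, logvar):
            # z = mu + sigma * eps, eps ~ N(0, I)
            return mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)

        def forward(self, ccs, f0):
            h = self.encoder(torch.cat([ccs, f0], dim=-1))
            mu, logvar = self.to_mu(h), self.to_logvar(h)
            z = self.reparameterize(mu, logvar)
            recon = self.decoder(torch.cat([z, f0], dim=-1))
            return recon, mu, logvar

    # toy usage: a batch of 8 frames with normalized pitch values
    model = CVAE()
    recon, mu, logvar = model(torch.randn(8, 91), torch.rand(8, 1))
    print(recon.shape, mu.shape)            # torch.Size([8, 91]) torch.Size([8, 32])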

19/36

Network Architecture

• Main hyperparameters:
  1. β - controls the relative weighting between reconstruction and prior enforcement:

      L ∝ MSE + β · KLD

Figure: CVAE, varying β (MSE, ×10⁻⁵, vs. latent space dimensionality, for β = 0.01, 0.1, 1)

• Tradeoff between both terms; we choose β = 0.1
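
The objective above, written out for a diagonal-Gaussian posterior against a standard-normal prior (the closed-form KL term is an assumption about the exact formulation used):

    # Reconstruction MSE plus a beta-weighted KL divergence to N(0, I).
    import torch
    import torch.nn.functional as F

    def cvae_loss(recon, target, mu, logvar, beta=0.1):
        mse = F.mse_loss(recon, target)
        kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return mse + beta * kld

    # toy numbers: with mu = 0 and logvar = 0 the KL term vanishes
    print(cvae_loss(torch.randn(8, 91), torch.randn(8, 91),
                    torch.zeros(8, 32), torch.zeros(8, 32)).item())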

20/36

Network Architecture

• Main hyperparameters:
  2. Dimensionality of the latent space - the network's reconstruction ability

Figure: CVAE (β = 0.1) vs. AE (MSE, ×10⁻⁵, vs. latent space dimensionality)

• Steep fall initially, flatter later; we choose dimensionality = 32

21/36

Network Architecture

• Network size: [91, 91, 32, 91, 91]
  - 91 is the dimension of the input CCs, 32 is the latent space dimensionality
• Linear fully connected layers + Leaky ReLU activations
• ADAM [Kingma and Ba, 2014] with initial lr = 10⁻³
• Training for 2000 epochs with batch size 512
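
A compact, self-contained sketch of this training setup (Adam with lr = 1e-3, 2000 epochs, batch size 512, β = 0.1); the model is collapsed into two small Sequential blocks and the dataset is random placeholder data, so everything except the listed hyperparameters is an assumption:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    encoder = nn.Sequential(nn.Linear(92, 91), nn.LeakyReLU(), nn.Linear(91, 64))  # -> (mu, logvar)
    decoder = nn.Sequential(nn.Linear(33, 91), nn.LeakyReLU(), nn.Linear(91, 91))
    optimizer = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

    ccs, f0 = torch.randn(4096, 91), torch.rand(4096, 1)   # placeholder CC frames + pitch
    num_epochs, batch_size, beta = 2000, 512, 0.1

    for epoch in range(num_epochs):
        perm = torch.randperm(len(ccs))
        for i in range(0, len(ccs), batch_size):
            idx = perm[i:i + batch_size]
            x, c = ccs[idx], f0[idx]
            mu, logvar = encoder(torch.cat([x, c], dim=-1)).chunk(2, dim=-1)
            z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
            recon = decoder(torch.cat([z, c], dim=-1))
            kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
            loss = F.mse_loss(recon, x) + beta * kld
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()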

22/36

Experiments

• Two kinds of experiments to demonstrate the network's capabilities:
  1. Reconstruction - omit pitch instances during training and see how well the model reconstructs notes of the omitted target pitch
  2. Generation - how well the model 'synthesizes' note instances with new, unseen pitches

23/36

Reconstruction

• Two training contexts -
  1. T (×) is the target pitch, ✓ marks the training instances:

     MIDI   T−3  T−2  T−1   T   T+1  T+2  T+3
     Kept    ✓    ✓    ✓    ×    ✓    ✓    ✓

  2. Octave endpoints:

     MIDI   60  61  62  63  64  65  66  67  68  69  70  71
     Kept    ✓   ×   ×   ×   ×   ×   ×   ×   ×   ×   ×   ✓

• In each of the above cases, we compute the MSE as the frame-wise spectral envelope match across all frames of all the target instances, as sketched below
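
A small sketch of that evaluation metric, with the array shapes assumed:

    # Average frame-wise MSE between true and reconstructed log-magnitude
    # spectral envelopes, pooled over all frames of all target test instances.
    import numpy as np

    def envelope_mse(true_frames, recon_frames):
        # both arrays: (num_frames, envelope_dim)
        return np.mean((true_frames - recon_frames) ** 2)

    rng = np.random.default_rng(1)
    true_frames = rng.normal(size=(1000, 91))
    recon_frames = true_frames + 0.01 * rng.normal(size=(1000, 91))
    print(envelope_mse(true_frames, recon_frames))   # about 1e-4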

24/36

Reconstruction

[Figure: MSE (×10⁻²) vs. target pitch (MIDI 60-72) for AE and CVAE]

• Both the AE and the CVAE reasonably reconstruct the target pitch (audio examples 6, 7, 8)
• The CVAE produces better reconstruction, especially when the target pitch is far from the pitches available in the training data (plot above) (audio examples 9, 10, 11)
• Conditioning helps to capture the pitch dependency of the spectral envelope more accurately


25/36

Reconstruction

• To emulate the effect of pitch conditioning with an AE, we train the AE by appending the pitch to the input CCs and reconstructing this appended input, as shown below

Figure: the 'conditional' AE (cAE) - (f0, CCs) → cAE → (f0′, CCs′)

• [Wyse, 2018] followed a similar approach of appending the conditional variables to the input of his model
• The 'cAE' is comparable to our proposed CVAE in that the network might potentially learn something from f0
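
A small sketch of this cAE baseline, contrasting it with the conditioned model earlier: the pitch is simply appended to the CC vector and the appended vector is reconstructed (the layer sizes mirror the earlier sketch and are assumptions):

    import torch
    import torch.nn as nn

    cae = nn.Sequential(
        nn.Linear(92, 91), nn.LeakyReLU(), nn.Linear(91, 32),   # encoder over [CCs, f0]
        nn.Linear(32, 91), nn.LeakyReLU(), nn.Linear(91, 92),   # decoder back to [CCs', f0']
    )

    ccs, f0 = torch.randn(8, 91), torch.rand(8, 1)
    x = torch.cat([ccs, f0], dim=-1)                 # input: CCs with pitch appended
    recon = cae(x)                                   # target is the same appended vector
    print(nn.functional.mse_loss(recon, x).item())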

26/36

Reconstruction

[Figure: MSE (×10⁻²) vs. target pitch (MIDI 60-72) for AE, CVAE and cAE]

• We train the cAE on the octave endpoints and evaluate performance as shown above
• Appending f0 does not improve performance; it is in fact worse than the AE

27/36

Generation

• We are interested in 'synthesizing' audio
• We see how well the network can generate an instance of a desired pitch (which the network has not been trained on)

Figure: sampling from the network - Z (from the prior) + f0 → Decoder

• We follow the procedure in [Blaauw and Bonada, 2016] - a random walk to sample points coherently from the latent space
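
A sketch of that generation step as described above: a small Gaussian random walk through the latent space, decoded frame by frame at the desired (unseen) pitch; the untrained decoder stand-in, the step size and the pitch encoding are assumptions:

    import torch
    import torch.nn as nn

    decoder = nn.Sequential(nn.Linear(33, 91), nn.LeakyReLU(), nn.Linear(91, 91))

    def random_walk_latents(num_frames, latent_dim=32, step=0.05):
        z = torch.randn(latent_dim)                  # start from the prior
        path = []
        for _ in range(num_frames):
            z = z + step * torch.randn(latent_dim)   # small, coherent steps
            path.append(z)
        return torch.stack(path)

    f0_target = torch.full((200, 1), 0.5)            # desired pitch, normalized (assumed)
    z_path = random_walk_latents(200)
    cc_frames = decoder(torch.cat([z_path, f0_target], dim=-1))
    print(cc_frames.shape)                           # (200, 91): one envelope per frame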

28/36

Generation

• We train on instances across the entire octave sans MIDI 65, and then generate MIDI 65
• On listening, the generated audio still lacks the soft, noisy sound of the violin bowing (audio examples 12, 13, 14)
• For a more realistic synthesis, we generate a violin note with vibrato (audio example 15)


29/36

Putting it all together

• We explored autoencoder frameworks in generative models for audio synthesis of instrumental tones
• Our parametric representation decouples 'timbre' and 'pitch', relying on the network to model their inter-dependencies
• Pitch conditioning allows us to generate the learnt spectral envelope for that pitch, thus enabling us to vary the pitch contour continuously

30/36

Putting it all together

• Our model is still far from perfect, however, and we have to improve it further:
  1. An immediate step is to model the residual of the input audio; this will help make the generated audio sound more natural
  2. We have not taken dynamics into account; a more complete system will capture spectral envelope (timbral) dependencies on both pitch and loudness dynamics
  3. Conducting more formal listening tests, involving synthesis of larger pitch movements or melodic elements from Indian Classical music
• Our contributions:
  1. To the best of our knowledge, we have not come across any work using a parametric model for musical tones in the neural synthesis framework, especially one exploiting the conditioning function of the CVAE
  2. The TAE method has no known Python implementation, so we plan to make our code open-source to the MIR community for research

31/36

References I

[Blaauw and Bonada, 2016] Blaauw, M. and Bonada, J. (2016). Modeling and transforming speech using variational autoencoders. In Interspeech, pages 1770-1774.

[Caetano and Rodet, 2012] Caetano, M. and Rodet, X. (2012). A source-filter model for musical instrument sound transformation. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 137-140. IEEE.

[Doersch, 2016] Doersch, C. (2016). Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.

[Engel et al., 2017] Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D., and Simonyan, K. (2017). Neural audio synthesis of musical notes with WaveNet autoencoders. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1068-1077. JMLR.org.

[Esling et al., 2018] Esling, P., Bitton, A., et al. (2018). Generative timbre spaces: regularizing variational auto-encoders with perceptual metrics. arXiv preprint arXiv:1805.08501.

32/36

References II

[Griffin and Lim, 1984] Griffin, D. and Lim, J. (1984). Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236-243.

[Hinton and Salakhutdinov, 2006] Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504-507.

[Imai, 1979] Imai, S. (1979). Spectral envelope extraction by improved cepstrum. IEICE, 62:217-228.

[Kingma and Ba, 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

[Kingma and Welling, 2013] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

[Oord et al., 2016] Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.

33/36

References III

[Roche et al., 2018] Roche, F., Hueber, T., Limier, S., and Girin, L. (2018). Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models. arXiv preprint arXiv:1806.04096.

[Roebel and Rodet, 2005] Roebel, A. and Rodet, X. (2005). Efficient spectral envelope estimation and its application to pitch shifting and envelope preservation. In International Conference on Digital Audio Effects, pages 30-35, Madrid, Spain.

[Romani Picas et al., 2015] Romani Picas, O., Parra Rodriguez, H., Dabiri, D., Tokuda, H., Hariya, W., Oishi, K., and Serra, X. (2015). A real-time system for measuring sound goodness in instrumental sounds. In Audio Engineering Society Convention 138. Audio Engineering Society.

[Sarroff and Casey, 2014] Sarroff, A. M. and Casey, M. A. (2014). Musical audio synthesis using autoencoding neural nets. In ICMC.

[Serra et al., 1997] Serra, X. et al. (1997). Musical sound modeling with sinusoids plus noise. Musical Signal Processing, pages 91-122.

34/36

References IV

[Slawson, 1981] Slawson, W. (1981). The color of sound: a theoretical study in musical timbre. Music Theory Spectrum, 3:132-141.

[Sohn et al., 2015] Sohn, K., Lee, H., and Yan, X. (2015). Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483-3491.

[Wyse, 2018] Wyse, L. (2018). Real-valued parametric conditioning of an RNN for interactive sound synthesis. arXiv preprint arXiv:1805.10808.

35/36

Audio examples description I

1. Input MIDI 63 to Spectral Model
2. Spectral Model Reconstruction (trained on MIDI 63)
3. Spectral Model Reconstruction (not trained on MIDI 63)
4. Input MIDI 60 note to Parametric Model
5. Parametric Reconstruction of input note
6. Input MIDI 63 Note
7. Parametric AE reconstruction of input
8. Parametric CVAE reconstruction of input
9. Input MIDI 65 note (endpoint trained model)
10. Parametric AE reconstruction of input (endpoint trained model)
11. Parametric CVAE reconstruction of input (endpoint trained model)
12. CVAE Generated MIDI 65 Violin note
13. Similar MIDI 65 Violin note from dataset

Page 35: A Variational Parametric Model for Audio Synthesiskrishnasubramani/data/ddp_stage… · analysis-synthesis based reconstruction which fails to model temporal evolution I[Engel et

1136

Parametric Model

1 Frame-wise magnitude spectrum rarr harmonic representationusing Harmonic plus Residual(HpR) model[Serra et al 1997](currently we neglect the residual)

Sustain

FFTHpR

f0Magnitudes

I Output of HpR block =rArr log-dB magnitudes + harmonics

1236

Parametric Model

2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]

I TAE =rArr Iterative Cepstral Liftering

A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))

Vi = FFT (lifter(Ai ))

f0Magnitudes

TAE

CCs

f0

1236

Parametric Model

2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]

I TAE =rArr Iterative Cepstral Liftering

A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))

Vi = FFT (lifter(Ai ))

f0Magnitudes

TAE

CCs

f0

1236

Parametric Model

2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]

I TAE =rArr Iterative Cepstral Liftering

A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))

Vi = FFT (lifter(Ai ))

f0Magnitudes

TAE

CCs

f0

1336

Parametric Model

I No open source implementation available for the TAE thus weimplemented it following procedure highlighted in[Roebel and Rodet 2005 Caetano and Rodet 2012]

I Figure shows a TAE snap from [Caetano and Rodet 2012]

I Similar to results we get 1 4 2 5

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton6)ocgs[i]state=false

3216

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton7)ocgs[i]state=false

2256

1436

Parametric Model

0 1 2 3 4 5minus80

minus60

minus40

minus20

Frequency(kHz)Magnitude(dB)

606571

I Spectral envelope shape varies across pitch

1 Dependence of envelope on pitch[Slawson 1981 Caetano and Rodet 2012]

2 Variation due the TAE algorithm

I TAE rarr smooth function to estimate harmonic amplitudes

1536

Parametric Model

I No of CCs(Cepstral CoefficientsKcc) depends on Samplingrate(Fs)pitch(f0)

Kcc leFs2f0

I For our network we choose the maximum Kcc(lowest pitch) =91 and zero pad the high pitch Kcc to 91 dimension vectors

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1736

Generative Models

Why VAE over AE

Continuous latent space from which we can sample points(andsynthesize the corresponding audio)

Why CVAE over VAE

Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control

1736

Generative Models

Why VAE over AE

Continuous latent space from which we can sample points(andsynthesize the corresponding audio)

Why CVAE over VAE

Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control

1736

Generative Models

Why VAE over AE

Continuous latent space from which we can sample points(andsynthesize the corresponding audio)

Why CVAE over VAE

Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control

1736

Generative Models

Why VAE over AE

Continuous latent space from which we can sample points(andsynthesize the corresponding audio)

Why CVAE over VAE

Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control

1836

Network Architecture

CCs

f0

En

cod

er

Z

Dec

od

er

CVAE

I Network input is CCs rarr MSE represents perceptually relevantdistance in terms of squared error between the input andreconstructed log magnitude spectral envelopes

I We train the network on frames from the train instances

I For evaluation MSE values calculated ahead is the averagereconstruction error across all the test instance frames

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

L prop MSE + βKLD

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

2036

Network Architecture

I Main hyperparameters -

2 Dimensionality of latent space - networks reconstruction ability

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

CVAEAE

Figure CVAE(β = 01) vs AE

I Steep fall initially flatter later Choose dimensionality = 32

2036

Network Architecture

I Main hyperparameters -

2 Dimensionality of latent space - networks reconstruction ability

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

CVAEAE

Figure CVAE(β = 01) vs AE

I Steep fall initially flatter later Choose dimensionality = 32

2036

Network Architecture

I Main hyperparameters -

2 Dimensionality of latent space - networks reconstruction ability

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

CVAEAE

Figure CVAE(β = 01) vs AE

I Steep fall initially flatter later Choose dimensionality = 32

2136

Network Architecture

I Network size [91 91 32 91 91]I 91 is the dimension of input CCs 32 is latent space

dimensionality

I Linear Fully Connected Layers + Leaky ReLU activations

I ADAM [Kingma and Ba 2014] with initial lr = 10minus3

I Training for 2000 epochs with batch size 512

2236

Experiments

I Two kinds of experiments to demonstrate networks capabilities

1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch

2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches

2236

Experiments

I Two kinds of experiments to demonstrate networks capabilities

1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch

2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches

2236

Experiments

I Two kinds of experiments to demonstrate networks capabilities

1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch

2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2436

Reconstruction

60 62 64 66 68 70 72

1

2

3

4

5middot10minus2

Pitch

MSE

AECVAE

I Both AE and CVAE reasonably reconstruct the target pitch1 6 2 7 3 8

I CVAE produces better reconstruction especially when thetarget pitch is far from the pitches available in the trainingdata(plot above) 4 9 5 10 6 11

I Conditioning helps to capture the pitch dependency of thespectral envelope more accurately

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton8)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton9)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton10)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton11)ocgs[i]state=false

1536

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton12)ocgs[i]state=false

1536

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton13)ocgs[i]state=false

1536

2536

Reconstruction

I To emulate the effect of pitch conditioning with an AE wetrain the AE by appending the pitch to the input CCs andreconstructing this appended input as shown below

f0

CCscAE

fprime

0

CCsprime

Figure lsquoConditionalrsquo AE(cAE)

I [Wyse 2018] followed a similar approach of appending theconditional variables to the input of his model

I the lsquocAErsquo is comparable to our proposed CVAE in that thenetwork might potentially learn something from f0

2636

Reconstruction

60 62 64 66 68 70 72

1

2

3

4

5middot10minus2

Pitch

MSE

AECVAEcAE

I We train the cAE on the octave endpoints and evaluateperformance as shown above

I Appending f0 does not improve performance It is infact worsethan AE

2736

Generation

I We are interested in lsquosynthesizingrsquo audio

I We see how well the network can generate an instance of adesired pitch(which the network has not been trained on)

Z

Dec

od

er

f0

Figure Sampling from the Network

I We follow the procedure in [Blaauw and Bonada 2016] - arandom walk to sample points coherently from the latent space

2836

Generation

I We train on instances across the entire octave sans MIDI 65and then generate MIDI 65

I On listening we can see that the generated audio still lacks thesoft noisy sound of the violin bowing

1 12 2 13 3 14

I For a more realistic synthesis we generate a violin note withvibrato 4 15

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton14)ocgs[i]state=false

2184

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton15)ocgs[i]state=false

2256

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton16)ocgs[i]state=false

2184

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton17)ocgs[i]state=false

2184

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3136

References I

[Blaauw and Bonada 2016] Blaauw M and Bonada J (2016)Modeling and transforming speech using variational autoencodersIn Interspeech pages 1770ndash1774

[Caetano and Rodet 2012] Caetano M and Rodet X (2012)A source-filter model for musical instrument sound transformationIn 2012 IEEE International Conference on Acoustics Speech and Signal Processing(ICASSP) pages 137ndash140 IEEE

[Doersch 2016] Doersch C (2016)Tutorial on variational autoencodersarXiv preprint arXiv160605908

[Engel et al 2017] Engel J Resnick C Roberts A Dieleman S Norouzi MEck D and Simonyan K (2017)Neural audio synthesis of musical notes with wavenet autoencodersIn Proceedings of the 34th International Conference on Machine Learning-Volume70 pages 1068ndash1077 JMLR org

[Esling et al 2018] Esling P Bitton A et al (2018)Generative timbre spaces regularizing variational auto-encoders with perceptualmetricsarXiv preprint arXiv180508501

3236

References II

[Griffin and Lim 1984] Griffin D and Lim J (1984)Signal estimation from modified short-time fourier transformIEEE Transactions on Acoustics Speech and Signal Processing 32(2)236ndash243

[Hinton and Salakhutdinov 2006] Hinton G E and Salakhutdinov R R (2006)Reducing the dimensionality of data with neural networksscience 313(5786)504ndash507

[IMAI 1979] IMAI S (1979)Spectral envelope extraction by improved cepstrumIEICE 62217ndash228

[Kingma and Ba 2014] Kingma D P and Ba J (2014)Adam A method for stochastic optimizationarXiv preprint arXiv14126980

[Kingma and Welling 2013] Kingma D P and Welling M (2013)Auto-encoding variational bayesarXiv preprint arXiv13126114

[Oord et al 2016] Oord A v d Dieleman S Zen H Simonyan K Vinyals OGraves A Kalchbrenner N Senior A and Kavukcuoglu K (2016)Wavenet A generative model for raw audioarXiv preprint arXiv160903499

3336

References III

[Roche et al 2018] Roche F Hueber T Limier S and Girin L (2018)Autoencoders for music sound modeling a comparison of linear shallow deeprecurrent and variational modelsarXiv preprint arXiv180604096

[Roebel and Rodet 2005] Roebel A and Rodet X (2005)Efficient Spectral Envelope Estimation and its application to pitch shifting andenvelope preservationIn International Conference on Digital Audio Effects pages 30ndash35 Madrid Spaincote interne IRCAM Roebel05b

[Romani Picas et al 2015] Romani Picas O Parra Rodriguez H Dabiri DTokuda H Hariya W Oishi K and Serra X (2015)A real-time system for measuring sound goodness in instrumental soundsIn Audio Engineering Society Convention 138 Audio Engineering Society

[Sarroff and Casey 2014] Sarroff A M and Casey M A (2014)Musical audio synthesis using autoencoding neural netsIn ICMC

[Serra et al 1997] Serra X et al (1997)Musical sound modeling with sinusoids plus noiseMusical signal processing pages 91ndash122

3436

References IV

[Slawson 1981] Slawson W (1981)The color of sound a theoretical study in musical timbreMusic Theory Spectrum 3132ndash141

[Sohn et al 2015] Sohn K Lee H and Yan X (2015)Learning structured output representation using deep conditional generative models

In Advances in neural information processing systems pages 3483ndash3491

[Wyse 2018] Wyse L (2018)Real-valued parametric conditioning of an rnn for interactive sound synthesisarXiv preprint arXiv180510808

3536

Audio examples description I

1 Input MIDI 63 to Spectral Model

2 Spectral Model Reconstruction(trained on MIDI63)

3 Spectral Model Reconstruction(not trained on MIDI63)

4 Input MIDI 60 note to Parametric Model

5 Parametric Reconstruction of input note

6 Input MIDI 63 Note

7 Parametric AE reconstruction of input

8 Parametric CVAE reconstruction of input

9 Input MIDI 65 note(endpoint trained model)

10 Parametric AE reconstruction of input(endpoint trained model)

11 Parametric CVAE reconstruction of input(endpoint trained model)

12 CVAE Generated MIDI 65 Violin note

13 Similar MIDI 65 Violin note from dataset

3636

Audio examples description II

14 CVAE Reconstruction of the MIDI 65 violin note

15 CVAE Generated MIDI 65 Violin note with vibrato

  • Notes
      1. fdrm17
      2. fdrm16
      3. fdrm15
      4. fdrm14
      5. fdrm13
      6. fdrm12
      7. fdrm11
      8. fdrm10
      9. fdrm9
      10. fdrm8
      11. fdrm7
      12. fdrm6
      13. fdrm5
      14. fdrm3
      15. fdrm1
Page 36: A Variational Parametric Model for Audio Synthesiskrishnasubramani/data/ddp_stage… · analysis-synthesis based reconstruction which fails to model temporal evolution I[Engel et

1236

Parametric Model

2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]

I TAE =rArr Iterative Cepstral Liftering

A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))

Vi = FFT (lifter(Ai ))

f0Magnitudes

TAE

CCs

f0

1236

Parametric Model

2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]

I TAE =rArr Iterative Cepstral Liftering

A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))

Vi = FFT (lifter(Ai ))

f0Magnitudes

TAE

CCs

f0

1236

Parametric Model

2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]

I TAE =rArr Iterative Cepstral Liftering

A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))

Vi = FFT (lifter(Ai ))

f0Magnitudes

TAE

CCs

f0

1336

Parametric Model

I No open source implementation available for the TAE thus weimplemented it following procedure highlighted in[Roebel and Rodet 2005 Caetano and Rodet 2012]

I Figure shows a TAE snap from [Caetano and Rodet 2012]

I Similar to results we get 1 4 2 5

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton6)ocgs[i]state=false

3216

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton7)ocgs[i]state=false

2256

1436

Parametric Model

0 1 2 3 4 5minus80

minus60

minus40

minus20

Frequency(kHz)Magnitude(dB)

606571

I Spectral envelope shape varies across pitch

1 Dependence of envelope on pitch[Slawson 1981 Caetano and Rodet 2012]

2 Variation due the TAE algorithm

I TAE rarr smooth function to estimate harmonic amplitudes

1536

Parametric Model

I No of CCs(Cepstral CoefficientsKcc) depends on Samplingrate(Fs)pitch(f0)

Kcc leFs2f0

I For our network we choose the maximum Kcc(lowest pitch) =91 and zero pad the high pitch Kcc to 91 dimension vectors

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable


Generative Models

Why VAE over AE?

A continuous latent space from which we can sample points (and synthesize the corresponding audio)

Why CVAE over VAE?

Conditioning on pitch ⇒ the network captures dependencies between timbre and pitch ⇒ more accurate envelope generation + pitch control


Network Architecture

[Block diagram: CVAE - (CCs, f0) → Encoder → Z → Decoder → CCs, with f0 as the conditioning variable]

▶ The network input is CCs → the MSE represents a perceptually relevant distance, the squared error between the input and reconstructed log-magnitude spectral envelopes

▶ We train the network on frames from the training instances

▶ For evaluation, the MSE values reported ahead are the average reconstruction error across all frames of the test instances


Network Architecture

▶ Main hyperparameters:

1. β - controls the relative weighting between reconstruction and prior enforcement

L ∝ MSE + β · KLD

[Plot: MSE vs. latent space dimensionality for β = 0.01, 0.1, 1]

Figure: CVAE, varying β

▶ Trade-off between both terms; we choose β = 0.1
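A minimal PyTorch-style sketch of this β-weighted loss, assuming the encoder outputs a per-dimension mean and log-variance (names are illustrative):

```python
import torch
import torch.nn.functional as F

def cvae_loss(x_hat, x, mu, logvar, beta=0.1):
    """Reconstruction + beta-weighted KL divergence to the unit-Gaussian prior."""
    mse = F.mse_loss(x_hat, x, reduction="mean")                    # envelope reconstruction error
    kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())  # KL(q(z|x,f0) || N(0, I))
    return mse + beta * kld
```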


Network Architecture

▶ Main hyperparameters:

2. Dimensionality of the latent space - governs the network's reconstruction ability

[Plot: MSE vs. latent space dimensionality for the CVAE and the AE]

Figure: CVAE (β = 0.1) vs. AE

▶ Steep fall initially, flatter later; we choose dimensionality = 32


Network Architecture

▶ Network size: [91, 91, 32, 91, 91]

▶ 91 is the dimension of the input CCs; 32 is the latent space dimensionality

▶ Linear fully connected layers + Leaky ReLU activations

▶ ADAM [Kingma and Ba 2014] with initial lr = 10^−3

▶ Training for 2000 epochs with batch size 512 (see the sketch below)
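A compact PyTorch sketch consistent with these sizes; the exact layer arrangement and how f0 is injected are assumptions for illustration, not the exact training code:

```python
import torch
import torch.nn as nn

class CVAE(nn.Module):
    """Sketch: 91-d CC frames conditioned on f0, 32-d latent space."""
    def __init__(self, x_dim=91, h_dim=91, z_dim=32, c_dim=1):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim + c_dim, h_dim), nn.LeakyReLU())
        self.mu, self.logvar = nn.Linear(h_dim, z_dim), nn.Linear(h_dim, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim + c_dim, h_dim), nn.LeakyReLU(),
                                 nn.Linear(h_dim, x_dim))

    def forward(self, x, f0):
        h = self.enc(torch.cat([x, f0], dim=-1))               # encode CCs with the pitch condition
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization trick
        x_hat = self.dec(torch.cat([z, f0], dim=-1))           # decode conditioned on pitch
        return x_hat, mu, logvar

# Training configuration from the slides (data loading omitted):
# model = CVAE(); opt = torch.optim.Adam(model.parameters(), lr=1e-3)
# for 2000 epochs, over batches of 512 frames:
#     x_hat, mu, logvar = model(ccs, f0)
#     loss = cvae_loss(x_hat, ccs, mu, logvar, beta=0.1)
```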


Experiments

▶ Two kinds of experiments to demonstrate the network's capabilities:

1. Reconstruction - omit pitch instances during training and see how well the model reconstructs notes of the omitted target pitch

2. Generation - how well the model 'synthesizes' note instances with new, unseen pitches


Reconstruction

▶ Two training contexts:

1. T (✗) is the target; ✓ marks the training instances:

   MIDI:  T-3  T-2  T-1   T   T+1  T+2  T+3
   Kept:   ✓    ✓    ✓    ✗    ✓    ✓    ✓

2. Octave endpoints:

   MIDI:  60  61  62  63  64  65  66  67  68  69  70  71
   Kept:   ✓   ✗   ✗   ✗   ✗   ✗   ✗   ✗   ✗   ✗   ✗   ✓

▶ In each of the above cases we compute the MSE as the frame-wise spectral envelope match across all frames of all the target instances (see the evaluation sketch below)
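A small sketch of how such a pitch-held-out split and the frame-wise MSE could be computed; the data structures and names here are hypothetical:

```python
import numpy as np

def split_by_pitch(frames_by_midi, held_out):
    """frames_by_midi: dict {midi_pitch: array of CC frames}; held_out: set of target MIDI pitches."""
    train = [f for m, fr in frames_by_midi.items() if m not in held_out for f in fr]
    test = [f for m, fr in frames_by_midi.items() if m in held_out for f in fr]
    return np.array(train), np.array(test)

def framewise_mse(reconstruct, test_frames):
    """Average squared envelope error across all frames of the target instances."""
    recon = reconstruct(test_frames)                 # reconstructed CC frames from the model
    return float(np.mean((recon - test_frames) ** 2))
```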


Reconstruction

[Plot: MSE (·10^−2) vs. target pitch (MIDI 60-72) for the AE and the CVAE]

▶ Both the AE and the CVAE reasonably reconstruct the target pitch (audio examples 6, 7, 8)

▶ The CVAE produces better reconstruction, especially when the target pitch is far from the pitches available in the training data (plot above; audio examples 9, 10, 11)

▶ Conditioning helps to capture the pitch dependency of the spectral envelope more accurately


Reconstruction

▶ To emulate the effect of pitch conditioning with an AE, we train the AE by appending the pitch to the input CCs and reconstructing this appended input, as shown below

[Block diagram: (CCs, f0) → cAE → (CCs′, f0′)]

Figure: 'Conditional' AE (cAE)

▶ [Wyse 2018] followed a similar approach of appending the conditional variables to the input of his model

▶ The 'cAE' is comparable to our proposed CVAE in that the network might potentially learn something from f0
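For illustration, the cAE input/target could be formed as below (a sketch; the pitch scaling is an assumption):

```python
import numpy as np

def cae_frame(ccs, f0, f0_scale=1000.0):
    """Append a (scaled) pitch value to the 91-d CC frame; the cAE autoencodes this
    92-d vector, so the reconstruction target includes f0 as well."""
    return np.concatenate([ccs, [f0 / f0_scale]])   # 92-d input = output target
```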


Reconstruction

[Plot: MSE (·10^−2) vs. target pitch (MIDI 60-72) for the AE, CVAE, and cAE]

▶ We train the cAE on the octave endpoints and evaluate performance as shown above

▶ Appending f0 does not improve performance; it is in fact worse than the AE


Generation

▶ We are interested in 'synthesizing' audio

▶ We see how well the network can generate an instance of a desired pitch (which the network has not been trained on)

[Block diagram: Z + f0 → Decoder → CCs]

Figure: Sampling from the network

▶ We follow the procedure in [Blaauw and Bonada 2016] - a random walk to sample points coherently from the latent space (a sketch follows below)
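A minimal sketch of such coherent sampling, reusing the decoder from the CVAE sketch above; the step size is an illustrative choice:

```python
import torch

@torch.no_grad()
def random_walk_generate(decoder, f0, n_frames, z_dim=32, step=0.05):
    """Generate a sequence of envelope frames by a slow random walk in latent space,
    decoding each point with the desired (possibly unseen) pitch as the condition."""
    z = torch.zeros(z_dim)
    frames = []
    for _ in range(n_frames):
        z = z + step * torch.randn(z_dim)                     # small coherent latent step
        x_hat = decoder(torch.cat([z, torch.tensor([f0])]))   # CC frame for this step
        frames.append(x_hat)
    return torch.stack(frames)
```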


Generation

▶ We train on instances across the entire octave sans MIDI 65, and then generate MIDI 65

▶ On listening, we can see that the generated audio still lacks the soft, noisy sound of the violin bowing (audio examples 12, 13, 14)

▶ For a more realistic synthesis we generate a violin note with vibrato (audio example 15); a resynthesis sketch follows below
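To connect generated envelopes back to sound, here is a hedged sketch of harmonic resynthesis from an envelope and an f0 contour with vibrato; envelope_db is a hypothetical vectorized envelope function, and this only illustrates the parametric resynthesis idea:

```python
import numpy as np

def resynthesize(envelope_db, f0_contour, fs=48000, hop=256):
    """Additive resynthesis: sample the (dB) spectral envelope at the harmonics of a
    time-varying f0 and sum sinusoids. envelope_db maps frequency (Hz, array) -> dB."""
    n = len(f0_contour) * hop
    f0 = np.repeat(f0_contour, hop)                    # sample-rate f0 track (with vibrato)
    phase = 2 * np.pi * np.cumsum(f0) / fs             # running phase of the fundamental
    out = np.zeros(n)
    n_harm = int(fs / 2 / np.max(f0))                  # harmonics kept below Nyquist
    for h in range(1, n_harm + 1):
        amp = 10 ** (envelope_db(h * f0) / 20.0)       # harmonic amplitude from the envelope
        out += amp * np.sin(h * phase)
    return out / np.max(np.abs(out))                   # simple peak normalization

# Example vibrato contour (6 Hz, +/- 1% depth) around MIDI 65 (~349.2 Hz):
# f0_contour = 349.2 * (1 + 0.01 * np.sin(2 * np.pi * 6 * np.arange(n_frames) * 256 / 48000))
```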


Putting it all together

▶ We explored autoencoder frameworks in generative models for audio synthesis of instrumental tones

▶ Our parametric representation decouples 'timbre' and 'pitch', relying on the network to model their inter-dependencies

▶ Pitch conditioning allows us to generate the learnt spectral envelope for that pitch, thus enabling us to vary the pitch contour continuously


Putting it all together

▶ Our model is still far from perfect, however, and we have to improve it further:

1. An immediate next step is to model the residual of the input audio; this will help make the generated audio sound more natural

2. We have not taken dynamics into account; a more complete system will involve capturing spectral envelope (timbral) dependencies on both pitch and loudness dynamics

3. Conducting more formal listening tests involving synthesis of larger pitch movements, or melodic elements from Indian Classical music

▶ Our contributions:

1. To the best of our knowledge, we have not come across any work using a parametric model for musical tones in the neural synthesis framework, especially one exploiting the conditioning function of the CVAE

2. The TAE method has no known Python implementation, so we plan to make our code open-source to the MIR community for research


References I

[Blaauw and Bonada 2016] Blaauw, M. and Bonada, J. (2016). Modeling and transforming speech using variational autoencoders. In Interspeech, pages 1770–1774.

[Caetano and Rodet 2012] Caetano, M. and Rodet, X. (2012). A source-filter model for musical instrument sound transformation. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 137–140. IEEE.

[Doersch 2016] Doersch, C. (2016). Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.

[Engel et al 2017] Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D., and Simonyan, K. (2017). Neural audio synthesis of musical notes with WaveNet autoencoders. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1068–1077. JMLR.org.

[Esling et al 2018] Esling, P., Bitton, A., et al. (2018). Generative timbre spaces: regularizing variational auto-encoders with perceptual metrics. arXiv preprint arXiv:1805.08501.


References II

[Griffin and Lim 1984] Griffin, D. and Lim, J. (1984). Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236–243.

[Hinton and Salakhutdinov 2006] Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507.

[Imai 1979] Imai, S. (1979). Spectral envelope extraction by improved cepstrum. IEICE, 62:217–228.

[Kingma and Ba 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

[Kingma and Welling 2013] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

[Oord et al 2016] Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.


References III

[Roche et al 2018] Roche, F., Hueber, T., Limier, S., and Girin, L. (2018). Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models. arXiv preprint arXiv:1806.04096.

[Roebel and Rodet 2005] Roebel, A. and Rodet, X. (2005). Efficient spectral envelope estimation and its application to pitch shifting and envelope preservation. In International Conference on Digital Audio Effects, pages 30–35, Madrid, Spain.

[Romani Picas et al 2015] Romani Picas, O., Parra Rodriguez, H., Dabiri, D., Tokuda, H., Hariya, W., Oishi, K., and Serra, X. (2015). A real-time system for measuring sound goodness in instrumental sounds. In Audio Engineering Society Convention 138. Audio Engineering Society.

[Sarroff and Casey 2014] Sarroff, A. M. and Casey, M. A. (2014). Musical audio synthesis using autoencoding neural nets. In ICMC.

[Serra et al 1997] Serra, X. et al. (1997). Musical sound modeling with sinusoids plus noise. Musical Signal Processing, pages 91–122.


References IV

[Slawson 1981] Slawson, W. (1981). The color of sound: a theoretical study in musical timbre. Music Theory Spectrum, 3:132–141.

[Sohn et al 2015] Sohn, K., Lee, H., and Yan, X. (2015). Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483–3491.

[Wyse 2018] Wyse, L. (2018). Real-valued parametric conditioning of an RNN for interactive sound synthesis. arXiv preprint arXiv:1805.10808.


Audio examples description I

1. Input MIDI 63 to the Spectral Model

2. Spectral Model reconstruction (trained on MIDI 63)

3. Spectral Model reconstruction (not trained on MIDI 63)

4. Input MIDI 60 note to the Parametric Model

5. Parametric reconstruction of the input note

6. Input MIDI 63 note

7. Parametric AE reconstruction of the input

8. Parametric CVAE reconstruction of the input

9. Input MIDI 65 note (endpoint-trained model)

10. Parametric AE reconstruction of the input (endpoint-trained model)

11. Parametric CVAE reconstruction of the input (endpoint-trained model)

12. CVAE-generated MIDI 65 violin note

13. Similar MIDI 65 violin note from the dataset


Audio examples description II

14. CVAE reconstruction of the MIDI 65 violin note

15. CVAE-generated MIDI 65 violin note with vibrato

Page 37: A Variational Parametric Model for Audio Synthesiskrishnasubramani/data/ddp_stage… · analysis-synthesis based reconstruction which fails to model temporal evolution I[Engel et

1236

Parametric Model

2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]

I TAE =rArr Iterative Cepstral Liftering

A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))

Vi = FFT (lifter(Ai ))

f0Magnitudes

TAE

CCs

f0

1236

Parametric Model

2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]

I TAE =rArr Iterative Cepstral Liftering

A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))

Vi = FFT (lifter(Ai ))

f0Magnitudes

TAE

CCs

f0

1336

Parametric Model

I No open source implementation available for the TAE thus weimplemented it following procedure highlighted in[Roebel and Rodet 2005 Caetano and Rodet 2012]

I Figure shows a TAE snap from [Caetano and Rodet 2012]

I Similar to results we get 1 4 2 5

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton6)ocgs[i]state=false

3216

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton7)ocgs[i]state=false

2256

1436

Parametric Model

0 1 2 3 4 5minus80

minus60

minus40

minus20

Frequency(kHz)Magnitude(dB)

606571

I Spectral envelope shape varies across pitch

1 Dependence of envelope on pitch[Slawson 1981 Caetano and Rodet 2012]

2 Variation due the TAE algorithm

I TAE rarr smooth function to estimate harmonic amplitudes

1536

Parametric Model

I No of CCs(Cepstral CoefficientsKcc) depends on Samplingrate(Fs)pitch(f0)

Kcc leFs2f0

I For our network we choose the maximum Kcc(lowest pitch) =91 and zero pad the high pitch Kcc to 91 dimension vectors

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1736

Generative Models

Why VAE over AE

Continuous latent space from which we can sample points(andsynthesize the corresponding audio)

Why CVAE over VAE

Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control

1736

Generative Models

Why VAE over AE

Continuous latent space from which we can sample points(andsynthesize the corresponding audio)

Why CVAE over VAE

Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control

1736

Generative Models

Why VAE over AE

Continuous latent space from which we can sample points(andsynthesize the corresponding audio)

Why CVAE over VAE

Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control

1736

Generative Models

Why VAE over AE

Continuous latent space from which we can sample points(andsynthesize the corresponding audio)

Why CVAE over VAE

Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control

1836

Network Architecture

CCs

f0

En

cod

er

Z

Dec

od

er

CVAE

I Network input is CCs rarr MSE represents perceptually relevantdistance in terms of squared error between the input andreconstructed log magnitude spectral envelopes

I We train the network on frames from the train instances

I For evaluation MSE values calculated ahead is the averagereconstruction error across all the test instance frames

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

L prop MSE + βKLD

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

2036

Network Architecture

I Main hyperparameters -

2 Dimensionality of latent space - networks reconstruction ability

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

CVAEAE

Figure CVAE(β = 01) vs AE

I Steep fall initially flatter later Choose dimensionality = 32

2036

Network Architecture

I Main hyperparameters -

2 Dimensionality of latent space - networks reconstruction ability

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

CVAEAE

Figure CVAE(β = 01) vs AE

I Steep fall initially flatter later Choose dimensionality = 32

2036

Network Architecture

I Main hyperparameters -

2 Dimensionality of latent space - networks reconstruction ability

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

CVAEAE

Figure CVAE(β = 01) vs AE

I Steep fall initially flatter later Choose dimensionality = 32

2136

Network Architecture

I Network size [91 91 32 91 91]I 91 is the dimension of input CCs 32 is latent space

dimensionality

I Linear Fully Connected Layers + Leaky ReLU activations

I ADAM [Kingma and Ba 2014] with initial lr = 10minus3

I Training for 2000 epochs with batch size 512

2236

Experiments

I Two kinds of experiments to demonstrate networks capabilities

1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch

2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches

2236

Experiments

I Two kinds of experiments to demonstrate networks capabilities

1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch

2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches

2236

Experiments

I Two kinds of experiments to demonstrate networks capabilities

1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch

2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2436

Reconstruction

60 62 64 66 68 70 72

1

2

3

4

5middot10minus2

Pitch

MSE

AECVAE

I Both AE and CVAE reasonably reconstruct the target pitch1 6 2 7 3 8

I CVAE produces better reconstruction especially when thetarget pitch is far from the pitches available in the trainingdata(plot above) 4 9 5 10 6 11

I Conditioning helps to capture the pitch dependency of thespectral envelope more accurately

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton8)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton9)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton10)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton11)ocgs[i]state=false

1536

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton12)ocgs[i]state=false

1536

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton13)ocgs[i]state=false

1536

2536

Reconstruction

I To emulate the effect of pitch conditioning with an AE wetrain the AE by appending the pitch to the input CCs andreconstructing this appended input as shown below

f0

CCscAE

fprime

0

CCsprime

Figure lsquoConditionalrsquo AE(cAE)

I [Wyse 2018] followed a similar approach of appending theconditional variables to the input of his model

I the lsquocAErsquo is comparable to our proposed CVAE in that thenetwork might potentially learn something from f0

2636

Reconstruction

60 62 64 66 68 70 72

1

2

3

4

5middot10minus2

Pitch

MSE

AECVAEcAE

I We train the cAE on the octave endpoints and evaluateperformance as shown above

I Appending f0 does not improve performance It is infact worsethan AE

2736

Generation

I We are interested in lsquosynthesizingrsquo audio

I We see how well the network can generate an instance of adesired pitch(which the network has not been trained on)

Z

Dec

od

er

f0

Figure Sampling from the Network

I We follow the procedure in [Blaauw and Bonada 2016] - arandom walk to sample points coherently from the latent space

2836

Generation

I We train on instances across the entire octave sans MIDI 65and then generate MIDI 65

I On listening we can see that the generated audio still lacks thesoft noisy sound of the violin bowing

1 12 2 13 3 14

I For a more realistic synthesis we generate a violin note withvibrato 4 15

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton14)ocgs[i]state=false

2184

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton15)ocgs[i]state=false

2256

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton16)ocgs[i]state=false

2184

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton17)ocgs[i]state=false

2184

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3136

References I

[Blaauw and Bonada 2016] Blaauw M and Bonada J (2016)Modeling and transforming speech using variational autoencodersIn Interspeech pages 1770ndash1774

[Caetano and Rodet 2012] Caetano M and Rodet X (2012)A source-filter model for musical instrument sound transformationIn 2012 IEEE International Conference on Acoustics Speech and Signal Processing(ICASSP) pages 137ndash140 IEEE

[Doersch 2016] Doersch C (2016)Tutorial on variational autoencodersarXiv preprint arXiv160605908

[Engel et al 2017] Engel J Resnick C Roberts A Dieleman S Norouzi MEck D and Simonyan K (2017)Neural audio synthesis of musical notes with wavenet autoencodersIn Proceedings of the 34th International Conference on Machine Learning-Volume70 pages 1068ndash1077 JMLR org

[Esling et al 2018] Esling P Bitton A et al (2018)Generative timbre spaces regularizing variational auto-encoders with perceptualmetricsarXiv preprint arXiv180508501

3236

References II

[Griffin and Lim 1984] Griffin D and Lim J (1984)Signal estimation from modified short-time fourier transformIEEE Transactions on Acoustics Speech and Signal Processing 32(2)236ndash243

[Hinton and Salakhutdinov 2006] Hinton G E and Salakhutdinov R R (2006)Reducing the dimensionality of data with neural networksscience 313(5786)504ndash507

[IMAI 1979] IMAI S (1979)Spectral envelope extraction by improved cepstrumIEICE 62217ndash228

[Kingma and Ba 2014] Kingma D P and Ba J (2014)Adam A method for stochastic optimizationarXiv preprint arXiv14126980

[Kingma and Welling 2013] Kingma D P and Welling M (2013)Auto-encoding variational bayesarXiv preprint arXiv13126114

[Oord et al 2016] Oord A v d Dieleman S Zen H Simonyan K Vinyals OGraves A Kalchbrenner N Senior A and Kavukcuoglu K (2016)Wavenet A generative model for raw audioarXiv preprint arXiv160903499

3336

References III

[Roche et al 2018] Roche F Hueber T Limier S and Girin L (2018)Autoencoders for music sound modeling a comparison of linear shallow deeprecurrent and variational modelsarXiv preprint arXiv180604096

[Roebel and Rodet 2005] Roebel A and Rodet X (2005)Efficient Spectral Envelope Estimation and its application to pitch shifting andenvelope preservationIn International Conference on Digital Audio Effects pages 30ndash35 Madrid Spaincote interne IRCAM Roebel05b

[Romani Picas et al 2015] Romani Picas O Parra Rodriguez H Dabiri DTokuda H Hariya W Oishi K and Serra X (2015)A real-time system for measuring sound goodness in instrumental soundsIn Audio Engineering Society Convention 138 Audio Engineering Society

[Sarroff and Casey 2014] Sarroff A M and Casey M A (2014)Musical audio synthesis using autoencoding neural netsIn ICMC

[Serra et al 1997] Serra X et al (1997)Musical sound modeling with sinusoids plus noiseMusical signal processing pages 91ndash122

3436

References IV

[Slawson 1981] Slawson W (1981)The color of sound a theoretical study in musical timbreMusic Theory Spectrum 3132ndash141

[Sohn et al 2015] Sohn K Lee H and Yan X (2015)Learning structured output representation using deep conditional generative models

In Advances in neural information processing systems pages 3483ndash3491

[Wyse 2018] Wyse L (2018)Real-valued parametric conditioning of an rnn for interactive sound synthesisarXiv preprint arXiv180510808

3536

Audio examples description I

1 Input MIDI 63 to Spectral Model

2 Spectral Model Reconstruction(trained on MIDI63)

3 Spectral Model Reconstruction(not trained on MIDI63)

4 Input MIDI 60 note to Parametric Model

5 Parametric Reconstruction of input note

6 Input MIDI 63 Note

7 Parametric AE reconstruction of input

8 Parametric CVAE reconstruction of input

9 Input MIDI 65 note(endpoint trained model)

10 Parametric AE reconstruction of input(endpoint trained model)

11 Parametric CVAE reconstruction of input(endpoint trained model)

12 CVAE Generated MIDI 65 Violin note

13 Similar MIDI 65 Violin note from dataset

3636

Audio examples description II

14 CVAE Reconstruction of the MIDI 65 violin note

15 CVAE Generated MIDI 65 Violin note with vibrato

  • Notes
      1. fdrm17
      2. fdrm16
      3. fdrm15
      4. fdrm14
      5. fdrm13
      6. fdrm12
      7. fdrm11
      8. fdrm10
      9. fdrm9
      10. fdrm8
      11. fdrm7
      12. fdrm6
      13. fdrm5
      14. fdrm3
      15. fdrm1
Page 38: A Variational Parametric Model for Audio Synthesiskrishnasubramani/data/ddp_stage… · analysis-synthesis based reconstruction which fails to model temporal evolution I[Engel et

1236

Parametric Model

2 log-dB magnitudes + harmonics rarr TAE algorithm[Roebel and Rodet 2005 IMAI 1979]

I TAE =rArr Iterative Cepstral Liftering

A0(k) = log(|X (k)|)V0 = minusinfinwhile(Ai minus A0 lt ∆)Ai (k) = max(Aiminus1(k)Viminus1(k))

Vi = FFT (lifter(Ai ))

f0Magnitudes

TAE

CCs

f0

1336

Parametric Model

I No open source implementation available for the TAE thus weimplemented it following procedure highlighted in[Roebel and Rodet 2005 Caetano and Rodet 2012]

I Figure shows a TAE snap from [Caetano and Rodet 2012]

I Similar to results we get 1 4 2 5

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton6)ocgs[i]state=false

3216

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton7)ocgs[i]state=false

2256

1436

Parametric Model

0 1 2 3 4 5minus80

minus60

minus40

minus20

Frequency(kHz)Magnitude(dB)

606571

I Spectral envelope shape varies across pitch

1 Dependence of envelope on pitch[Slawson 1981 Caetano and Rodet 2012]

2 Variation due the TAE algorithm

I TAE rarr smooth function to estimate harmonic amplitudes

1536

Parametric Model

I No of CCs(Cepstral CoefficientsKcc) depends on Samplingrate(Fs)pitch(f0)

Kcc leFs2f0

I For our network we choose the maximum Kcc(lowest pitch) =91 and zero pad the high pitch Kcc to 91 dimension vectors

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable


1736

Generative Models

Why VAE over AE?

Continuous latent space from which we can sample points (and synthesize the corresponding audio)

Why CVAE over VAE?

Conditioning on pitch ⇒ the network captures dependencies between timbre and pitch ⇒ more accurate envelope generation + pitch control


1836

Network Architecture

Figure: CVAE block diagram - CCs and f0 → Encoder → Z → Decoder

I Network input is CCs → MSE represents a perceptually relevant distance: the squared error between the input and reconstructed log magnitude spectral envelopes (a one-line justification follows below)

I We train the network on frames from the train instances

I For evaluation, the MSE values reported ahead are the average reconstruction error across all the test instance frames
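As an aside (added here for clarity, not from the slides): squared error on CCs is a meaningful envelope distance because of the standard cepstral-distance identity. Under the real-cepstrum convention, the log magnitude envelope is the cosine expansion of the coefficients, and Parseval's relation ties the two squared errors together:

\log|H(\omega)| = c_0 + 2\sum_{k\ge 1} c_k \cos(k\omega)

\frac{1}{2\pi}\int_{-\pi}^{\pi}\big(\log|H_1(\omega)|-\log|H_2(\omega)|\big)^2\,d\omega = (c_{0,1}-c_{0,2})^2 + 2\sum_{k\ge 1}(c_{k,1}-c_{k,2})^2

So, up to the factor of 2 on the higher coefficients, MSE in the CC domain is MSE between the log magnitude envelopes.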


1936

Network Architecture

I Main hyperparameters -

1 β - Controls relative weighting between reconstruction and prior enforcement:

L ∝ MSE + β · KLD

Figure: CVAE, varying β - reconstruction MSE (·10⁻⁵) vs latent space dimensionality (0-70), for β = 0.01, 0.1, 1

I Tradeoff between both terms; choose β = 0.1 (a minimal sketch of this loss follows below)
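As a concrete illustration (not the exact training code), a minimal PyTorch sketch of this weighted objective, assuming a diagonal-Gaussian posterior with parameters mu and logvar and a mean reduction over the batch:

import torch

def cvae_loss(x_hat, x, mu, logvar, beta=0.1):
    # Reconstruction term: MSE between input and reconstructed CC vectors
    mse = torch.mean((x_hat - x) ** 2)
    # KL divergence of N(mu, diag(exp(logvar))) from the standard normal prior
    kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    # L ∝ MSE + β · KLD
    return mse + beta * kld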


2036

Network Architecture

I Main hyperparameters -

2 Dimensionality of latent space - affects the network's reconstruction ability

Figure: CVAE (β = 0.1) vs AE - reconstruction MSE (·10⁻⁵) vs latent space dimensionality (0-70)

I Steep fall initially, flatter later; choose dimensionality = 32


2136

Network Architecture

I Network size: [91, 91, 32, 91, 91]
I 91 is the dimension of the input CCs; 32 is the latent space dimensionality

I Linear Fully Connected Layers + Leaky ReLU activations

I ADAM [Kingma and Ba 2014] with initial lr = 10⁻³

I Training for 2000 epochs with batch size 512 (a sketch of this setup follows below)
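A minimal PyTorch sketch of such a network. The way f0 enters (appended as one scalar to both the encoder input and the latent code) and the single hidden layer on each side are assumptions made for illustration; only the sizes [91, 91, 32, 91, 91], the Leaky ReLU activations and the Adam settings come from the slide:

import torch
import torch.nn as nn

class CVAE(nn.Module):
    def __init__(self, cc_dim=91, hidden_dim=91, latent_dim=32, cond_dim=1):
        super().__init__()
        # Encoder: [CCs, f0] -> hidden -> (mu, logvar)
        self.enc = nn.Sequential(nn.Linear(cc_dim + cond_dim, hidden_dim), nn.LeakyReLU())
        self.to_mu = nn.Linear(hidden_dim, latent_dim)
        self.to_logvar = nn.Linear(hidden_dim, latent_dim)
        # Decoder: [z, f0] -> hidden -> reconstructed CCs
        self.dec = nn.Sequential(
            nn.Linear(latent_dim + cond_dim, hidden_dim), nn.LeakyReLU(),
            nn.Linear(hidden_dim, cc_dim),
        )

    def forward(self, cc, f0):
        h = self.enc(torch.cat([cc, f0], dim=-1))
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick: z = mu + sigma * eps
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        cc_hat = self.dec(torch.cat([z, f0], dim=-1))
        return cc_hat, mu, logvar

model = CVAE()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # ADAM, initial lr = 1e-3
# training would then loop for 2000 epochs over batches of 512 frames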

2236

Experiments

I Two kinds of experiments to demonstrate the network's capabilities

1 Reconstruction - Omit pitch instances during training and see how well the model reconstructs notes of the omitted target pitch

2 Generation - How well the model 'synthesizes' note instances with new, unseen pitches


2336

Reconstruction

I Two training contexts -

1 Neighbouring pitches - T (✗) is the target, ✓ marks training instances:
  MIDI: T-3  T-2  T-1   T   T+1  T+2  T+3
  Kept:  ✓    ✓    ✓    ✗    ✓    ✓    ✓

2 Octave endpoints - only MIDI 60 and 71 are kept for training (see the sketch after this list):
  MIDI: 60  61  62  63  64  65  66  67  68  69  70  71
  Kept:  ✓   ✗   ✗   ✗   ✗   ✗   ✗   ✗   ✗   ✗   ✗   ✓

I In each of the above cases, we compute the MSE as the frame-wise spectral envelope match across all frames of all the target instances
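For concreteness, a small sketch of how these two splits could be built, assuming the dataset is available as (midi_pitch, cc_frame) pairs; the variable names (frames, T) are illustrative:

def split_by_pitch(frames, kept_pitches):
    # frames: iterable of (midi_pitch, cc_frame); keep only kept_pitches for training
    train = [(p, f) for p, f in frames if p in kept_pitches]
    test = [(p, f) for p, f in frames if p not in kept_pitches]
    return train, test

# Context 1: target pitch T omitted, its neighbours kept
T = 63
train_1, test_1 = split_by_pitch(frames, {T - 3, T - 2, T - 1, T + 1, T + 2, T + 3})

# Context 2: octave endpoints only
train_2, test_2 = split_by_pitch(frames, {60, 71})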


2436

Reconstruction

Figure: Reconstruction MSE (·10⁻²) vs target pitch (MIDI 60-72) for AE and CVAE

I Both AE and CVAE reasonably reconstruct the target pitch (audio examples 6, 7, 8)

I CVAE produces better reconstruction, especially when the target pitch is far from the pitches available in the training data (plot above; audio examples 9, 10, 11)

I Conditioning helps capture the pitch dependency of the spectral envelope more accurately


2536

Reconstruction

I To emulate the effect of pitch conditioning with an AE, we train the AE by appending the pitch to the input CCs and reconstructing this appended input, as shown below

Figure: 'Conditional' AE (cAE) - appended input [CCs, f0] → cAE → reconstructed [CCs', f0']

I [Wyse 2018] followed a similar approach of appending the conditional variables to the input of his model

I The 'cAE' is comparable to our proposed CVAE in that the network might potentially learn something from f0 (a sketch of the appended input follows below)
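A minimal sketch of the appended input/target construction, assuming f0 enters as one extra scalar dimension (so the cAE is a plain autoencoder trained with MSE on the appended vectors); the names are illustrative:

import torch

def make_cae_pair(cc, f0):
    # Append f0 to the CC vector; the cAE's input and target are both this appended vector
    x = torch.cat([cc, f0], dim=-1)
    return x, x

# training then minimizes MSE(cae(x), x), exactly like the plain AE but on the appended vectors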

2636

Reconstruction

Figure: Reconstruction MSE (·10⁻²) vs target pitch (MIDI 60-72) for AE, CVAE and cAE

I We train the cAE on the octave endpoints and evaluate performance as shown above

I Appending f0 does not improve performance; it is in fact worse than the AE

2736

Generation

I We are interested in 'synthesizing' audio

I We see how well the network can generate an instance of a desired pitch (which the network has not been trained on)

Figure: Sampling from the network - Z and f0 → Decoder

I We follow the procedure in [Blaauw and Bonada 2016] - a random walk to sample points coherently from the latent space (sketched below)
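One way such coherent sampling could look, using the CVAE decoder sketched earlier: small Gaussian steps in latent space, decoded frame by frame at a fixed conditioning pitch. The starting point, step size and frame count are illustrative assumptions, not the exact settings of [Blaauw and Bonada 2016]:

import torch

@torch.no_grad()
def random_walk_generate(model, f0_value, n_frames=200, latent_dim=32, step=0.05):
    # Decode a slowly drifting latent trajectory at a fixed conditioning pitch
    z = torch.zeros(1, latent_dim)                      # start at the prior mean
    f0 = torch.tensor([[f0_value]], dtype=torch.float32)
    frames = []
    for _ in range(n_frames):
        z = z + step * torch.randn_like(z)              # small coherent step in latent space
        cc_hat = model.dec(torch.cat([z, f0], dim=-1))  # decoder conditioned on f0
        frames.append(cc_hat)
    return torch.cat(frames, dim=0)                     # sequence of spectral envelope CC frames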

2836

Generation

I We train on instances across the entire octave sans MIDI 65, and then generate MIDI 65

I On listening, we find that the generated audio still lacks the soft, noisy sound of the violin bowing (audio examples 12, 13, 14)

I For a more realistic synthesis, we generate a violin note with vibrato (audio example 15; a simple f0-contour sketch follows below)
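One simple way the vibrato could be realized, assumed here to be a sinusoidal modulation of the conditioning pitch contour around MIDI 65 (the rate, depth and frame-rate values are illustrative):

import numpy as np

def vibrato_f0(midi=65, rate_hz=6.0, depth_semitones=0.3, n_frames=200, frame_rate=172.0):
    # Per-frame f0 contour (in Hz) with sinusoidal vibrato around a centre MIDI pitch
    t = np.arange(n_frames) / frame_rate
    midi_contour = midi + depth_semitones * np.sin(2 * np.pi * rate_hz * t)
    return 440.0 * 2.0 ** ((midi_contour - 69) / 12)  # MIDI -> Hz

Each frame's f0 value would then condition the decoder while the latent point follows the same random walk as above.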


2936

Putting it all together

I We explored autoencoder frameworks in generative models for audio synthesis of instrumental tones

I Our parametric representation decouples 'timbre' and 'pitch', relying on the network to model the inter-dependencies

I Pitch conditioning allows us to generate the learnt spectral envelope for that pitch, thus enabling us to vary the pitch contour continuously


3036

Putting it all together

I Our model is still far from perfect, however, and we have to improve it further:

1 An immediate next step is to model the residual of the input audio; this will help make the generated audio sound more natural

2 We have not taken dynamics into account; a more complete system will involve capturing spectral envelope (timbral) dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis of larger pitch movements or melodic elements from Indian Classical music

I Our contributions

1 To the best of our knowledge, we have not come across any work using a parametric model for musical tones in the neural synthesis framework, especially one exploiting the conditioning function of the CVAE

2 The TAE method has no known Python implementation, so we plan to make our code open-source to the MIR community for research


3136

References I

[Blaauw and Bonada 2016] Blaauw, M. and Bonada, J. (2016). Modeling and transforming speech using variational autoencoders. In Interspeech, pages 1770–1774.

[Caetano and Rodet 2012] Caetano, M. and Rodet, X. (2012). A source-filter model for musical instrument sound transformation. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 137–140. IEEE.

[Doersch 2016] Doersch, C. (2016). Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.

[Engel et al 2017] Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D., and Simonyan, K. (2017). Neural audio synthesis of musical notes with WaveNet autoencoders. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1068–1077. JMLR.org.

[Esling et al 2018] Esling, P., Bitton, A., et al. (2018). Generative timbre spaces: regularizing variational auto-encoders with perceptual metrics. arXiv preprint arXiv:1805.08501.

3236

References II

[Griffin and Lim 1984] Griffin, D. and Lim, J. (1984). Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236–243.

[Hinton and Salakhutdinov 2006] Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507.

[IMAI 1979] Imai, S. (1979). Spectral envelope extraction by improved cepstrum. IEICE, 62:217–228.

[Kingma and Ba 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

[Kingma and Welling 2013] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

[Oord et al 2016] Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.

3336

References III

[Roche et al 2018] Roche, F., Hueber, T., Limier, S., and Girin, L. (2018). Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models. arXiv preprint arXiv:1806.04096.

[Roebel and Rodet 2005] Roebel, A. and Rodet, X. (2005). Efficient spectral envelope estimation and its application to pitch shifting and envelope preservation. In International Conference on Digital Audio Effects, pages 30–35, Madrid, Spain. Cote interne IRCAM: Roebel05b.

[Romani Picas et al 2015] Romani Picas, O., Parra Rodriguez, H., Dabiri, D., Tokuda, H., Hariya, W., Oishi, K., and Serra, X. (2015). A real-time system for measuring sound goodness in instrumental sounds. In Audio Engineering Society Convention 138. Audio Engineering Society.

[Sarroff and Casey 2014] Sarroff, A. M. and Casey, M. A. (2014). Musical audio synthesis using autoencoding neural nets. In ICMC.

[Serra et al 1997] Serra, X. et al. (1997). Musical sound modeling with sinusoids plus noise. Musical Signal Processing, pages 91–122.

3436

References IV

[Slawson 1981] Slawson, W. (1981). The color of sound: a theoretical study in musical timbre. Music Theory Spectrum, 3:132–141.

[Sohn et al 2015] Sohn, K., Lee, H., and Yan, X. (2015). Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483–3491.

[Wyse 2018] Wyse, L. (2018). Real-valued parametric conditioning of an RNN for interactive sound synthesis. arXiv preprint arXiv:1805.10808.

3536

Audio examples description I

1 Input MIDI 63 to Spectral Model

2 Spectral Model Reconstruction (trained on MIDI 63)

3 Spectral Model Reconstruction (not trained on MIDI 63)

4 Input MIDI 60 note to Parametric Model

5 Parametric Reconstruction of input note

6 Input MIDI 63 Note

7 Parametric AE reconstruction of input

8 Parametric CVAE reconstruction of input

9 Input MIDI 65 note (endpoint-trained model)

10 Parametric AE reconstruction of input (endpoint-trained model)

11 Parametric CVAE reconstruction of input (endpoint-trained model)

12 CVAE Generated MIDI 65 Violin note

13 Similar MIDI 65 Violin note from dataset

3636

Audio examples description II

14 CVAE Reconstruction of the MIDI 65 violin note

15 CVAE Generated MIDI 65 Violin note with vibrato

Page 40: A Variational Parametric Model for Audio Synthesiskrishnasubramani/data/ddp_stage… · analysis-synthesis based reconstruction which fails to model temporal evolution I[Engel et

1436

Parametric Model

0 1 2 3 4 5minus80

minus60

minus40

minus20

Frequency(kHz)Magnitude(dB)

606571

I Spectral envelope shape varies across pitch

1 Dependence of envelope on pitch[Slawson 1981 Caetano and Rodet 2012]

2 Variation due the TAE algorithm

I TAE rarr smooth function to estimate harmonic amplitudes

1536

Parametric Model

I No of CCs(Cepstral CoefficientsKcc) depends on Samplingrate(Fs)pitch(f0)

Kcc leFs2f0

I For our network we choose the maximum Kcc(lowest pitch) =91 and zero pad the high pitch Kcc to 91 dimension vectors

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1736

Generative Models

Why VAE over AE

Continuous latent space from which we can sample points(andsynthesize the corresponding audio)

Why CVAE over VAE

Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control

1736

Generative Models

Why VAE over AE

Continuous latent space from which we can sample points(andsynthesize the corresponding audio)

Why CVAE over VAE

Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control

1736

Generative Models

Why VAE over AE

Continuous latent space from which we can sample points(andsynthesize the corresponding audio)

Why CVAE over VAE

Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control

1736

Generative Models

Why VAE over AE

Continuous latent space from which we can sample points(andsynthesize the corresponding audio)

Why CVAE over VAE

Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control

1836

Network Architecture

CCs

f0

En

cod

er

Z

Dec

od

er

CVAE

I Network input is CCs rarr MSE represents perceptually relevantdistance in terms of squared error between the input andreconstructed log magnitude spectral envelopes

I We train the network on frames from the train instances

I For evaluation MSE values calculated ahead is the averagereconstruction error across all the test instance frames

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

L prop MSE + βKLD

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

2036

Network Architecture

I Main hyperparameters -

2 Dimensionality of latent space - networks reconstruction ability

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

CVAEAE

Figure CVAE(β = 01) vs AE

I Steep fall initially flatter later Choose dimensionality = 32


21/36

Network Architecture

I Network size - [91, 91, 32, 91, 91]
I 91 is the dimension of the input CCs, 32 is the latent space dimensionality

I Linear Fully Connected Layers + Leaky ReLU activations

I ADAM [Kingma and Ba 2014] with initial lr = 10−3

I Training for 2000 epochs with batch size 512
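Putting these hyperparameters together, a hedged PyTorch sketch of the [91, 91, 32, 91, 91] CVAE; exactly how f0 is appended to the encoder and decoder inputs is an assumption made here only for illustration:

    import torch
    import torch.nn as nn

    class CVAE(nn.Module):
        def __init__(self, cc_dim=91, latent_dim=32, cond_dim=1):
            super().__init__()
            # Encoder: [CCs, f0] -> 91 -> (mu, logvar) of size 32
            self.enc = nn.Sequential(nn.Linear(cc_dim + cond_dim, 91), nn.LeakyReLU())
            self.enc_mu = nn.Linear(91, latent_dim)
            self.enc_logvar = nn.Linear(91, latent_dim)
            # Decoder: [Z, f0] -> 91 -> reconstructed CCs
            self.dec = nn.Sequential(nn.Linear(latent_dim + cond_dim, 91), nn.LeakyReLU(),
                                     nn.Linear(91, cc_dim))

        def forward(self, x, f0):
            # x: (batch, 91) CC frames, f0: (batch, 1) conditioning pitch
            h = self.enc(torch.cat([x, f0], dim=-1))
            mu, logvar = self.enc_mu(h), self.enc_logvar(h)
            z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization
            return self.dec(torch.cat([z, f0], dim=-1)), mu, logvar

    model = CVAE()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # initial lr = 10^-3, as above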

22/36

Experiments

I Two kinds of experiments to demonstrate the network's capabilities

1 Reconstruction - Omit pitch instances during training and see how well the model reconstructs notes of the omitted target pitch

2 Generation - How well the model 'synthesizes' note instances with new, unseen pitches


23/36

Reconstruction

I Two training contexts (sketched in code below) -

1 T (×) is the target, X marks training instances
MIDI: T−3  T−2  T−1   T   T+1  T+2  T+3
Kept:  X    X    X    ×    X    X    X

2 Octave endpoints
MIDI: 60  61  62  63  64  65  66  67  68  69  70  71
Kept:  X   ×   ×   ×   ×   ×   ×   ×   ×   ×   ×   X

I In each of the above cases, we compute the MSE as the frame-wise spectral envelope match across all frames of all the target instances
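A small sketch of how the two training contexts could be set up (the MIDI numbers come from the tables above; the actual data loading is assumed to exist elsewhere):

    # 1. Neighbouring-pitch context: hold out one target T, train on the pitches around it
    T = 63
    train_neighbour = [p for p in range(T - 3, T + 4) if p != T]

    # 2. Octave-endpoints context: train only on MIDI 60 and 71, hold out everything between
    octave = list(range(60, 72))
    train_endpoints = [60, 71]
    targets_endpoints = [p for p in octave if p not in train_endpoints]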


24/36

Reconstruction

[Plot: MSE (·10−2) vs target pitch (MIDI 60-72) for the AE and the CVAE]

I Both AE and CVAE reasonably reconstruct the target pitch (audio examples 6-8)

I CVAE produces better reconstruction, especially when the target pitch is far from the pitches available in the training data (plot above; audio examples 9-11)

I Conditioning helps to capture the pitch dependency of the spectral envelope more accurately


25/36

Reconstruction

I To emulate the effect of pitch conditioning with an AE, we train the AE by appending the pitch to the input CCs and reconstructing this appended input as shown below

[Diagram: f0 and CCs enter the cAE, which reconstructs f0′ and CCs′]
Figure: 'Conditional' AE (cAE)

I [Wyse 2018] followed a similar approach of appending the conditional variables to the input of his model

I The 'cAE' is comparable to our proposed CVAE in that the network might potentially learn something from f0
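A minimal sketch of such a cAE baseline (layer sizes here simply mirror our CVAE sketch and are assumptions; the point is only that f0 is appended to the input and reconstructed along with the CCs):

    import torch
    import torch.nn as nn

    class CAE(nn.Module):
        def __init__(self, cc_dim=91, latent_dim=32):
            super().__init__()
            self.enc = nn.Sequential(nn.Linear(cc_dim + 1, 91), nn.LeakyReLU(),
                                     nn.Linear(91, latent_dim))
            self.dec = nn.Sequential(nn.Linear(latent_dim, 91), nn.LeakyReLU(),
                                     nn.Linear(91, cc_dim + 1))

        def forward(self, ccs, f0):
            x = torch.cat([ccs, f0], dim=-1)   # appended input [CCs, f0]
            return self.dec(self.enc(x))       # network reconstructs [CCs', f0']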

26/36

Reconstruction

[Plot: MSE (·10−2) vs target pitch (MIDI 60-72) for the AE, CVAE and cAE]

I We train the cAE on the octave endpoints and evaluate performance as shown above

I Appending f0 does not improve performance; it is in fact worse than the AE

27/36

Generation

I We are interested in 'synthesizing' audio

I We see how well the network can generate an instance of a desired pitch (which the network has not been trained on)

[Diagram: a sampled Z together with f0 is fed to the Decoder]
Figure: Sampling from the Network

I We follow the procedure in [Blaauw and Bonada 2016] - a random walk to sample points coherently from the latent space
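A minimal sketch of this sampling procedure, reusing the CVAE sketch from earlier (the step size and number of frames are illustrative assumptions):

    import torch

    def generate_frames(model, f0, n_frames=200, step=0.05, latent_dim=32):
        frames = []
        z = torch.zeros(1, latent_dim)                 # start at the prior mean
        with torch.no_grad():
            for _ in range(n_frames):
                z = z + step * torch.randn_like(z)     # small step: coherent random walk
                frames.append(model.dec(torch.cat([z, f0], dim=-1)))
        return torch.cat(frames, dim=0)                # frame-wise CCs to synthesize from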

28/36

Generation

I We train on instances across the entire octave sans MIDI 65, and then generate MIDI 65

I On listening, we find that the generated audio still lacks the soft noisy sound of the violin bowing (audio examples 12-14)

I For a more realistic synthesis, we generate a violin note with vibrato (audio example 15)
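As an illustration of driving the decoder with a continuously varying pitch, a sketch of a vibrato f0 contour for MIDI 65 (the vibrato rate, depth and frame rate are assumptions, not the values used in our experiments):

    import numpy as np

    frame_rate = 100                                     # envelope frames per second (assumed)
    t = np.arange(int(2.0 * frame_rate)) / frame_rate    # 2-second note
    f0_midi = 65 + 0.3 * np.sin(2 * np.pi * 5.5 * t)     # ~5.5 Hz vibrato, +/- 0.3 semitones
    f0_hz = 440.0 * 2 ** ((f0_midi - 69) / 12)           # per-frame conditioning pitch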


29/36

Putting it all together

I We explored autoencoder frameworks in generative models for audio synthesis of instrumental tones

I Our parametric representation decouples 'timbre' and 'pitch', thus relying on the network to model the inter-dependencies

I Pitch conditioning allows us to generate the learnt spectral envelope for that pitch, thus enabling us to vary the pitch contour continuously


30/36

Putting it all together
I Our model is still far from perfect, however, and we have to improve it further

1 An immediate thing to do will be to model the residual of the input audio. This will help in making the generated audio sound more natural

2 We have not taken dynamics into account. A more complete system will involve capturing spectral envelope (timbral) dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis of larger pitch movements or melodic elements from Indian Classical music

I Our contributions

1 To the best of our knowledge, we have not come across any work using a parametric model for musical tones in the neural synthesis framework, especially exploiting the conditioning function of the CVAE

2 The TAE method has no known Python implementation, so we plan to make our code open-source to the MIR community for research


31/36

References I

[Blaauw and Bonada 2016] Blaauw, M. and Bonada, J. (2016). Modeling and transforming speech using variational autoencoders. In Interspeech, pages 1770–1774.

[Caetano and Rodet 2012] Caetano, M. and Rodet, X. (2012). A source-filter model for musical instrument sound transformation. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 137–140. IEEE.

[Doersch 2016] Doersch, C. (2016). Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.

[Engel et al 2017] Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D., and Simonyan, K. (2017). Neural audio synthesis of musical notes with wavenet autoencoders. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1068–1077. JMLR.org.

[Esling et al 2018] Esling, P., Bitton, A., et al. (2018). Generative timbre spaces: regularizing variational auto-encoders with perceptual metrics. arXiv preprint arXiv:1805.08501.

32/36

References II

[Griffin and Lim 1984] Griffin, D. and Lim, J. (1984). Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236–243.

[Hinton and Salakhutdinov 2006] Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507.

[IMAI 1979] Imai, S. (1979). Spectral envelope extraction by improved cepstrum. IEICE, 62:217–228.

[Kingma and Ba 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

[Kingma and Welling 2013] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.

[Oord et al 2016] Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. (2016). Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.

33/36

References III

[Roche et al 2018] Roche, F., Hueber, T., Limier, S., and Girin, L. (2018). Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models. arXiv preprint arXiv:1806.04096.

[Roebel and Rodet 2005] Roebel, A. and Rodet, X. (2005). Efficient Spectral Envelope Estimation and its application to pitch shifting and envelope preservation. In International Conference on Digital Audio Effects, pages 30–35, Madrid, Spain.

[Romani Picas et al 2015] Romani Picas, O., Parra Rodriguez, H., Dabiri, D., Tokuda, H., Hariya, W., Oishi, K., and Serra, X. (2015). A real-time system for measuring sound goodness in instrumental sounds. In Audio Engineering Society Convention 138. Audio Engineering Society.

[Sarroff and Casey 2014] Sarroff, A. M. and Casey, M. A. (2014). Musical audio synthesis using autoencoding neural nets. In ICMC.

[Serra et al 1997] Serra, X. et al. (1997). Musical sound modeling with sinusoids plus noise. Musical signal processing, pages 91–122.

34/36

References IV

[Slawson 1981] Slawson, W. (1981). The color of sound: a theoretical study in musical timbre. Music Theory Spectrum, 3:132–141.

[Sohn et al 2015] Sohn, K., Lee, H., and Yan, X. (2015). Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483–3491.

[Wyse 2018] Wyse, L. (2018). Real-valued parametric conditioning of an RNN for interactive sound synthesis. arXiv preprint arXiv:1805.10808.

35/36

Audio examples description I

1 Input MIDI 63 to Spectral Model
2 Spectral Model Reconstruction (trained on MIDI 63)
3 Spectral Model Reconstruction (not trained on MIDI 63)
4 Input MIDI 60 note to Parametric Model
5 Parametric Reconstruction of input note
6 Input MIDI 63 Note
7 Parametric AE reconstruction of input
8 Parametric CVAE reconstruction of input
9 Input MIDI 65 note (endpoint trained model)
10 Parametric AE reconstruction of input (endpoint trained model)
11 Parametric CVAE reconstruction of input (endpoint trained model)
12 CVAE Generated MIDI 65 Violin note
13 Similar MIDI 65 Violin note from dataset

36/36

Audio examples description II

14 CVAE Reconstruction of the MIDI 65 violin note
15 CVAE Generated MIDI 65 Violin note with vibrato

Page 41: A Variational Parametric Model for Audio Synthesiskrishnasubramani/data/ddp_stage… · analysis-synthesis based reconstruction which fails to model temporal evolution I[Engel et

1536

Parametric Model

I No of CCs(Cepstral CoefficientsKcc) depends on Samplingrate(Fs)pitch(f0)

Kcc leFs2f0

I For our network we choose the maximum Kcc(lowest pitch) =91 and zero pad the high pitch Kcc to 91 dimension vectors

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1736

Generative Models

Why VAE over AE

Continuous latent space from which we can sample points(andsynthesize the corresponding audio)

Why CVAE over VAE

Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control

1736

Generative Models

Why VAE over AE

Continuous latent space from which we can sample points(andsynthesize the corresponding audio)

Why CVAE over VAE

Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control

1736

Generative Models

Why VAE over AE

Continuous latent space from which we can sample points(andsynthesize the corresponding audio)

Why CVAE over VAE

Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control

1736

Generative Models

Why VAE over AE

Continuous latent space from which we can sample points(andsynthesize the corresponding audio)

Why CVAE over VAE

Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control

1836

Network Architecture

CCs

f0

En

cod

er

Z

Dec

od

er

CVAE

I Network input is CCs rarr MSE represents perceptually relevantdistance in terms of squared error between the input andreconstructed log magnitude spectral envelopes

I We train the network on frames from the train instances

I For evaluation MSE values calculated ahead is the averagereconstruction error across all the test instance frames

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

L prop MSE + βKLD

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

2036

Network Architecture

I Main hyperparameters -

2 Dimensionality of latent space - networks reconstruction ability

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

CVAEAE

Figure CVAE(β = 01) vs AE

I Steep fall initially flatter later Choose dimensionality = 32

2036

Network Architecture

I Main hyperparameters -

2 Dimensionality of latent space - networks reconstruction ability

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

CVAEAE

Figure CVAE(β = 01) vs AE

I Steep fall initially flatter later Choose dimensionality = 32

2036

Network Architecture

I Main hyperparameters -

2 Dimensionality of latent space - networks reconstruction ability

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

CVAEAE

Figure CVAE(β = 01) vs AE

I Steep fall initially flatter later Choose dimensionality = 32

2136

Network Architecture

I Network size [91 91 32 91 91]I 91 is the dimension of input CCs 32 is latent space

dimensionality

I Linear Fully Connected Layers + Leaky ReLU activations

I ADAM [Kingma and Ba 2014] with initial lr = 10minus3

I Training for 2000 epochs with batch size 512

2236

Experiments

I Two kinds of experiments to demonstrate networks capabilities

1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch

2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches

2236

Experiments

I Two kinds of experiments to demonstrate networks capabilities

1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch

2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches

2236

Experiments

I Two kinds of experiments to demonstrate networks capabilities

1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch

2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2436

Reconstruction

60 62 64 66 68 70 72

1

2

3

4

5middot10minus2

Pitch

MSE

AECVAE

I Both AE and CVAE reasonably reconstruct the target pitch1 6 2 7 3 8

I CVAE produces better reconstruction especially when thetarget pitch is far from the pitches available in the trainingdata(plot above) 4 9 5 10 6 11

I Conditioning helps to capture the pitch dependency of thespectral envelope more accurately

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton8)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton9)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton10)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton11)ocgs[i]state=false

1536

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton12)ocgs[i]state=false

1536

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton13)ocgs[i]state=false

1536

2536

Reconstruction

I To emulate the effect of pitch conditioning with an AE wetrain the AE by appending the pitch to the input CCs andreconstructing this appended input as shown below

f0

CCscAE

fprime

0

CCsprime

Figure lsquoConditionalrsquo AE(cAE)

I [Wyse 2018] followed a similar approach of appending theconditional variables to the input of his model

I the lsquocAErsquo is comparable to our proposed CVAE in that thenetwork might potentially learn something from f0

2636

Reconstruction

60 62 64 66 68 70 72

1

2

3

4

5middot10minus2

Pitch

MSE

AECVAEcAE

I We train the cAE on the octave endpoints and evaluateperformance as shown above

I Appending f0 does not improve performance It is infact worsethan AE

2736

Generation

I We are interested in lsquosynthesizingrsquo audio

I We see how well the network can generate an instance of adesired pitch(which the network has not been trained on)

Z

Dec

od

er

f0

Figure Sampling from the Network

I We follow the procedure in [Blaauw and Bonada 2016] - arandom walk to sample points coherently from the latent space

2836

Generation

I We train on instances across the entire octave sans MIDI 65and then generate MIDI 65

I On listening we can see that the generated audio still lacks thesoft noisy sound of the violin bowing

1 12 2 13 3 14

I For a more realistic synthesis we generate a violin note withvibrato 4 15

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton14)ocgs[i]state=false

2184

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton15)ocgs[i]state=false

2256

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton16)ocgs[i]state=false

2184

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton17)ocgs[i]state=false

2184

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3136

References I

[Blaauw and Bonada 2016] Blaauw M and Bonada J (2016)Modeling and transforming speech using variational autoencodersIn Interspeech pages 1770ndash1774

[Caetano and Rodet 2012] Caetano M and Rodet X (2012)A source-filter model for musical instrument sound transformationIn 2012 IEEE International Conference on Acoustics Speech and Signal Processing(ICASSP) pages 137ndash140 IEEE

[Doersch 2016] Doersch C (2016)Tutorial on variational autoencodersarXiv preprint arXiv160605908

[Engel et al 2017] Engel J Resnick C Roberts A Dieleman S Norouzi MEck D and Simonyan K (2017)Neural audio synthesis of musical notes with wavenet autoencodersIn Proceedings of the 34th International Conference on Machine Learning-Volume70 pages 1068ndash1077 JMLR org

[Esling et al 2018] Esling P Bitton A et al (2018)Generative timbre spaces regularizing variational auto-encoders with perceptualmetricsarXiv preprint arXiv180508501

3236

References II

[Griffin and Lim 1984] Griffin D and Lim J (1984)Signal estimation from modified short-time fourier transformIEEE Transactions on Acoustics Speech and Signal Processing 32(2)236ndash243

[Hinton and Salakhutdinov 2006] Hinton G E and Salakhutdinov R R (2006)Reducing the dimensionality of data with neural networksscience 313(5786)504ndash507

[IMAI 1979] IMAI S (1979)Spectral envelope extraction by improved cepstrumIEICE 62217ndash228

[Kingma and Ba 2014] Kingma D P and Ba J (2014)Adam A method for stochastic optimizationarXiv preprint arXiv14126980

[Kingma and Welling 2013] Kingma D P and Welling M (2013)Auto-encoding variational bayesarXiv preprint arXiv13126114

[Oord et al 2016] Oord A v d Dieleman S Zen H Simonyan K Vinyals OGraves A Kalchbrenner N Senior A and Kavukcuoglu K (2016)Wavenet A generative model for raw audioarXiv preprint arXiv160903499

3336

References III

[Roche et al 2018] Roche F Hueber T Limier S and Girin L (2018)Autoencoders for music sound modeling a comparison of linear shallow deeprecurrent and variational modelsarXiv preprint arXiv180604096

[Roebel and Rodet 2005] Roebel A and Rodet X (2005)Efficient Spectral Envelope Estimation and its application to pitch shifting andenvelope preservationIn International Conference on Digital Audio Effects pages 30ndash35 Madrid Spaincote interne IRCAM Roebel05b

[Romani Picas et al 2015] Romani Picas O Parra Rodriguez H Dabiri DTokuda H Hariya W Oishi K and Serra X (2015)A real-time system for measuring sound goodness in instrumental soundsIn Audio Engineering Society Convention 138 Audio Engineering Society

[Sarroff and Casey 2014] Sarroff A M and Casey M A (2014)Musical audio synthesis using autoencoding neural netsIn ICMC

[Serra et al 1997] Serra X et al (1997)Musical sound modeling with sinusoids plus noiseMusical signal processing pages 91ndash122

3436

References IV

[Slawson 1981] Slawson W (1981)The color of sound a theoretical study in musical timbreMusic Theory Spectrum 3132ndash141

[Sohn et al 2015] Sohn K Lee H and Yan X (2015)Learning structured output representation using deep conditional generative models

In Advances in neural information processing systems pages 3483ndash3491

[Wyse 2018] Wyse L (2018)Real-valued parametric conditioning of an rnn for interactive sound synthesisarXiv preprint arXiv180510808

3536

Audio examples description I

1 Input MIDI 63 to Spectral Model

2 Spectral Model Reconstruction(trained on MIDI63)

3 Spectral Model Reconstruction(not trained on MIDI63)

4 Input MIDI 60 note to Parametric Model

5 Parametric Reconstruction of input note

6 Input MIDI 63 Note

7 Parametric AE reconstruction of input

8 Parametric CVAE reconstruction of input

9 Input MIDI 65 note(endpoint trained model)

10 Parametric AE reconstruction of input(endpoint trained model)

11 Parametric CVAE reconstruction of input(endpoint trained model)

12 CVAE Generated MIDI 65 Violin note

13 Similar MIDI 65 Violin note from dataset

3636

Audio examples description II

14 CVAE Reconstruction of the MIDI 65 violin note

15 CVAE Generated MIDI 65 Violin note with vibrato

  • Notes
      1. fdrm17
      2. fdrm16
      3. fdrm15
      4. fdrm14
      5. fdrm13
      6. fdrm12
      7. fdrm11
      8. fdrm10
      9. fdrm9
      10. fdrm8
      11. fdrm7
      12. fdrm6
      13. fdrm5
      14. fdrm3
      15. fdrm1
Page 42: A Variational Parametric Model for Audio Synthesiskrishnasubramani/data/ddp_stage… · analysis-synthesis based reconstruction which fails to model temporal evolution I[Engel et

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1736

Generative Models

Why VAE over AE

Continuous latent space from which we can sample points(andsynthesize the corresponding audio)

Why CVAE over VAE

Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control

1736

Generative Models

Why VAE over AE

Continuous latent space from which we can sample points(andsynthesize the corresponding audio)

Why CVAE over VAE

Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control

1736

Generative Models

Why VAE over AE

Continuous latent space from which we can sample points(andsynthesize the corresponding audio)

Why CVAE over VAE

Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control

1736

Generative Models

Why VAE over AE

Continuous latent space from which we can sample points(andsynthesize the corresponding audio)

Why CVAE over VAE

Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control

1836

Network Architecture

CCs

f0

En

cod

er

Z

Dec

od

er

CVAE

I Network input is CCs rarr MSE represents perceptually relevantdistance in terms of squared error between the input andreconstructed log magnitude spectral envelopes

I We train the network on frames from the train instances

I For evaluation MSE values calculated ahead is the averagereconstruction error across all the test instance frames

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

L prop MSE + βKLD

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

2036

Network Architecture

I Main hyperparameters -

2 Dimensionality of latent space - networks reconstruction ability

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

CVAEAE

Figure CVAE(β = 01) vs AE

I Steep fall initially flatter later Choose dimensionality = 32

2036

Network Architecture

I Main hyperparameters -

2 Dimensionality of latent space - networks reconstruction ability

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

CVAEAE

Figure CVAE(β = 01) vs AE

I Steep fall initially flatter later Choose dimensionality = 32

2036

Network Architecture

I Main hyperparameters -

2 Dimensionality of latent space - networks reconstruction ability

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

CVAEAE

Figure CVAE(β = 01) vs AE

I Steep fall initially flatter later Choose dimensionality = 32

2136

Network Architecture

I Network size [91 91 32 91 91]I 91 is the dimension of input CCs 32 is latent space

dimensionality

I Linear Fully Connected Layers + Leaky ReLU activations

I ADAM [Kingma and Ba 2014] with initial lr = 10minus3

I Training for 2000 epochs with batch size 512

2236

Experiments

I Two kinds of experiments to demonstrate networks capabilities

1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch

2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches

2236

Experiments

I Two kinds of experiments to demonstrate networks capabilities

1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch

2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches

2236

Experiments

I Two kinds of experiments to demonstrate networks capabilities

1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch

2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2436

Reconstruction

60 62 64 66 68 70 72

1

2

3

4

5middot10minus2

Pitch

MSE

AECVAE

I Both AE and CVAE reasonably reconstruct the target pitch1 6 2 7 3 8

I CVAE produces better reconstruction especially when thetarget pitch is far from the pitches available in the trainingdata(plot above) 4 9 5 10 6 11

I Conditioning helps to capture the pitch dependency of thespectral envelope more accurately

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton8)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton9)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton10)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton11)ocgs[i]state=false

1536

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton12)ocgs[i]state=false

1536

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton13)ocgs[i]state=false

1536

2536

Reconstruction

I To emulate the effect of pitch conditioning with an AE wetrain the AE by appending the pitch to the input CCs andreconstructing this appended input as shown below

f0

CCscAE

fprime

0

CCsprime

Figure lsquoConditionalrsquo AE(cAE)

I [Wyse 2018] followed a similar approach of appending theconditional variables to the input of his model

I the lsquocAErsquo is comparable to our proposed CVAE in that thenetwork might potentially learn something from f0

2636

Reconstruction

60 62 64 66 68 70 72

1

2

3

4

5middot10minus2

Pitch

MSE

AECVAEcAE

I We train the cAE on the octave endpoints and evaluateperformance as shown above

I Appending f0 does not improve performance; it is in fact worse than the plain AE

27/36

Generation

I We are interested in 'synthesizing' audio

I We see how well the network can generate an instance of a desired pitch (which the network has not been trained on)

Figure: Sampling from the network — a latent vector Z and the conditioning pitch f0 are fed to the decoder

I We follow the procedure in [Blaauw and Bonada 2016] - a random walk to sample points coherently from the latent space
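A minimal sketch of such coherent sampling, assuming a trained decoder callable decoder(z, f0) that maps one latent frame plus a pitch value to a frame of CCs; the step size and contour handling are assumptions, not the exact procedure of [Blaauw and Bonada 2016]:

import torch

def sample_note(decoder, f0_contour, latent_dim=32, step=0.05):
    # random walk in the latent space: consecutive points stay close,
    # so the decoded spectral envelopes evolve smoothly from frame to frame
    frames = []
    z = torch.randn(latent_dim)                      # start from the prior N(0, I)
    for f0 in f0_contour:                            # one latent point per frame
        z = z + step * torch.randn(latent_dim)
        cc_frame = decoder(z.unsqueeze(0), torch.tensor([[f0]]))
        frames.append(cc_frame)
    return torch.cat(frames, dim=0)                  # (num_frames, num_ccs)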

28/36

Generation

I We train on instances across the entire octave sans MIDI 65, and then generate MIDI 65

I On listening, we find that the generated audio still lacks the soft, noisy sound of the violin bowing (audio examples 12, 13, 14)

I For a more realistic synthesis, we generate a violin note with vibrato (audio example 15)
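A minimal sketch of the vibrato conditioning contour, assuming frame-wise pitch control of the decoder; the vibrato rate and depth values are assumptions chosen for illustration:

import numpy as np

def vibrato_contour(midi=65, num_frames=400, frame_rate=200.0,
                    rate_hz=5.0, depth_semitones=0.3):
    # slow sinusoidal modulation of the note's pitch in the MIDI (semitone) domain
    t = np.arange(num_frames) / frame_rate
    midi_contour = midi + depth_semitones * np.sin(2 * np.pi * rate_hz * t)
    return 440.0 * 2.0 ** ((midi_contour - 69.0) / 12.0)   # convert MIDI to Hz

Each value of the returned f0 contour conditions the decoder for one frame, which is how the pitch contour can be varied continuously.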


29/36

Putting it all together

I We explored autoencoder frameworks in generative models for audio synthesis of instrumental tones

I Our parametric representation decouples 'timbre' and 'pitch', thus relying on the network to model the inter-dependencies

I Pitch conditioning allows us to generate the learnt spectral envelope for that pitch, thus enabling us to vary the pitch contour continuously


30/36

Putting it all together

I Our model is still far from perfect, however, and we have to improve it further:

1. An immediate next step is to model the residual of the input audio; this will help make the generated audio sound more natural

2. We have not taken dynamics into account. A more complete system will capture spectral envelope (timbral) dependencies on both pitch and loudness dynamics

3. Conducting more formal listening tests involving synthesis of larger pitch movements or melodic elements from Indian Classical music

I Our contributions

1. To the best of our knowledge, ours is the first work to use a parametric model for musical tones in the neural synthesis framework, especially exploiting the conditioning function of the CVAE

2. The TAE method has no known Python implementation, so we plan to make our code open-source to the MIR community for research


31/36

References I

[Blaauw and Bonada 2016] Blaauw, M. and Bonada, J. (2016). Modeling and transforming speech using variational autoencoders. In Interspeech, pages 1770–1774.

[Caetano and Rodet 2012] Caetano, M. and Rodet, X. (2012). A source-filter model for musical instrument sound transformation. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 137–140. IEEE.

[Doersch 2016] Doersch, C. (2016). Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.

[Engel et al 2017] Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D., and Simonyan, K. (2017). Neural audio synthesis of musical notes with WaveNet autoencoders. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1068–1077. JMLR.org.

[Esling et al 2018] Esling, P., Bitton, A., et al. (2018). Generative timbre spaces: regularizing variational auto-encoders with perceptual metrics. arXiv preprint arXiv:1805.08501.

32/36

References II

[Griffin and Lim 1984] Griffin, D. and Lim, J. (1984). Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236–243.

[Hinton and Salakhutdinov 2006] Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507.

[IMAI 1979] Imai, S. (1979). Spectral envelope extraction by improved cepstrum. IEICE, 62:217–228.

[Kingma and Ba 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

[Kingma and Welling 2013] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

[Oord et al 2016] Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.

33/36

References III

[Roche et al 2018] Roche, F., Hueber, T., Limier, S., and Girin, L. (2018). Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models. arXiv preprint arXiv:1806.04096.

[Roebel and Rodet 2005] Roebel, A. and Rodet, X. (2005). Efficient spectral envelope estimation and its application to pitch shifting and envelope preservation. In International Conference on Digital Audio Effects, pages 30–35, Madrid, Spain.

[Romani Picas et al 2015] Romani Picas, O., Parra Rodriguez, H., Dabiri, D., Tokuda, H., Hariya, W., Oishi, K., and Serra, X. (2015). A real-time system for measuring sound goodness in instrumental sounds. In Audio Engineering Society Convention 138. Audio Engineering Society.

[Sarroff and Casey 2014] Sarroff, A. M. and Casey, M. A. (2014). Musical audio synthesis using autoencoding neural nets. In ICMC.

[Serra et al 1997] Serra, X. et al. (1997). Musical sound modeling with sinusoids plus noise. Musical Signal Processing, pages 91–122.

34/36

References IV

[Slawson 1981] Slawson, W. (1981). The color of sound: a theoretical study in musical timbre. Music Theory Spectrum, 3:132–141.

[Sohn et al 2015] Sohn, K., Lee, H., and Yan, X. (2015). Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483–3491.

[Wyse 2018] Wyse, L. (2018). Real-valued parametric conditioning of an RNN for interactive sound synthesis. arXiv preprint arXiv:1805.10808.

35/36

Audio examples description I

1. Input MIDI 63 to Spectral Model
2. Spectral Model Reconstruction (trained on MIDI 63)
3. Spectral Model Reconstruction (not trained on MIDI 63)
4. Input MIDI 60 note to Parametric Model
5. Parametric Reconstruction of input note
6. Input MIDI 63 Note
7. Parametric AE reconstruction of input
8. Parametric CVAE reconstruction of input
9. Input MIDI 65 note (endpoint-trained model)
10. Parametric AE reconstruction of input (endpoint-trained model)
11. Parametric CVAE reconstruction of input (endpoint-trained model)
12. CVAE Generated MIDI 65 Violin note
13. Similar MIDI 65 Violin note from dataset

36/36

Audio examples description II

14. CVAE Reconstruction of the MIDI 65 violin note
15. CVAE Generated MIDI 65 Violin note with vibrato

Page 43: A Variational Parametric Model for Audio Synthesiskrishnasubramani/data/ddp_stage… · analysis-synthesis based reconstruction which fails to model temporal evolution I[Engel et

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1736

Generative Models

Why VAE over AE

Continuous latent space from which we can sample points(andsynthesize the corresponding audio)

Why CVAE over VAE

Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control

1736

Generative Models

Why VAE over AE

Continuous latent space from which we can sample points(andsynthesize the corresponding audio)

Why CVAE over VAE

Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control

1736

Generative Models

Why VAE over AE

Continuous latent space from which we can sample points(andsynthesize the corresponding audio)

Why CVAE over VAE

Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control

1736

Generative Models

Why VAE over AE

Continuous latent space from which we can sample points(andsynthesize the corresponding audio)

Why CVAE over VAE

Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control

1836

Network Architecture

CCs

f0

En

cod

er

Z

Dec

od

er

CVAE

I Network input is CCs rarr MSE represents perceptually relevantdistance in terms of squared error between the input andreconstructed log magnitude spectral envelopes

I We train the network on frames from the train instances

I For evaluation MSE values calculated ahead is the averagereconstruction error across all the test instance frames

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

L prop MSE + βKLD

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

2036

Network Architecture

I Main hyperparameters -

2 Dimensionality of latent space - networks reconstruction ability

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

CVAEAE

Figure CVAE(β = 01) vs AE

I Steep fall initially flatter later Choose dimensionality = 32

2036

Network Architecture

I Main hyperparameters -

2 Dimensionality of latent space - networks reconstruction ability

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

CVAEAE

Figure CVAE(β = 01) vs AE

I Steep fall initially flatter later Choose dimensionality = 32

2036

Network Architecture

I Main hyperparameters -

2 Dimensionality of latent space - networks reconstruction ability

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

CVAEAE

Figure CVAE(β = 01) vs AE

I Steep fall initially flatter later Choose dimensionality = 32

2136

Network Architecture

I Network size [91 91 32 91 91]I 91 is the dimension of input CCs 32 is latent space

dimensionality

I Linear Fully Connected Layers + Leaky ReLU activations

I ADAM [Kingma and Ba 2014] with initial lr = 10minus3

I Training for 2000 epochs with batch size 512

2236

Experiments

I Two kinds of experiments to demonstrate networks capabilities

1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch

2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches

2236

Experiments

I Two kinds of experiments to demonstrate networks capabilities

1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch

2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches

2236

Experiments

I Two kinds of experiments to demonstrate networks capabilities

1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch

2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2436

Reconstruction

60 62 64 66 68 70 72

1

2

3

4

5middot10minus2

Pitch

MSE

AECVAE

I Both AE and CVAE reasonably reconstruct the target pitch1 6 2 7 3 8

I CVAE produces better reconstruction especially when thetarget pitch is far from the pitches available in the trainingdata(plot above) 4 9 5 10 6 11

I Conditioning helps to capture the pitch dependency of thespectral envelope more accurately

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton8)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton9)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton10)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton11)ocgs[i]state=false

1536

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton12)ocgs[i]state=false

1536

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton13)ocgs[i]state=false

1536

2536

Reconstruction

I To emulate the effect of pitch conditioning with an AE wetrain the AE by appending the pitch to the input CCs andreconstructing this appended input as shown below

f0

CCscAE

fprime

0

CCsprime

Figure lsquoConditionalrsquo AE(cAE)

I [Wyse 2018] followed a similar approach of appending theconditional variables to the input of his model

I the lsquocAErsquo is comparable to our proposed CVAE in that thenetwork might potentially learn something from f0

2636

Reconstruction

60 62 64 66 68 70 72

1

2

3

4

5middot10minus2

Pitch

MSE

AECVAEcAE

I We train the cAE on the octave endpoints and evaluateperformance as shown above

I Appending f0 does not improve performance It is infact worsethan AE

2736

Generation

I We are interested in lsquosynthesizingrsquo audio

I We see how well the network can generate an instance of adesired pitch(which the network has not been trained on)

Z

Dec

od

er

f0

Figure Sampling from the Network

I We follow the procedure in [Blaauw and Bonada 2016] - arandom walk to sample points coherently from the latent space

2836

Generation

I We train on instances across the entire octave sans MIDI 65and then generate MIDI 65

I On listening we can see that the generated audio still lacks thesoft noisy sound of the violin bowing

1 12 2 13 3 14

I For a more realistic synthesis we generate a violin note withvibrato 4 15

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton14)ocgs[i]state=false

2184

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton15)ocgs[i]state=false

2256

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton16)ocgs[i]state=false

2184

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton17)ocgs[i]state=false

2184

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3136

References I

[Blaauw and Bonada 2016] Blaauw M and Bonada J (2016)Modeling and transforming speech using variational autoencodersIn Interspeech pages 1770ndash1774

[Caetano and Rodet 2012] Caetano M and Rodet X (2012)A source-filter model for musical instrument sound transformationIn 2012 IEEE International Conference on Acoustics Speech and Signal Processing(ICASSP) pages 137ndash140 IEEE

[Doersch 2016] Doersch C (2016)Tutorial on variational autoencodersarXiv preprint arXiv160605908

[Engel et al 2017] Engel J Resnick C Roberts A Dieleman S Norouzi MEck D and Simonyan K (2017)Neural audio synthesis of musical notes with wavenet autoencodersIn Proceedings of the 34th International Conference on Machine Learning-Volume70 pages 1068ndash1077 JMLR org

[Esling et al 2018] Esling P Bitton A et al (2018)Generative timbre spaces regularizing variational auto-encoders with perceptualmetricsarXiv preprint arXiv180508501

3236

References II

[Griffin and Lim 1984] Griffin D and Lim J (1984)Signal estimation from modified short-time fourier transformIEEE Transactions on Acoustics Speech and Signal Processing 32(2)236ndash243

[Hinton and Salakhutdinov 2006] Hinton G E and Salakhutdinov R R (2006)Reducing the dimensionality of data with neural networksscience 313(5786)504ndash507

[IMAI 1979] IMAI S (1979)Spectral envelope extraction by improved cepstrumIEICE 62217ndash228

[Kingma and Ba 2014] Kingma D P and Ba J (2014)Adam A method for stochastic optimizationarXiv preprint arXiv14126980

[Kingma and Welling 2013] Kingma D P and Welling M (2013)Auto-encoding variational bayesarXiv preprint arXiv13126114

[Oord et al 2016] Oord A v d Dieleman S Zen H Simonyan K Vinyals OGraves A Kalchbrenner N Senior A and Kavukcuoglu K (2016)Wavenet A generative model for raw audioarXiv preprint arXiv160903499

3336

References III

[Roche et al 2018] Roche F Hueber T Limier S and Girin L (2018)Autoencoders for music sound modeling a comparison of linear shallow deeprecurrent and variational modelsarXiv preprint arXiv180604096

[Roebel and Rodet 2005] Roebel A and Rodet X (2005)Efficient Spectral Envelope Estimation and its application to pitch shifting andenvelope preservationIn International Conference on Digital Audio Effects pages 30ndash35 Madrid Spaincote interne IRCAM Roebel05b

[Romani Picas et al 2015] Romani Picas O Parra Rodriguez H Dabiri DTokuda H Hariya W Oishi K and Serra X (2015)A real-time system for measuring sound goodness in instrumental soundsIn Audio Engineering Society Convention 138 Audio Engineering Society

[Sarroff and Casey 2014] Sarroff A M and Casey M A (2014)Musical audio synthesis using autoencoding neural netsIn ICMC

[Serra et al 1997] Serra X et al (1997)Musical sound modeling with sinusoids plus noiseMusical signal processing pages 91ndash122

3436

References IV

[Slawson 1981] Slawson W (1981)The color of sound a theoretical study in musical timbreMusic Theory Spectrum 3132ndash141

[Sohn et al 2015] Sohn K Lee H and Yan X (2015)Learning structured output representation using deep conditional generative models

In Advances in neural information processing systems pages 3483ndash3491

[Wyse 2018] Wyse L (2018)Real-valued parametric conditioning of an rnn for interactive sound synthesisarXiv preprint arXiv180510808

3536

Audio examples description I

1 Input MIDI 63 to Spectral Model

2 Spectral Model Reconstruction(trained on MIDI63)

3 Spectral Model Reconstruction(not trained on MIDI63)

4 Input MIDI 60 note to Parametric Model

5 Parametric Reconstruction of input note

6 Input MIDI 63 Note

7 Parametric AE reconstruction of input

8 Parametric CVAE reconstruction of input

9 Input MIDI 65 note(endpoint trained model)

10 Parametric AE reconstruction of input(endpoint trained model)

11 Parametric CVAE reconstruction of input(endpoint trained model)

12 CVAE Generated MIDI 65 Violin note

13 Similar MIDI 65 Violin note from dataset

3636

Audio examples description II

14 CVAE Reconstruction of the MIDI 65 violin note

15 CVAE Generated MIDI 65 Violin note with vibrato

  • Notes
      1. fdrm17
      2. fdrm16
      3. fdrm15
      4. fdrm14
      5. fdrm13
      6. fdrm12
      7. fdrm11
      8. fdrm10
      9. fdrm9
      10. fdrm8
      11. fdrm7
      12. fdrm6
      13. fdrm5
      14. fdrm3
      15. fdrm1
Page 44: A Variational Parametric Model for Audio Synthesiskrishnasubramani/data/ddp_stage… · analysis-synthesis based reconstruction which fails to model temporal evolution I[Engel et

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1736

Generative Models

Why VAE over AE

Continuous latent space from which we can sample points(andsynthesize the corresponding audio)

Why CVAE over VAE

Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control

1736

Generative Models

Why VAE over AE

Continuous latent space from which we can sample points(andsynthesize the corresponding audio)

Why CVAE over VAE

Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control

1736

Generative Models

Why VAE over AE

Continuous latent space from which we can sample points(andsynthesize the corresponding audio)

Why CVAE over VAE

Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control

1736

Generative Models

Why VAE over AE

Continuous latent space from which we can sample points(andsynthesize the corresponding audio)

Why CVAE over VAE

Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control

1836

Network Architecture

CCs

f0

En

cod

er

Z

Dec

od

er

CVAE

I Network input is CCs rarr MSE represents perceptually relevantdistance in terms of squared error between the input andreconstructed log magnitude spectral envelopes

I We train the network on frames from the train instances

I For evaluation MSE values calculated ahead is the averagereconstruction error across all the test instance frames

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

L prop MSE + βKLD

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

2036

Network Architecture

I Main hyperparameters -

2 Dimensionality of latent space - networks reconstruction ability

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

CVAEAE

Figure CVAE(β = 01) vs AE

I Steep fall initially flatter later Choose dimensionality = 32

2036

Network Architecture

I Main hyperparameters -

2 Dimensionality of latent space - networks reconstruction ability

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

CVAEAE

Figure CVAE(β = 01) vs AE

I Steep fall initially flatter later Choose dimensionality = 32

2036

Network Architecture

I Main hyperparameters -

2 Dimensionality of latent space - networks reconstruction ability

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

CVAEAE

Figure CVAE(β = 01) vs AE

I Steep fall initially flatter later Choose dimensionality = 32

2136

Network Architecture

I Network size [91 91 32 91 91]I 91 is the dimension of input CCs 32 is latent space

dimensionality

I Linear Fully Connected Layers + Leaky ReLU activations

I ADAM [Kingma and Ba 2014] with initial lr = 10minus3

I Training for 2000 epochs with batch size 512

2236

Experiments

I Two kinds of experiments to demonstrate networks capabilities

1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch

2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches

2236

Experiments

I Two kinds of experiments to demonstrate networks capabilities

1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch

2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches

2236

Experiments

I Two kinds of experiments to demonstrate networks capabilities

1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch

2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2436

Reconstruction

60 62 64 66 68 70 72

1

2

3

4

5middot10minus2

Pitch

MSE

AECVAE

I Both AE and CVAE reasonably reconstruct the target pitch1 6 2 7 3 8

I CVAE produces better reconstruction especially when thetarget pitch is far from the pitches available in the trainingdata(plot above) 4 9 5 10 6 11

I Conditioning helps to capture the pitch dependency of thespectral envelope more accurately

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton8)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton9)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton10)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton11)ocgs[i]state=false

1536

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton12)ocgs[i]state=false

1536

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton13)ocgs[i]state=false

1536

2536

Reconstruction

I To emulate the effect of pitch conditioning with an AE wetrain the AE by appending the pitch to the input CCs andreconstructing this appended input as shown below

f0

CCscAE

fprime

0

CCsprime

Figure lsquoConditionalrsquo AE(cAE)

I [Wyse 2018] followed a similar approach of appending theconditional variables to the input of his model

I the lsquocAErsquo is comparable to our proposed CVAE in that thenetwork might potentially learn something from f0

2636

Reconstruction

60 62 64 66 68 70 72

1

2

3

4

5middot10minus2

Pitch

MSE

AECVAEcAE

I We train the cAE on the octave endpoints and evaluateperformance as shown above

I Appending f0 does not improve performance It is infact worsethan AE

2736

Generation

I We are interested in lsquosynthesizingrsquo audio

I We see how well the network can generate an instance of adesired pitch(which the network has not been trained on)

Z

Dec

od

er

f0

Figure Sampling from the Network

I We follow the procedure in [Blaauw and Bonada 2016] - arandom walk to sample points coherently from the latent space

2836

Generation

I We train on instances across the entire octave sans MIDI 65and then generate MIDI 65

I On listening we can see that the generated audio still lacks thesoft noisy sound of the violin bowing

1 12 2 13 3 14

I For a more realistic synthesis we generate a violin note withvibrato 4 15

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton14)ocgs[i]state=false

2184

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton15)ocgs[i]state=false

2256

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton16)ocgs[i]state=false

2184

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton17)ocgs[i]state=false

2184

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimize the MSE (Mean Squared Error) between input and network reconstruction

• Simple to train, and perform good reconstruction (since they minimize MSE)

• Not truly a generative model, as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] - Inspired by Variational Inference; enforce a prior on the latent space

• Truly generative, as we can 'generate' new data by sampling from the prior

I Conditional Variational Autoencoders [Doersch 2016, Sohn et al 2015] - Same principle as a VAE, however it learns the conditional distribution over an additional conditioning variable


1736

Generative Models

Why VAE over AE?

Continuous latent space from which we can sample points (and synthesize the corresponding audio)

Why CVAE over VAE?

Conditioning on pitch ⇒ Network captures dependencies between the timbre and the pitch ⇒ More accurate envelope generation + Pitch control


1836

Network Architecture

Figure: CVAE block diagram. The input CCs, together with the conditioning pitch f0, pass through the Encoder to the latent variable Z; the Decoder maps Z (with f0) back to CCs.

I Network input is CCs → MSE represents a perceptually relevant distance, i.e. the squared error between the input and reconstructed log magnitude spectral envelopes (see the sketch below)

I We train the network on frames from the training instances

I For evaluation, the MSE values reported ahead are the average reconstruction error across all the test instance frames
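To make the CC/envelope link concrete, here is a minimal sketch using a plain truncated real cepstrum (the exact parameterization used in this work may differ). By Parseval's relation, the squared error between two CC vectors corresponds, up to a constant, to the squared error between the corresponding smooth log magnitude envelopes.

import numpy as np

def envelope_to_ccs(log_mag_envelope, n_ccs=91):
    # Real cepstrum of the log magnitude envelope; keep the low quefrencies.
    cepstrum = np.fft.irfft(log_mag_envelope)
    return cepstrum[:n_ccs]

def ccs_to_envelope(ccs, n_fft=1024):
    # Rebuild a smooth log magnitude envelope from the CCs (symmetric cepstrum).
    cepstrum = np.zeros(n_fft)
    cepstrum[:len(ccs)] = ccs
    cepstrum[-(len(ccs) - 1):] = ccs[1:][::-1]
    return np.fft.rfft(cepstrum).real

So an MSE between input and reconstructed CC vectors behaves like an MSE between the corresponding log envelopes, which is what makes it perceptually meaningful here.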


1936

Network Architecture

I Main hyperparameters -

1 β - Controls relative weighting between reconstruction and prior enforcement

L ∝ MSE + β · KLD

Figure: CVAE, varying β. MSE (order 10^-5) vs. latent space dimensionality for β = 0.01, 0.1, 1.

I Tradeoff between both terms; choose β = 0.1 (a sketch of this loss follows below)
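A minimal PyTorch-style sketch of this β-weighted objective, assuming a diagonal Gaussian posterior and a standard normal prior; the tensor names are illustrative.

import torch
import torch.nn.functional as F

def cvae_loss(cc_hat, cc, mu, log_var, beta=0.1):
    # Reconstruction term: MSE between input and reconstructed CCs.
    mse = F.mse_loss(cc_hat, cc)
    # KL divergence between q(z | x, f0) = N(mu, sigma^2) and the prior N(0, I).
    kld = -0.5 * torch.mean(1 + log_var - mu.pow(2) - log_var.exp())
    return mse + beta * kld  # L ∝ MSE + β · KLD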


2036

Network Architecture

I Main hyperparameters -

2 Dimensionality of latent space - governs the network's reconstruction ability

Figure: CVAE (β = 0.1) vs. AE. MSE (order 10^-5) vs. latent space dimensionality.

I Steep fall initially, flatter later; choose dimensionality = 32


2136

Network Architecture

I Network size: [91, 91, 32, 91, 91]
I 91 is the dimension of the input CCs; 32 is the latent space dimensionality

I Linear Fully Connected Layers + Leaky ReLU activations

I ADAM [Kingma and Ba 2014] with initial lr = 10^-3

I Training for 2000 epochs with batch size 512 (a sketch of this setup follows below)
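A minimal PyTorch sketch consistent with these choices; the exact way f0 is normalized, concatenated and batched is an assumption for illustration, not the released code.

import torch
import torch.nn as nn

class CVAE(nn.Module):
    def __init__(self, cc_dim=91, hidden=91, latent=32, cond_dim=1):
        super().__init__()
        # Encoder: [91 (+ f0)] -> 91 -> (mu, log_var), each of dimension 32
        self.enc = nn.Sequential(nn.Linear(cc_dim + cond_dim, hidden), nn.LeakyReLU())
        self.mu = nn.Linear(hidden, latent)
        self.log_var = nn.Linear(hidden, latent)
        # Decoder: [32 (+ f0)] -> 91 -> 91
        self.dec = nn.Sequential(nn.Linear(latent + cond_dim, hidden), nn.LeakyReLU(),
                                 nn.Linear(hidden, cc_dim))

    def forward(self, cc, f0):
        h = self.enc(torch.cat([cc, f0], dim=-1))
        mu, log_var = self.mu(h), self.log_var(h)
        z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()  # reparameterization trick
        return self.dec(torch.cat([z, f0], dim=-1)), mu, log_var

model = CVAE()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# For each of the 2000 epochs, iterate over minibatches of 512 frames:
#   cc_hat, mu, log_var = model(cc, f0)
#   loss = cvae_loss(cc_hat, cc, mu, log_var, beta=0.1)   # loss as sketched earlier
#   optimizer.zero_grad(); loss.backward(); optimizer.step()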

2236

Experiments

I Two kinds of experiments to demonstrate the network's capabilities

1 Reconstruction - Omit pitch instances during training and see how well the model reconstructs notes of the omitted target pitch

2 Generation - How well the model 'synthesizes' note instances with new, unseen pitches


2336

Reconstruction

I Two training contexts -

1 T (×) is the target, X marks training instances:
MIDI: T-3  T-2  T-1   T   T+1  T+2  T+3
Kept:  X    X    X    ×    X    X    X

2 Octave endpoints:
MIDI: 60  61  62  63  64  65  66  67  68  69  70  71
Kept:  X   ×   ×   ×   ×   ×   ×   ×   ×   ×   ×   X

I In each of the above cases, we compute the MSE as the frame-wise spectral envelope match across all frames of all the target instances (a sketch of this evaluation follows below)
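As an illustration of this protocol for the octave-endpoint case, a small sketch of the split and the frame-wise MSE evaluation; frames_by_midi and model.reconstruct are hypothetical stand-ins for the data loading and the trained network.

import numpy as np

train_midi = {60, 71}          # octave endpoints kept for training
test_midi = range(61, 71)      # omitted pitches used as targets

def average_frame_mse(model, frames_by_midi):
    # frames_by_midi: dict mapping MIDI pitch -> array (n_frames, 91) of input CCs
    errors = []
    for midi in test_midi:
        cc = frames_by_midi[midi]
        cc_hat = model.reconstruct(cc, f0=midi)   # hypothetical helper
        errors.append(np.mean((cc_hat - cc) ** 2, axis=1))
    return float(np.mean(np.concatenate(errors)))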


2436

Reconstruction

Figure: MSE (order 10^-2) vs. target pitch (MIDI 60 to 72) for AE and CVAE.

I Both AE and CVAE reasonably reconstruct the target pitch (audio examples 6, 7, 8)

I CVAE produces better reconstruction, especially when the target pitch is far from the pitches available in the training data (plot above; audio examples 9, 10, 11)

I Conditioning helps to capture the pitch dependency of the spectral envelope more accurately


2536

Reconstruction

I To emulate the effect of pitch conditioning with an AE, we train the AE by appending the pitch to the input CCs and reconstructing this appended input as shown below

Figure: 'Conditional' AE (cAE). The pitch f0 is appended to the input CCs, and the cAE reconstructs the appended vector (f0', CCs').

I [Wyse 2018] followed a similar approach of appending the conditional variables to the input of his model

I The 'cAE' is comparable to our proposed CVAE in that the network might potentially learn something from f0 (a sketch of this input construction follows below)
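A minimal sketch of this input construction (names are illustrative): the pitch is simply appended to the CC vector, and the autoencoder is trained to reproduce the concatenation.

import torch

def cae_batch(cc, f0):
    # cc: (batch, 91) cepstral coefficients, f0: (batch, 1) pitch value
    x = torch.cat([cc, f0], dim=-1)   # 92-dimensional appended input
    return x, x                       # the cAE target is the appended input itself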

2636

Reconstruction

Figure: MSE (order 10^-2) vs. target pitch (MIDI 60 to 72) for AE, CVAE and cAE.

I We train the cAE on the octave endpoints and evaluate performance as shown above

I Appending f0 does not improve performance; it is in fact worse than the AE

2736

Generation

I We are interested in 'synthesizing' audio

I We see how well the network can generate an instance of a desired pitch (which the network has not been trained on)

Figure: Sampling from the network. A latent vector Z and the conditioning pitch f0 are fed to the Decoder.

I We follow the procedure in [Blaauw and Bonada 2016] - a random walk to sample points coherently from the latent space (see the sketch below)
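A minimal sketch of such coherent sampling, reusing the decoder sketched earlier; the step size and the units of the pitch contour are assumptions for illustration.

import torch

def random_walk_decode(model, f0_contour, latent_dim=32, step=0.05):
    # f0_contour: tensor of shape (n_frames, 1), the desired pitch per frame
    z = torch.randn(1, latent_dim)               # start from the prior
    frames = []
    with torch.no_grad():
        for f0 in f0_contour:
            z = z + step * torch.randn_like(z)   # small Gaussian step: coherent trajectory
            cc_hat = model.dec(torch.cat([z, f0.view(1, 1)], dim=-1))
            frames.append(cc_hat)
    return torch.cat(frames, dim=0)              # frame-wise envelope (CC) sequence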

2836

Generation

I We train on instances across the entire octave sans MIDI 65, and then generate MIDI 65

I On listening, we can hear that the generated audio still lacks the soft noisy sound of the violin bowing (audio examples 12, 13, 14)

I For a more realistic synthesis we generate a violin note with vibrato (audio example 15; a sketch of the vibrato pitch contour follows below)
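One simple way to obtain the frame-wise pitch contour for such a vibrato note is a slow sinusoidal modulation around the target pitch; the rate, depth and hop size below are arbitrary illustration values.

import numpy as np

def vibrato_contour(midi=65, n_frames=400, hop_s=0.005, rate_hz=5.0, depth_semitones=0.3):
    t = np.arange(n_frames) * hop_s
    midi_contour = midi + depth_semitones * np.sin(2 * np.pi * rate_hz * t)
    return 440.0 * 2.0 ** ((midi_contour - 69) / 12)   # frame-wise f0 in Hz

This contour (in whatever units the conditioning variable actually uses) can then be passed frame by frame to the decoder, exactly as in the random-walk sketch above.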


2936

Putting it all together

I We explored autoencoder frameworks in generative models for audio synthesis of instrumental tones

I Our parametric representation decouples 'timbre' and 'pitch', thus relying on the network to model the inter-dependencies

I Pitch conditioning allows us to generate the learnt spectral envelope for that pitch, thus enabling us to vary the pitch contour continuously


3036

Putting it all together
I Our model is still far from perfect, however, and we have to improve it further

1 An immediate thing to do will be to model the residual of the input audio; this will help in making the generated audio sound more natural

2 We have not taken dynamics into account; a more complete system will involve capturing spectral envelope (timbral) dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis of larger pitch movements or melodic elements from Indian Classical music


3136

References I

[Blaauw and Bonada 2016] Blaauw, M. and Bonada, J. (2016). Modeling and transforming speech using variational autoencoders. In Interspeech, pages 1770–1774.

[Caetano and Rodet 2012] Caetano, M. and Rodet, X. (2012). A source-filter model for musical instrument sound transformation. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 137–140. IEEE.

[Doersch 2016] Doersch, C. (2016). Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.

[Engel et al 2017] Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D., and Simonyan, K. (2017). Neural audio synthesis of musical notes with WaveNet autoencoders. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1068–1077. JMLR.org.

[Esling et al 2018] Esling, P., Bitton, A., et al. (2018). Generative timbre spaces: regularizing variational auto-encoders with perceptual metrics. arXiv preprint arXiv:1805.08501.

3236

References II

[Griffin and Lim 1984] Griffin, D. and Lim, J. (1984). Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236–243.

[Hinton and Salakhutdinov 2006] Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507.

[IMAI 1979] Imai, S. (1979). Spectral envelope extraction by improved cepstrum. IEICE, 62:217–228.

[Kingma and Ba 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

[Kingma and Welling 2013] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

[Oord et al 2016] Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.

3336

References III

[Roche et al 2018] Roche, F., Hueber, T., Limier, S., and Girin, L. (2018). Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models. arXiv preprint arXiv:1806.04096.

[Roebel and Rodet 2005] Roebel, A. and Rodet, X. (2005). Efficient spectral envelope estimation and its application to pitch shifting and envelope preservation. In International Conference on Digital Audio Effects, pages 30–35, Madrid, Spain.

[Romani Picas et al 2015] Romani Picas, O., Parra Rodriguez, H., Dabiri, D., Tokuda, H., Hariya, W., Oishi, K., and Serra, X. (2015). A real-time system for measuring sound goodness in instrumental sounds. In Audio Engineering Society Convention 138. Audio Engineering Society.

[Sarroff and Casey 2014] Sarroff, A. M. and Casey, M. A. (2014). Musical audio synthesis using autoencoding neural nets. In ICMC.

[Serra et al 1997] Serra, X. et al. (1997). Musical sound modeling with sinusoids plus noise. Musical Signal Processing, pages 91–122.

3436

References IV

[Slawson 1981] Slawson, W. (1981). The color of sound: a theoretical study in musical timbre. Music Theory Spectrum, 3:132–141.

[Sohn et al 2015] Sohn, K., Lee, H., and Yan, X. (2015). Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483–3491.

[Wyse 2018] Wyse, L. (2018). Real-valued parametric conditioning of an RNN for interactive sound synthesis. arXiv preprint arXiv:1805.10808.

3536

Audio examples description I

1 Input MIDI 63 to Spectral Model

2 Spectral Model Reconstruction (trained on MIDI 63)

3 Spectral Model Reconstruction (not trained on MIDI 63)

4 Input MIDI 60 note to Parametric Model

5 Parametric Reconstruction of input note

6 Input MIDI 63 Note

7 Parametric AE reconstruction of input

8 Parametric CVAE reconstruction of input

9 Input MIDI 65 note (endpoint-trained model)

10 Parametric AE reconstruction of input (endpoint-trained model)

11 Parametric CVAE reconstruction of input (endpoint-trained model)

12 CVAE Generated MIDI 65 Violin note

13 Similar MIDI 65 Violin note from dataset

3636

Audio examples description II

14 CVAE Reconstruction of the MIDI 65 violin note

15 CVAE Generated MIDI 65 Violin note with vibrato

Page 46: A Variational Parametric Model for Audio Synthesiskrishnasubramani/data/ddp_stage… · analysis-synthesis based reconstruction which fails to model temporal evolution I[Engel et

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1736

Generative Models

Why VAE over AE

Continuous latent space from which we can sample points(andsynthesize the corresponding audio)

Why CVAE over VAE

Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control

1736

Generative Models

Why VAE over AE

Continuous latent space from which we can sample points(andsynthesize the corresponding audio)

Why CVAE over VAE

Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control

1736

Generative Models

Why VAE over AE

Continuous latent space from which we can sample points(andsynthesize the corresponding audio)

Why CVAE over VAE

Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control

1736

Generative Models

Why VAE over AE

Continuous latent space from which we can sample points(andsynthesize the corresponding audio)

Why CVAE over VAE

Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control

1836

Network Architecture

CCs

f0

En

cod

er

Z

Dec

od

er

CVAE

I Network input is CCs rarr MSE represents perceptually relevantdistance in terms of squared error between the input andreconstructed log magnitude spectral envelopes

I We train the network on frames from the train instances

I For evaluation MSE values calculated ahead is the averagereconstruction error across all the test instance frames

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

L prop MSE + βKLD

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

2036

Network Architecture

I Main hyperparameters -

2 Dimensionality of latent space - networks reconstruction ability

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

CVAEAE

Figure CVAE(β = 01) vs AE

I Steep fall initially flatter later Choose dimensionality = 32

2036

Network Architecture

I Main hyperparameters -

2 Dimensionality of latent space - networks reconstruction ability

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

CVAEAE

Figure CVAE(β = 01) vs AE

I Steep fall initially flatter later Choose dimensionality = 32

2036

Network Architecture

I Main hyperparameters -

2 Dimensionality of latent space - networks reconstruction ability

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

CVAEAE

Figure CVAE(β = 01) vs AE

I Steep fall initially flatter later Choose dimensionality = 32

2136

Network Architecture

I Network size [91 91 32 91 91]I 91 is the dimension of input CCs 32 is latent space

dimensionality

I Linear Fully Connected Layers + Leaky ReLU activations

I ADAM [Kingma and Ba 2014] with initial lr = 10minus3

I Training for 2000 epochs with batch size 512

2236

Experiments

I Two kinds of experiments to demonstrate networks capabilities

1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch

2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches

2236

Experiments

I Two kinds of experiments to demonstrate networks capabilities

1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch

2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches

2236

Experiments

I Two kinds of experiments to demonstrate networks capabilities

1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch

2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2436

Reconstruction

60 62 64 66 68 70 72

1

2

3

4

5middot10minus2

Pitch

MSE

AECVAE

I Both AE and CVAE reasonably reconstruct the target pitch1 6 2 7 3 8

I CVAE produces better reconstruction especially when thetarget pitch is far from the pitches available in the trainingdata(plot above) 4 9 5 10 6 11

I Conditioning helps to capture the pitch dependency of thespectral envelope more accurately

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton8)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton9)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton10)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton11)ocgs[i]state=false

1536

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton12)ocgs[i]state=false

1536

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton13)ocgs[i]state=false

1536

2536

Reconstruction

I To emulate the effect of pitch conditioning with an AE wetrain the AE by appending the pitch to the input CCs andreconstructing this appended input as shown below

f0

CCscAE

fprime

0

CCsprime

Figure lsquoConditionalrsquo AE(cAE)

I [Wyse 2018] followed a similar approach of appending theconditional variables to the input of his model

I the lsquocAErsquo is comparable to our proposed CVAE in that thenetwork might potentially learn something from f0

2636

Reconstruction

60 62 64 66 68 70 72

1

2

3

4

5middot10minus2

Pitch

MSE

AECVAEcAE

I We train the cAE on the octave endpoints and evaluateperformance as shown above

I Appending f0 does not improve performance It is infact worsethan AE

2736

Generation

I We are interested in lsquosynthesizingrsquo audio

I We see how well the network can generate an instance of adesired pitch(which the network has not been trained on)

Z

Dec

od

er

f0

Figure Sampling from the Network

I We follow the procedure in [Blaauw and Bonada 2016] - arandom walk to sample points coherently from the latent space

2836

Generation

I We train on instances across the entire octave sans MIDI 65and then generate MIDI 65

I On listening we can see that the generated audio still lacks thesoft noisy sound of the violin bowing

1 12 2 13 3 14

I For a more realistic synthesis we generate a violin note withvibrato 4 15

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton14)ocgs[i]state=false

2184

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton15)ocgs[i]state=false

2256

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton16)ocgs[i]state=false

2184

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton17)ocgs[i]state=false

2184

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3136

References I

[Blaauw and Bonada 2016] Blaauw M and Bonada J (2016)Modeling and transforming speech using variational autoencodersIn Interspeech pages 1770ndash1774

[Caetano and Rodet 2012] Caetano M and Rodet X (2012)A source-filter model for musical instrument sound transformationIn 2012 IEEE International Conference on Acoustics Speech and Signal Processing(ICASSP) pages 137ndash140 IEEE

[Doersch 2016] Doersch C (2016)Tutorial on variational autoencodersarXiv preprint arXiv160605908

[Engel et al 2017] Engel J Resnick C Roberts A Dieleman S Norouzi MEck D and Simonyan K (2017)Neural audio synthesis of musical notes with wavenet autoencodersIn Proceedings of the 34th International Conference on Machine Learning-Volume70 pages 1068ndash1077 JMLR org

[Esling et al 2018] Esling P Bitton A et al (2018)Generative timbre spaces regularizing variational auto-encoders with perceptualmetricsarXiv preprint arXiv180508501

3236

References II

[Griffin and Lim 1984] Griffin D and Lim J (1984)Signal estimation from modified short-time fourier transformIEEE Transactions on Acoustics Speech and Signal Processing 32(2)236ndash243

[Hinton and Salakhutdinov 2006] Hinton G E and Salakhutdinov R R (2006)Reducing the dimensionality of data with neural networksscience 313(5786)504ndash507

[IMAI 1979] IMAI S (1979)Spectral envelope extraction by improved cepstrumIEICE 62217ndash228

[Kingma and Ba 2014] Kingma D P and Ba J (2014)Adam A method for stochastic optimizationarXiv preprint arXiv14126980

[Kingma and Welling 2013] Kingma D P and Welling M (2013)Auto-encoding variational bayesarXiv preprint arXiv13126114

[Oord et al 2016] Oord A v d Dieleman S Zen H Simonyan K Vinyals OGraves A Kalchbrenner N Senior A and Kavukcuoglu K (2016)Wavenet A generative model for raw audioarXiv preprint arXiv160903499

3336

References III

[Roche et al 2018] Roche F Hueber T Limier S and Girin L (2018)Autoencoders for music sound modeling a comparison of linear shallow deeprecurrent and variational modelsarXiv preprint arXiv180604096

[Roebel and Rodet 2005] Roebel A and Rodet X (2005)Efficient Spectral Envelope Estimation and its application to pitch shifting andenvelope preservationIn International Conference on Digital Audio Effects pages 30ndash35 Madrid Spaincote interne IRCAM Roebel05b

[Romani Picas et al 2015] Romani Picas O Parra Rodriguez H Dabiri DTokuda H Hariya W Oishi K and Serra X (2015)A real-time system for measuring sound goodness in instrumental soundsIn Audio Engineering Society Convention 138 Audio Engineering Society

[Sarroff and Casey 2014] Sarroff A M and Casey M A (2014)Musical audio synthesis using autoencoding neural netsIn ICMC

[Serra et al 1997] Serra X et al (1997)Musical sound modeling with sinusoids plus noiseMusical signal processing pages 91ndash122

3436

References IV

[Slawson 1981] Slawson W (1981)The color of sound a theoretical study in musical timbreMusic Theory Spectrum 3132ndash141

[Sohn et al 2015] Sohn K Lee H and Yan X (2015)Learning structured output representation using deep conditional generative models

In Advances in neural information processing systems pages 3483ndash3491

[Wyse 2018] Wyse L (2018)Real-valued parametric conditioning of an rnn for interactive sound synthesisarXiv preprint arXiv180510808

3536

Audio examples description I

1 Input MIDI 63 to Spectral Model

2 Spectral Model Reconstruction(trained on MIDI63)

3 Spectral Model Reconstruction(not trained on MIDI63)

4 Input MIDI 60 note to Parametric Model

5 Parametric Reconstruction of input note

6 Input MIDI 63 Note

7 Parametric AE reconstruction of input

8 Parametric CVAE reconstruction of input

9 Input MIDI 65 note(endpoint trained model)

10 Parametric AE reconstruction of input(endpoint trained model)

11 Parametric CVAE reconstruction of input(endpoint trained model)

12 CVAE Generated MIDI 65 Violin note

13 Similar MIDI 65 Violin note from dataset

3636

Audio examples description II

14 CVAE Reconstruction of the MIDI 65 violin note

15 CVAE Generated MIDI 65 Violin note with vibrato

  • Notes
      1. fdrm17
      2. fdrm16
      3. fdrm15
      4. fdrm14
      5. fdrm13
      6. fdrm12
      7. fdrm11
      8. fdrm10
      9. fdrm9
      10. fdrm8
      11. fdrm7
      12. fdrm6
      13. fdrm5
      14. fdrm3
      15. fdrm1
Page 47: A Variational Parametric Model for Audio Synthesiskrishnasubramani/data/ddp_stage… · analysis-synthesis based reconstruction which fails to model temporal evolution I[Engel et

1636

Generative Models

I Autoencoders [Hinton and Salakhutdinov 2006] - Minimizethe MSE(Mean Squared Error) between input and networkreconstruction

bull Simple to train and performs good reconstruction(since theyminimize MSE)

bull Not truly a generative model as you cannot generate new data

I Variational Autoencoders [Kingma and Welling 2013] -Inspired from Variational Inference enforce a prior on thelatent space

bull Truly generative as we can lsquogeneratersquo new data by samplingfrom the prior

I Conditional Variational Autoencoders[Doersch 2016 Sohn et al 2015] - Same principle as a VAEhowever learns the conditional distribution over an additionalconditioning variable

1736

Generative Models

Why VAE over AE

Continuous latent space from which we can sample points(andsynthesize the corresponding audio)

Why CVAE over VAE

Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control

1736

Generative Models

Why VAE over AE

Continuous latent space from which we can sample points(andsynthesize the corresponding audio)

Why CVAE over VAE

Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control

1736

Generative Models

Why VAE over AE

Continuous latent space from which we can sample points(andsynthesize the corresponding audio)

Why CVAE over VAE

Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control

1736

Generative Models

Why VAE over AE

Continuous latent space from which we can sample points(andsynthesize the corresponding audio)

Why CVAE over VAE

Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control

1836

Network Architecture

CCs

f0

En

cod

er

Z

Dec

od

er

CVAE

I Network input is CCs rarr MSE represents perceptually relevantdistance in terms of squared error between the input andreconstructed log magnitude spectral envelopes

I We train the network on frames from the train instances

I For evaluation MSE values calculated ahead is the averagereconstruction error across all the test instance frames

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

L prop MSE + βKLD

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

2036

Network Architecture

I Main hyperparameters -

2 Dimensionality of latent space - networks reconstruction ability

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

CVAEAE

Figure CVAE(β = 01) vs AE

I Steep fall initially flatter later Choose dimensionality = 32


Network Architecture

I Network size: [91, 91, 32, 91, 91]

I 91 is the dimension of the input CCs, 32 is the latent space dimensionality

I Linear Fully Connected Layers + Leaky ReLU activations

I ADAM [Kingma and Ba 2014] with initial lr = 10⁻³

I Training for 2000 epochs with batch size 512
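
I A minimal sketch of such a network in PyTorch; feeding f0 to both the encoder and the decoder by concatenation is an assumption about the conditioning wiring:

    import torch
    import torch.nn as nn

    class CVAE(nn.Module):
        # Layer sizes follow the slide: [91, 91, 32, 91, 91]
        def __init__(self, cc_dim=91, hidden=91, latent=32):
            super().__init__()
            self.enc = nn.Sequential(nn.Linear(cc_dim + 1, hidden), nn.LeakyReLU())
            self.mu = nn.Linear(hidden, latent)
            self.logvar = nn.Linear(hidden, latent)
            self.dec = nn.Sequential(nn.Linear(latent + 1, hidden), nn.LeakyReLU(),
                                     nn.Linear(hidden, cc_dim))

        def forward(self, cc, f0):
            # cc: (batch, 91) CC frames, f0: (batch, 1) conditioning pitch
            h = self.enc(torch.cat([cc, f0], dim=-1))
            mu, logvar = self.mu(h), self.logvar(h)
            z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization
            return self.dec(torch.cat([z, f0], dim=-1)), mu, logvar

    model = CVAE()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # ADAM, initial lr = 10⁻³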


Experiments

I Two kinds of experiments to demonstrate the network's capabilities

1 Reconstruction - Omit pitch instances during training and see how well the model reconstructs notes of the omitted target pitch

2 Generation - How well the model 'synthesizes' note instances with new, unseen pitches


Reconstruction

I Two training contexts -

1 T (×) is the target, X marks training instances:
  MIDI:  T−3  T−2  T−1   T   T+1  T+2  T+3
  Kept:   X    X    X    ×    X    X    X

2 Octave endpoints:
  MIDI:  60  61  62  63  64  65
  Kept:   X   ×   ×   ×   ×   ×
  MIDI:  66  67  68  69  70  71
  Kept:   ×   ×   ×   ×   ×   X

I In each of the above cases we compute the MSE as the frame-wise spectral envelope match across all frames of all the target instances
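
I An illustrative evaluation sketch (variable names are assumptions); the error is computed on the CC vectors, which the slides use as a stand-in for the squared error between log magnitude envelopes:

    import numpy as np

    def envelope_mse(target_frames, reconstructed_frames):
        # Stack the frame-wise CC vectors of all target instances and average
        # the squared error over every frame of every instance.
        t = np.concatenate(target_frames, axis=0)  # shape: (total_frames, 91)
        r = np.concatenate(reconstructed_frames, axis=0)
        return np.mean((t - r) ** 2)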


Reconstruction

[Plot: MSE (×10⁻²) vs. target pitch (MIDI 60–72), AE vs. CVAE]

I Both AE and CVAE reasonably reconstruct the target pitch

I CVAE produces better reconstruction, especially when the target pitch is far from the pitches available in the training data (plot above)

I Conditioning helps to capture the pitch dependency of the spectral envelope more accurately


Reconstruction

I To emulate the effect of pitch conditioning with an AE, we train the AE by appending the pitch to the input CCs and reconstructing this appended input as shown below

[Diagram: CCs, f0 → cAE → CCs′, f0′]

Figure: 'Conditional' AE (cAE)

I [Wyse 2018] followed a similar approach of appending the conditional variables to the input of his model

I The 'cAE' is comparable to our proposed CVAE in that the network might potentially learn something from f0
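
I A minimal sketch of the cAE input/target construction (names are illustrative):

    import torch

    def cae_batch(cc, f0):
        # Append the pitch to the CC vector and train a plain autoencoder to
        # reconstruct the appended vector: [CCs, f0] -> [CCs', f0']
        x = torch.cat([cc, f0], dim=-1)  # 91 CCs + 1 pitch value = 92-dim input
        return x, x                      # target equals input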


Reconstruction

[Plot: MSE (×10⁻²) vs. target pitch (MIDI 60–72), AE vs. CVAE vs. cAE]

I We train the cAE on the octave endpoints and evaluate performance as shown above

I Appending f0 does not improve performance; it is in fact worse than the AE


Generation

I We are interested in 'synthesizing' audio

I We see how well the network can generate an instance of a desired pitch (which the network has not been trained on)

[Diagram: Z, f0 → Decoder]

Figure: Sampling from the Network

I We follow the procedure in [Blaauw and Bonada 2016] - a random walk to sample points coherently from the latent space
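
I A sketch of this sampling, assuming the CVAE decoder sketched earlier; the step size is an assumed value rather than the one used in [Blaauw and Bonada 2016]:

    import torch

    @torch.no_grad()
    def generate(model, f0_frames, latent_dim=32, step=0.01):
        # f0_frames: 1-D tensor with one (normalized) pitch value per frame
        z = torch.zeros(1, latent_dim)
        frames = []
        for f0 in f0_frames:
            z = z + step * torch.randn(1, latent_dim)  # coherent random walk in Z
            cc = model.dec(torch.cat([z, f0.view(1, 1)], dim=-1))
            frames.append(cc)
        return torch.cat(frames, dim=0)  # frame-wise CCs, to be turned into envelopes/audio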


Generation

I We train on instances across the entire octave sans MIDI 65, and then generate MIDI 65

I On listening, we find that the generated audio still lacks the soft, noisy sound of the violin bowing


I For a more realistic synthesis, we generate a violin note with vibrato
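
I One way to build the frame-wise pitch input for such a vibrato note; the rate, depth and hop values below are illustrative assumptions:

    import numpy as np

    def vibrato_f0(midi=65, dur=2.0, rate=6.0, depth_cents=30, hop=0.005):
        t = np.arange(0, dur, hop)                  # one value per analysis frame
        f_center = 440.0 * 2 ** ((midi - 69) / 12)  # MIDI 65 ≈ 349.2 Hz
        cents = depth_cents * np.sin(2 * np.pi * rate * t)
        return f_center * 2 ** (cents / 1200)       # f0 contour fed to the decoder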


Putting it all together

I We explored autoencoder frameworks in generative models for audio synthesis of instrumental tones

I Our parametric representation decouples 'timbre' and 'pitch', relying on the network to model their inter-dependencies

I Pitch conditioning allows us to generate the learnt spectral envelope for that pitch, thus enabling us to vary the pitch contour continuously


Putting it all together

I Our model is still far from perfect, however, and we have to improve it further:

1 An immediate next step is to model the residual of the input audio; this will help make the generated audio sound more natural

2 We have not taken dynamics into account; a more complete system will involve capturing spectral envelope (timbral) dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis of larger pitch movements or melodic elements from Indian Classical music

I Our contributions

1 To the best of our knowledge, we have not come across any work using a parametric model for musical tones in the neural synthesis framework, especially one exploiting the conditioning function of the CVAE

2 The TAE method has no known Python implementation, so we plan to make our code open-source to the MIR community for research


References I

[Blaauw and Bonada 2016] Blaauw, M. and Bonada, J. (2016). Modeling and transforming speech using variational autoencoders. In Interspeech, pages 1770–1774.

[Caetano and Rodet 2012] Caetano, M. and Rodet, X. (2012). A source-filter model for musical instrument sound transformation. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 137–140. IEEE.

[Doersch 2016] Doersch, C. (2016). Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.

[Engel et al. 2017] Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D., and Simonyan, K. (2017). Neural audio synthesis of musical notes with WaveNet autoencoders. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1068–1077. JMLR.org.

[Esling et al. 2018] Esling, P., Bitton, A., et al. (2018). Generative timbre spaces: regularizing variational auto-encoders with perceptual metrics. arXiv preprint arXiv:1805.08501.


References II

[Griffin and Lim 1984] Griffin, D. and Lim, J. (1984). Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236–243.

[Hinton and Salakhutdinov 2006] Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507.

[Imai 1979] Imai, S. (1979). Spectral envelope extraction by improved cepstrum. IEICE, 62:217–228.

[Kingma and Ba 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

[Kingma and Welling 2013] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

[Oord et al. 2016] Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.


References III

[Roche et al. 2018] Roche, F., Hueber, T., Limier, S., and Girin, L. (2018). Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models. arXiv preprint arXiv:1806.04096.

[Roebel and Rodet 2005] Roebel, A. and Rodet, X. (2005). Efficient spectral envelope estimation and its application to pitch shifting and envelope preservation. In International Conference on Digital Audio Effects, pages 30–35, Madrid, Spain.

[Romani Picas et al. 2015] Romani Picas, O., Parra Rodriguez, H., Dabiri, D., Tokuda, H., Hariya, W., Oishi, K., and Serra, X. (2015). A real-time system for measuring sound goodness in instrumental sounds. In Audio Engineering Society Convention 138. Audio Engineering Society.

[Sarroff and Casey 2014] Sarroff, A. M. and Casey, M. A. (2014). Musical audio synthesis using autoencoding neural nets. In ICMC.

[Serra et al. 1997] Serra, X. et al. (1997). Musical sound modeling with sinusoids plus noise. Musical Signal Processing, pages 91–122.


References IV

[Slawson 1981] Slawson, W. (1981). The color of sound: a theoretical study in musical timbre. Music Theory Spectrum, 3:132–141.

[Sohn et al. 2015] Sohn, K., Lee, H., and Yan, X. (2015). Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483–3491.

[Wyse 2018] Wyse, L. (2018). Real-valued parametric conditioning of an RNN for interactive sound synthesis. arXiv preprint arXiv:1805.10808.


Audio examples description I

1 Input MIDI 63 to Spectral Model

2 Spectral Model Reconstruction (trained on MIDI 63)

3 Spectral Model Reconstruction (not trained on MIDI 63)

4 Input MIDI 60 note to Parametric Model

5 Parametric Reconstruction of input note

6 Input MIDI 63 Note

7 Parametric AE reconstruction of input

8 Parametric CVAE reconstruction of input

9 Input MIDI 65 note (endpoint trained model)

10 Parametric AE reconstruction of input (endpoint trained model)

11 Parametric CVAE reconstruction of input (endpoint trained model)

12 CVAE Generated MIDI 65 Violin note

13 Similar MIDI 65 Violin note from dataset


Audio examples description II

14 CVAE Reconstruction of the MIDI 65 violin note

15 CVAE Generated MIDI 65 Violin note with vibrato

Page 48: A Variational Parametric Model for Audio Synthesiskrishnasubramani/data/ddp_stage… · analysis-synthesis based reconstruction which fails to model temporal evolution I[Engel et

1736

Generative Models

Why VAE over AE

Continuous latent space from which we can sample points(andsynthesize the corresponding audio)

Why CVAE over VAE

Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control

1736

Generative Models

Why VAE over AE

Continuous latent space from which we can sample points(andsynthesize the corresponding audio)

Why CVAE over VAE

Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control

1736

Generative Models

Why VAE over AE

Continuous latent space from which we can sample points(andsynthesize the corresponding audio)

Why CVAE over VAE

Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control

1736

Generative Models

Why VAE over AE

Continuous latent space from which we can sample points(andsynthesize the corresponding audio)

Why CVAE over VAE

Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control

1836

Network Architecture

CCs

f0

En

cod

er

Z

Dec

od

er

CVAE

I Network input is CCs rarr MSE represents perceptually relevantdistance in terms of squared error between the input andreconstructed log magnitude spectral envelopes

I We train the network on frames from the train instances

I For evaluation MSE values calculated ahead is the averagereconstruction error across all the test instance frames

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

L prop MSE + βKLD

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

2036

Network Architecture

I Main hyperparameters -

2 Dimensionality of latent space - networks reconstruction ability

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

CVAEAE

Figure CVAE(β = 01) vs AE

I Steep fall initially flatter later Choose dimensionality = 32

2036

Network Architecture

I Main hyperparameters -

2 Dimensionality of latent space - networks reconstruction ability

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

CVAEAE

Figure CVAE(β = 01) vs AE

I Steep fall initially flatter later Choose dimensionality = 32

2036

Network Architecture

I Main hyperparameters -

2 Dimensionality of latent space - networks reconstruction ability

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

CVAEAE

Figure CVAE(β = 01) vs AE

I Steep fall initially flatter later Choose dimensionality = 32

2136

Network Architecture

I Network size [91 91 32 91 91]I 91 is the dimension of input CCs 32 is latent space

dimensionality

I Linear Fully Connected Layers + Leaky ReLU activations

I ADAM [Kingma and Ba 2014] with initial lr = 10minus3

I Training for 2000 epochs with batch size 512

2236

Experiments

I Two kinds of experiments to demonstrate networks capabilities

1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch

2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches

2236

Experiments

I Two kinds of experiments to demonstrate networks capabilities

1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch

2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches

2236

Experiments

I Two kinds of experiments to demonstrate networks capabilities

1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch

2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2436

Reconstruction

60 62 64 66 68 70 72

1

2

3

4

5middot10minus2

Pitch

MSE

AECVAE

I Both AE and CVAE reasonably reconstruct the target pitch1 6 2 7 3 8

I CVAE produces better reconstruction especially when thetarget pitch is far from the pitches available in the trainingdata(plot above) 4 9 5 10 6 11

I Conditioning helps to capture the pitch dependency of thespectral envelope more accurately

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton8)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton9)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton10)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton11)ocgs[i]state=false

1536

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton12)ocgs[i]state=false

1536

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton13)ocgs[i]state=false

1536

2536

Reconstruction

I To emulate the effect of pitch conditioning with an AE wetrain the AE by appending the pitch to the input CCs andreconstructing this appended input as shown below

f0

CCscAE

fprime

0

CCsprime

Figure lsquoConditionalrsquo AE(cAE)

I [Wyse 2018] followed a similar approach of appending theconditional variables to the input of his model

I the lsquocAErsquo is comparable to our proposed CVAE in that thenetwork might potentially learn something from f0

2636

Reconstruction

60 62 64 66 68 70 72

1

2

3

4

5middot10minus2

Pitch

MSE

AECVAEcAE

I We train the cAE on the octave endpoints and evaluateperformance as shown above

I Appending f0 does not improve performance It is infact worsethan AE

2736

Generation

I We are interested in lsquosynthesizingrsquo audio

I We see how well the network can generate an instance of adesired pitch(which the network has not been trained on)

Z

Dec

od

er

f0

Figure Sampling from the Network

I We follow the procedure in [Blaauw and Bonada 2016] - arandom walk to sample points coherently from the latent space

2836

Generation

I We train on instances across the entire octave sans MIDI 65and then generate MIDI 65

I On listening we can see that the generated audio still lacks thesoft noisy sound of the violin bowing

1 12 2 13 3 14

I For a more realistic synthesis we generate a violin note withvibrato 4 15

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton14)ocgs[i]state=false

2184

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton15)ocgs[i]state=false

2256

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton16)ocgs[i]state=false

2184

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton17)ocgs[i]state=false

2184

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3136

References I

[Blaauw and Bonada 2016] Blaauw M and Bonada J (2016)Modeling and transforming speech using variational autoencodersIn Interspeech pages 1770ndash1774

[Caetano and Rodet 2012] Caetano M and Rodet X (2012)A source-filter model for musical instrument sound transformationIn 2012 IEEE International Conference on Acoustics Speech and Signal Processing(ICASSP) pages 137ndash140 IEEE

[Doersch 2016] Doersch C (2016)Tutorial on variational autoencodersarXiv preprint arXiv160605908

[Engel et al 2017] Engel J Resnick C Roberts A Dieleman S Norouzi MEck D and Simonyan K (2017)Neural audio synthesis of musical notes with wavenet autoencodersIn Proceedings of the 34th International Conference on Machine Learning-Volume70 pages 1068ndash1077 JMLR org

[Esling et al 2018] Esling P Bitton A et al (2018)Generative timbre spaces regularizing variational auto-encoders with perceptualmetricsarXiv preprint arXiv180508501

3236

References II

[Griffin and Lim 1984] Griffin D and Lim J (1984)Signal estimation from modified short-time fourier transformIEEE Transactions on Acoustics Speech and Signal Processing 32(2)236ndash243

[Hinton and Salakhutdinov 2006] Hinton G E and Salakhutdinov R R (2006)Reducing the dimensionality of data with neural networksscience 313(5786)504ndash507

[IMAI 1979] IMAI S (1979)Spectral envelope extraction by improved cepstrumIEICE 62217ndash228

[Kingma and Ba 2014] Kingma D P and Ba J (2014)Adam A method for stochastic optimizationarXiv preprint arXiv14126980

[Kingma and Welling 2013] Kingma D P and Welling M (2013)Auto-encoding variational bayesarXiv preprint arXiv13126114

[Oord et al 2016] Oord A v d Dieleman S Zen H Simonyan K Vinyals OGraves A Kalchbrenner N Senior A and Kavukcuoglu K (2016)Wavenet A generative model for raw audioarXiv preprint arXiv160903499

3336

References III

[Roche et al 2018] Roche F Hueber T Limier S and Girin L (2018)Autoencoders for music sound modeling a comparison of linear shallow deeprecurrent and variational modelsarXiv preprint arXiv180604096

[Roebel and Rodet 2005] Roebel A and Rodet X (2005)Efficient Spectral Envelope Estimation and its application to pitch shifting andenvelope preservationIn International Conference on Digital Audio Effects pages 30ndash35 Madrid Spaincote interne IRCAM Roebel05b

[Romani Picas et al 2015] Romani Picas O Parra Rodriguez H Dabiri DTokuda H Hariya W Oishi K and Serra X (2015)A real-time system for measuring sound goodness in instrumental soundsIn Audio Engineering Society Convention 138 Audio Engineering Society

[Sarroff and Casey 2014] Sarroff A M and Casey M A (2014)Musical audio synthesis using autoencoding neural netsIn ICMC

[Serra et al 1997] Serra X et al (1997)Musical sound modeling with sinusoids plus noiseMusical signal processing pages 91ndash122

3436

References IV

[Slawson 1981] Slawson W (1981)The color of sound a theoretical study in musical timbreMusic Theory Spectrum 3132ndash141

[Sohn et al 2015] Sohn K Lee H and Yan X (2015)Learning structured output representation using deep conditional generative models

In Advances in neural information processing systems pages 3483ndash3491

[Wyse 2018] Wyse L (2018)Real-valued parametric conditioning of an rnn for interactive sound synthesisarXiv preprint arXiv180510808

3536

Audio examples description I

1 Input MIDI 63 to Spectral Model

2 Spectral Model Reconstruction(trained on MIDI63)

3 Spectral Model Reconstruction(not trained on MIDI63)

4 Input MIDI 60 note to Parametric Model

5 Parametric Reconstruction of input note

6 Input MIDI 63 Note

7 Parametric AE reconstruction of input

8 Parametric CVAE reconstruction of input

9 Input MIDI 65 note(endpoint trained model)

10 Parametric AE reconstruction of input(endpoint trained model)

11 Parametric CVAE reconstruction of input(endpoint trained model)

12 CVAE Generated MIDI 65 Violin note

13 Similar MIDI 65 Violin note from dataset

3636

Audio examples description II

14 CVAE Reconstruction of the MIDI 65 violin note

15 CVAE Generated MIDI 65 Violin note with vibrato

  • Notes
      1. fdrm17
      2. fdrm16
      3. fdrm15
      4. fdrm14
      5. fdrm13
      6. fdrm12
      7. fdrm11
      8. fdrm10
      9. fdrm9
      10. fdrm8
      11. fdrm7
      12. fdrm6
      13. fdrm5
      14. fdrm3
      15. fdrm1
Page 49: A Variational Parametric Model for Audio Synthesiskrishnasubramani/data/ddp_stage… · analysis-synthesis based reconstruction which fails to model temporal evolution I[Engel et

1736

Generative Models

Why VAE over AE

Continuous latent space from which we can sample points(andsynthesize the corresponding audio)

Why CVAE over VAE

Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control

1736

Generative Models

Why VAE over AE

Continuous latent space from which we can sample points(andsynthesize the corresponding audio)

Why CVAE over VAE

Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control

1736

Generative Models

Why VAE over AE

Continuous latent space from which we can sample points(andsynthesize the corresponding audio)

Why CVAE over VAE

Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control

1836

Network Architecture

CCs

f0

En

cod

er

Z

Dec

od

er

CVAE

I Network input is CCs rarr MSE represents perceptually relevantdistance in terms of squared error between the input andreconstructed log magnitude spectral envelopes

I We train the network on frames from the train instances

I For evaluation MSE values calculated ahead is the averagereconstruction error across all the test instance frames

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

L prop MSE + βKLD

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

2036

Network Architecture

I Main hyperparameters -

2 Dimensionality of latent space - networks reconstruction ability

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

CVAEAE

Figure CVAE(β = 01) vs AE

I Steep fall initially flatter later Choose dimensionality = 32

2036

Network Architecture

I Main hyperparameters -

2 Dimensionality of latent space - networks reconstruction ability

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

CVAEAE

Figure CVAE(β = 01) vs AE

I Steep fall initially flatter later Choose dimensionality = 32

2036

Network Architecture

I Main hyperparameters -

2 Dimensionality of latent space - networks reconstruction ability

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

CVAEAE

Figure CVAE(β = 01) vs AE

I Steep fall initially flatter later Choose dimensionality = 32

2136

Network Architecture

I Network size [91 91 32 91 91]I 91 is the dimension of input CCs 32 is latent space

dimensionality

I Linear Fully Connected Layers + Leaky ReLU activations

I ADAM [Kingma and Ba 2014] with initial lr = 10minus3

I Training for 2000 epochs with batch size 512

2236

Experiments

I Two kinds of experiments to demonstrate networks capabilities

1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch

2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches

2236

Experiments

I Two kinds of experiments to demonstrate networks capabilities

1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch

2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches

2236

Experiments

I Two kinds of experiments to demonstrate networks capabilities

1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch

2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2436

Reconstruction

60 62 64 66 68 70 72

1

2

3

4

5middot10minus2

Pitch

MSE

AECVAE

I Both AE and CVAE reasonably reconstruct the target pitch1 6 2 7 3 8

I CVAE produces better reconstruction especially when thetarget pitch is far from the pitches available in the trainingdata(plot above) 4 9 5 10 6 11

I Conditioning helps to capture the pitch dependency of thespectral envelope more accurately

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton8)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton9)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton10)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton11)ocgs[i]state=false

1536

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton12)ocgs[i]state=false

1536

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton13)ocgs[i]state=false

1536

2536

Reconstruction

I To emulate the effect of pitch conditioning with an AE wetrain the AE by appending the pitch to the input CCs andreconstructing this appended input as shown below

f0

CCscAE

fprime

0

CCsprime

Figure lsquoConditionalrsquo AE(cAE)

I [Wyse 2018] followed a similar approach of appending theconditional variables to the input of his model

I the lsquocAErsquo is comparable to our proposed CVAE in that thenetwork might potentially learn something from f0

2636

Reconstruction

60 62 64 66 68 70 72

1

2

3

4

5middot10minus2

Pitch

MSE

AECVAEcAE

I We train the cAE on the octave endpoints and evaluateperformance as shown above

I Appending f0 does not improve performance It is infact worsethan AE

2736

Generation

I We are interested in lsquosynthesizingrsquo audio

I We see how well the network can generate an instance of adesired pitch(which the network has not been trained on)

Z

Dec

od

er

f0

Figure Sampling from the Network

I We follow the procedure in [Blaauw and Bonada 2016] - arandom walk to sample points coherently from the latent space

2836

Generation

I We train on instances across the entire octave sans MIDI 65and then generate MIDI 65

I On listening we can see that the generated audio still lacks thesoft noisy sound of the violin bowing

1 12 2 13 3 14

I For a more realistic synthesis we generate a violin note withvibrato 4 15

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton14)ocgs[i]state=false

2184

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton15)ocgs[i]state=false

2256

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton16)ocgs[i]state=false

2184

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton17)ocgs[i]state=false

2184

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3136

References I

[Blaauw and Bonada, 2016] Blaauw, M. and Bonada, J. (2016). Modeling and transforming speech using variational autoencoders. In Interspeech, pages 1770–1774.

[Caetano and Rodet, 2012] Caetano, M. and Rodet, X. (2012). A source-filter model for musical instrument sound transformation. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 137–140. IEEE.

[Doersch, 2016] Doersch, C. (2016). Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.

[Engel et al., 2017] Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D., and Simonyan, K. (2017). Neural audio synthesis of musical notes with WaveNet autoencoders. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1068–1077. JMLR.org.

[Esling et al., 2018] Esling, P., Bitton, A., et al. (2018). Generative timbre spaces: regularizing variational auto-encoders with perceptual metrics. arXiv preprint arXiv:1805.08501.


References II

[Griffin and Lim, 1984] Griffin, D. and Lim, J. (1984). Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236–243.

[Hinton and Salakhutdinov, 2006] Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507.

[IMAI, 1979] Imai, S. (1979). Spectral envelope extraction by improved cepstrum. IEICE, 62:217–228.

[Kingma and Ba, 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

[Kingma and Welling, 2013] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

[Oord et al., 2016] Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.


References III

[Roche et al., 2018] Roche, F., Hueber, T., Limier, S., and Girin, L. (2018). Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models. arXiv preprint arXiv:1806.04096.

[Roebel and Rodet, 2005] Roebel, A. and Rodet, X. (2005). Efficient spectral envelope estimation and its application to pitch shifting and envelope preservation. In International Conference on Digital Audio Effects, pages 30–35, Madrid, Spain.

[Romani Picas et al., 2015] Romani Picas, O., Parra Rodriguez, H., Dabiri, D., Tokuda, H., Hariya, W., Oishi, K., and Serra, X. (2015). A real-time system for measuring sound goodness in instrumental sounds. In Audio Engineering Society Convention 138. Audio Engineering Society.

[Sarroff and Casey, 2014] Sarroff, A. M. and Casey, M. A. (2014). Musical audio synthesis using autoencoding neural nets. In ICMC.

[Serra et al., 1997] Serra, X. et al. (1997). Musical sound modeling with sinusoids plus noise. Musical Signal Processing, pages 91–122.


References IV

[Slawson, 1981] Slawson, W. (1981). The color of sound: a theoretical study in musical timbre. Music Theory Spectrum, 3:132–141.

[Sohn et al., 2015] Sohn, K., Lee, H., and Yan, X. (2015). Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483–3491.

[Wyse, 2018] Wyse, L. (2018). Real-valued parametric conditioning of an RNN for interactive sound synthesis. arXiv preprint arXiv:1805.10808.


Audio examples description I

1 Input MIDI 63 to Spectral Model

2 Spectral Model Reconstruction (trained on MIDI 63)

3 Spectral Model Reconstruction (not trained on MIDI 63)

4 Input MIDI 60 note to Parametric Model

5 Parametric Reconstruction of input note

6 Input MIDI 63 Note

7 Parametric AE reconstruction of input

8 Parametric CVAE reconstruction of input

9 Input MIDI 65 note (endpoint-trained model)

10 Parametric AE reconstruction of input (endpoint-trained model)

11 Parametric CVAE reconstruction of input (endpoint-trained model)

12 CVAE Generated MIDI 65 Violin note

13 Similar MIDI 65 Violin note from dataset


Audio examples description II

14 CVAE Reconstruction of the MIDI 65 violin note

15 CVAE Generated MIDI 65 Violin note with vibrato


Generative Models

Why VAE over AE?

Continuous latent space from which we can sample points (and synthesize the corresponding audio)

Why CVAE over VAE?

Conditioning on pitch ⇒ the network captures dependencies between timbre and pitch ⇒ more accurate envelope generation + pitch control


Network Architecture

Figure: CVAE block diagram - input CCs and pitch f0 → Encoder → latent Z → Decoder (conditioned on f0) → reconstructed CCs

I Network input is CCs → MSE represents a perceptually relevant distance, in terms of squared error between the input and reconstructed log magnitude spectral envelopes

I We train the network on frames from the training instances

I For evaluation, the MSE values reported ahead are the average reconstruction error across all the test instance frames
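For reference, a small NumPy sketch of how a log-magnitude envelope is recovered from real cepstral coefficients, which is why squared error on the CC vector tracks squared error between log envelopes (the array sizes are assumptions):

    import numpy as np

    def envelope_from_ccs(ccs, n_freq=513):
        # log|H(w)| = c0 + 2 * sum_k c_k * cos(k * w) for a real cepstrum,
        # so an MSE on the CC vector mirrors an MSE between log envelopes.
        w = np.linspace(0.0, np.pi, n_freq)
        k = np.arange(1, len(ccs))
        return ccs[0] + 2.0 * np.cos(np.outer(w, k)) @ ccs[1:]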

Network Architecture

I Main hyperparameters -

1 β - Controls relative weighting between reconstruction and prior enforcement

L ∝ MSE + β · KLD

Figure: CVAE reconstruction MSE (×10⁻⁵) vs. latent space dimensionality, for β = 0.01, 0.1 and 1

I Tradeoff between both terms; choose β = 0.1
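A minimal PyTorch sketch of this β-weighted objective (the batch reduction and variable names are assumptions; the slide only specifies L ∝ MSE + β · KLD):

    import torch
    import torch.nn.functional as F

    def cvae_loss(cc_in, cc_out, mu, logvar, beta=0.1):
        # Reconstruction term: MSE between input and reconstructed CC frames.
        mse = F.mse_loss(cc_out, cc_in)
        # KL divergence of the approximate posterior N(mu, sigma^2) from N(0, I).
        kld = -0.5 * torch.mean(torch.sum(1.0 + logvar - mu.pow(2) - logvar.exp(), dim=1))
        return mse + beta * kld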


Network Architecture

I Main hyperparameters -

2 Dimensionality of latent space - governs the network's reconstruction ability

Figure: Reconstruction MSE (×10⁻⁵) vs. latent space dimensionality, CVAE (β = 0.1) vs. AE

I Steep fall initially, flatter later; choose dimensionality = 32


Network Architecture

I Network size [91, 91, 32, 91, 91]

I 91 is the dimension of the input CCs, 32 is the latent space dimensionality

I Linear fully connected layers + Leaky ReLU activations

I ADAM [Kingma and Ba, 2014] with initial lr = 10⁻³

I Training for 2000 epochs with batch size 512
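A minimal PyTorch sketch matching the sizes above; how f0 is appended to the encoder and decoder inputs is our assumption about the conditioning, not a verbatim copy of the implementation:

    import torch
    import torch.nn as nn

    class CVAE(nn.Module):
        def __init__(self, cc_dim=91, hidden=91, latent=32, cond_dim=1):
            super().__init__()
            # Encoder: [CCs, f0] -> 91 -> (mu, logvar) of the 32-d latent
            self.enc = nn.Sequential(nn.Linear(cc_dim + cond_dim, hidden), nn.LeakyReLU())
            self.mu = nn.Linear(hidden, latent)
            self.logvar = nn.Linear(hidden, latent)
            # Decoder: [z, f0] -> 91 -> reconstructed CCs
            self.dec = nn.Sequential(
                nn.Linear(latent + cond_dim, hidden), nn.LeakyReLU(),
                nn.Linear(hidden, cc_dim),
            )

        def forward(self, cc, f0):
            h = self.enc(torch.cat([cc, f0], dim=1))
            mu, logvar = self.mu(h), self.logvar(h)
            z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization
            return self.dec(torch.cat([z, f0], dim=1)), mu, logvar

    model = CVAE()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # 2000 epochs, batch size 512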


Experiments

I Two kinds of experiments to demonstrate the network's capabilities

1 Reconstruction - Omit pitch instances during training and see how well the model reconstructs notes of the omitted target pitch

2 Generation - How well the model 'synthesizes' note instances with new, unseen pitches


Reconstruction

I Two training contexts -

1 T (×) is the target, X marks training instances:
   MIDI:  T-3  T-2  T-1   T   T+1  T+2  T+3
   Kept:   X    X    X    ×    X    X    X

2 Octave endpoints:
   MIDI:  60  61  62  63  64  65
   Kept:   X   ×   ×   ×   ×   ×
   MIDI:  66  67  68  69  70  71
   Kept:   ×   ×   ×   ×   ×   X

I In each of the above cases, we compute the MSE as the frame-wise spectral envelope match across all frames of all the target instances
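A small sketch of how these two splits might be set up; the (pitch, frames) data layout and the helper name are assumptions, only the kept/omitted pitches come from the tables above:

    def split_by_pitch(notes, context="neighbourhood", target=63):
        # notes: list of (midi_pitch, cc_frames) pairs for one instrument
        if context == "neighbourhood":
            train_pitches = {target + d for d in (-3, -2, -1, 1, 2, 3)}
        else:                                   # "endpoints": keep only MIDI 60 and 71
            train_pitches = {60, 71}
        train = [n for n in notes if n[0] in train_pitches]
        test = [n for n in notes if 60 <= n[0] <= 71 and n[0] not in train_pitches]
        return train, test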


Reconstruction

Figure: Reconstruction MSE (×10⁻²) vs. target pitch (MIDI 60-72) for AE and CVAE

I Both AE and CVAE reasonably reconstruct the target pitch (audio examples 6, 7, 8)

I CVAE produces better reconstruction, especially when the target pitch is far from the pitches available in the training data (plot above; audio examples 9, 10, 11)

I Conditioning helps to capture the pitch dependency of the spectral envelope more accurately
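A sketch of the evaluation just described, assuming the CVAE sketch from earlier and (pitch, frames) test data; the average is over all frames of all target instances:

    import torch

    def average_reconstruction_mse(model, test_notes):
        errors = []
        model.eval()
        with torch.no_grad():
            for pitch, frames in test_notes:
                cc = torch.as_tensor(frames, dtype=torch.float32)   # (n_frames, 91)
                f0 = torch.full((cc.shape[0], 1), float(pitch))     # per-frame condition
                cc_hat, _, _ = model(cc, f0)
                errors.append(torch.mean((cc_hat - cc) ** 2).item())
        return sum(errors) / len(errors)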


Reconstruction

I To emulate the effect of pitch conditioning with an AE, we train the AE by appending the pitch to the input CCs and reconstructing this appended input as shown below

Figure: 'Conditional' AE (cAE) - the input [CCs, f0] is autoencoded to [CCs', f0']

I [Wyse, 2018] followed a similar approach of appending the conditional variables to the input of his model

I The 'cAE' is comparable to our proposed CVAE in that the network might potentially learn something from f0
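A minimal sketch of this cAE baseline; the layer sizes mirror the earlier network and the class name is an assumption:

    import torch
    import torch.nn as nn

    class CAE(nn.Module):
        # Plain autoencoder over [CCs, f0]: pitch is just one more input dimension
        # to reconstruct, not an explicit conditioning variable as in the CVAE.
        def __init__(self, cc_dim=91, hidden=91, latent=32):
            super().__init__()
            self.enc = nn.Sequential(nn.Linear(cc_dim + 1, hidden), nn.LeakyReLU(),
                                     nn.Linear(hidden, latent))
            self.dec = nn.Sequential(nn.Linear(latent, hidden), nn.LeakyReLU(),
                                     nn.Linear(hidden, cc_dim + 1))

        def forward(self, cc, f0):
            x = torch.cat([cc, f0], dim=1)   # append pitch to the input CCs
            return self.dec(self.enc(x))     # reconstruct [CCs', f0']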


Reconstruction

Figure: Reconstruction MSE (×10⁻²) vs. target pitch (MIDI 60-72) for AE, CVAE and cAE

I We train the cAE on the octave endpoints and evaluate performance as shown above

I Appending f0 does not improve performance; it is in fact worse than the AE


Generation

I We are interested in 'synthesizing' audio

I We see how well the network can generate an instance of a desired pitch (which the network has not been trained on)

Figure: Sampling from the network - a latent point Z is decoded, conditioned on the desired f0

I We follow the procedure in [Blaauw and Bonada, 2016] - a random walk to sample points coherently from the latent space
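A sketch of this sampling procedure with the CVAE sketch from earlier; the walk length and step size are illustrative assumptions:

    import torch

    def generate_note(model, midi_pitch, n_frames=200, latent_dim=32, step=0.05):
        # Successive latent points differ by small Gaussian steps, so consecutive
        # decoded envelopes vary smoothly over the note.
        z = torch.zeros(1, latent_dim)
        f0 = torch.full((1, 1), float(midi_pitch))
        frames = []
        model.eval()
        with torch.no_grad():
            for _ in range(n_frames):
                z = z + step * torch.randn_like(z)           # coherent random walk
                cc = model.dec(torch.cat([z, f0], dim=1))    # decode with pitch condition
                frames.append(cc.squeeze(0))
        return torch.stack(frames)                           # (n_frames, 91) CC frames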


Generation

I We train on instances across the entire octave sans MIDI 65 and then generate MIDI 65

I On listening, we can hear that the generated audio still lacks the soft noisy sound of the violin bowing (audio examples 12, 13, 14)

I For a more realistic synthesis, we generate a violin note with vibrato (audio example 15)
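Because the decoder is conditioned on f0 frame by frame, a vibrato note can be generated by feeding a modulated pitch contour to the same sampling loop; the rate and depth below are illustrative values, not necessarily those used for the example:

    import numpy as np

    def vibrato_pitch_contour(midi_pitch=65, n_frames=200, frame_rate=100.0,
                              rate_hz=6.0, depth_semitones=0.3):
        # Frame-wise pitch contour (in MIDI) with sinusoidal vibrato; each value is
        # passed to the decoder as the per-frame conditioning variable.
        t = np.arange(n_frames) / frame_rate
        return midi_pitch + depth_semitones * np.sin(2.0 * np.pi * rate_hz * t)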


Putting it all together

I We explored autoencoder frameworks in generative models for audio synthesis of instrumental tones

I Our parametric representation decouples 'timbre' and 'pitch', thus relying on the network to model the inter-dependencies

I Pitch conditioning allows us to generate the learnt spectral envelope for that pitch, thus enabling us to vary the pitch contour continuously

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3136

References I

[Blaauw and Bonada 2016] Blaauw M and Bonada J (2016)Modeling and transforming speech using variational autoencodersIn Interspeech pages 1770ndash1774

[Caetano and Rodet 2012] Caetano M and Rodet X (2012)A source-filter model for musical instrument sound transformationIn 2012 IEEE International Conference on Acoustics Speech and Signal Processing(ICASSP) pages 137ndash140 IEEE

[Doersch 2016] Doersch C (2016)Tutorial on variational autoencodersarXiv preprint arXiv160605908

[Engel et al 2017] Engel J Resnick C Roberts A Dieleman S Norouzi MEck D and Simonyan K (2017)Neural audio synthesis of musical notes with wavenet autoencodersIn Proceedings of the 34th International Conference on Machine Learning-Volume70 pages 1068ndash1077 JMLR org

[Esling et al 2018] Esling P Bitton A et al (2018)Generative timbre spaces regularizing variational auto-encoders with perceptualmetricsarXiv preprint arXiv180508501

3236

References II

[Griffin and Lim 1984] Griffin D and Lim J (1984)Signal estimation from modified short-time fourier transformIEEE Transactions on Acoustics Speech and Signal Processing 32(2)236ndash243

[Hinton and Salakhutdinov 2006] Hinton G E and Salakhutdinov R R (2006)Reducing the dimensionality of data with neural networksscience 313(5786)504ndash507

[IMAI 1979] IMAI S (1979)Spectral envelope extraction by improved cepstrumIEICE 62217ndash228

[Kingma and Ba 2014] Kingma D P and Ba J (2014)Adam A method for stochastic optimizationarXiv preprint arXiv14126980

[Kingma and Welling 2013] Kingma D P and Welling M (2013)Auto-encoding variational bayesarXiv preprint arXiv13126114

[Oord et al 2016] Oord A v d Dieleman S Zen H Simonyan K Vinyals OGraves A Kalchbrenner N Senior A and Kavukcuoglu K (2016)Wavenet A generative model for raw audioarXiv preprint arXiv160903499

3336

References III

[Roche et al 2018] Roche F Hueber T Limier S and Girin L (2018)Autoencoders for music sound modeling a comparison of linear shallow deeprecurrent and variational modelsarXiv preprint arXiv180604096

[Roebel and Rodet 2005] Roebel A and Rodet X (2005)Efficient Spectral Envelope Estimation and its application to pitch shifting andenvelope preservationIn International Conference on Digital Audio Effects pages 30ndash35 Madrid Spaincote interne IRCAM Roebel05b

[Romani Picas et al 2015] Romani Picas O Parra Rodriguez H Dabiri DTokuda H Hariya W Oishi K and Serra X (2015)A real-time system for measuring sound goodness in instrumental soundsIn Audio Engineering Society Convention 138 Audio Engineering Society

[Sarroff and Casey 2014] Sarroff A M and Casey M A (2014)Musical audio synthesis using autoencoding neural netsIn ICMC

[Serra et al 1997] Serra X et al (1997)Musical sound modeling with sinusoids plus noiseMusical signal processing pages 91ndash122

3436

References IV

[Slawson 1981] Slawson W (1981)The color of sound a theoretical study in musical timbreMusic Theory Spectrum 3132ndash141

[Sohn et al 2015] Sohn K Lee H and Yan X (2015)Learning structured output representation using deep conditional generative models

In Advances in neural information processing systems pages 3483ndash3491

[Wyse 2018] Wyse L (2018)Real-valued parametric conditioning of an rnn for interactive sound synthesisarXiv preprint arXiv180510808

3536

Audio examples description I

1 Input MIDI 63 to Spectral Model

2 Spectral Model Reconstruction(trained on MIDI63)

3 Spectral Model Reconstruction(not trained on MIDI63)

4 Input MIDI 60 note to Parametric Model

5 Parametric Reconstruction of input note

6 Input MIDI 63 Note

7 Parametric AE reconstruction of input

8 Parametric CVAE reconstruction of input

9 Input MIDI 65 note(endpoint trained model)

10 Parametric AE reconstruction of input(endpoint trained model)

11 Parametric CVAE reconstruction of input(endpoint trained model)

12 CVAE Generated MIDI 65 Violin note

13 Similar MIDI 65 Violin note from dataset

3636

Audio examples description II

14 CVAE Reconstruction of the MIDI 65 violin note

15 CVAE Generated MIDI 65 Violin note with vibrato

  • Notes
      1. fdrm17
      2. fdrm16
      3. fdrm15
      4. fdrm14
      5. fdrm13
      6. fdrm12
      7. fdrm11
      8. fdrm10
      9. fdrm9
      10. fdrm8
      11. fdrm7
      12. fdrm6
      13. fdrm5
      14. fdrm3
      15. fdrm1
Page 51: A Variational Parametric Model for Audio Synthesiskrishnasubramani/data/ddp_stage… · analysis-synthesis based reconstruction which fails to model temporal evolution I[Engel et

1736

Generative Models

Why VAE over AE

Continuous latent space from which we can sample points(andsynthesize the corresponding audio)

Why CVAE over VAE

Conditioning on pitch =rArr Network captures dependenciesbetween the timbre and the pitch =rArr More accurate envelopegeneration + Pitch control

1836

Network Architecture

CCs

f0

En

cod

er

Z

Dec

od

er

CVAE

I Network input is CCs rarr MSE represents perceptually relevantdistance in terms of squared error between the input andreconstructed log magnitude spectral envelopes

I We train the network on frames from the train instances

I For evaluation MSE values calculated ahead is the averagereconstruction error across all the test instance frames

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

L prop MSE + βKLD

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

2036

Network Architecture

I Main hyperparameters -

2 Dimensionality of latent space - networks reconstruction ability

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

CVAEAE

Figure CVAE(β = 01) vs AE

I Steep fall initially flatter later Choose dimensionality = 32

2036

Network Architecture

I Main hyperparameters -

2 Dimensionality of latent space - networks reconstruction ability

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

CVAEAE

Figure CVAE(β = 01) vs AE

I Steep fall initially flatter later Choose dimensionality = 32

2036

Network Architecture

I Main hyperparameters -

2 Dimensionality of latent space - networks reconstruction ability

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

CVAEAE

Figure CVAE(β = 01) vs AE

I Steep fall initially flatter later Choose dimensionality = 32

2136

Network Architecture

I Network size [91 91 32 91 91]I 91 is the dimension of input CCs 32 is latent space

dimensionality

I Linear Fully Connected Layers + Leaky ReLU activations

I ADAM [Kingma and Ba 2014] with initial lr = 10minus3

I Training for 2000 epochs with batch size 512

2236

Experiments

I Two kinds of experiments to demonstrate networks capabilities

1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch

2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches

2236

Experiments

I Two kinds of experiments to demonstrate networks capabilities

1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch

2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches

2236

Experiments

I Two kinds of experiments to demonstrate networks capabilities

1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch

2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2436

Reconstruction

60 62 64 66 68 70 72

1

2

3

4

5middot10minus2

Pitch

MSE

AECVAE

I Both AE and CVAE reasonably reconstruct the target pitch1 6 2 7 3 8

I CVAE produces better reconstruction especially when thetarget pitch is far from the pitches available in the trainingdata(plot above) 4 9 5 10 6 11

I Conditioning helps to capture the pitch dependency of thespectral envelope more accurately

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton8)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton9)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton10)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton11)ocgs[i]state=false

1536

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton12)ocgs[i]state=false

1536

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton13)ocgs[i]state=false

1536

2536

Reconstruction

I To emulate the effect of pitch conditioning with an AE wetrain the AE by appending the pitch to the input CCs andreconstructing this appended input as shown below

f0

CCscAE

fprime

0

CCsprime

Figure lsquoConditionalrsquo AE(cAE)

I [Wyse 2018] followed a similar approach of appending theconditional variables to the input of his model

I the lsquocAErsquo is comparable to our proposed CVAE in that thenetwork might potentially learn something from f0

2636

Reconstruction

60 62 64 66 68 70 72

1

2

3

4

5middot10minus2

Pitch

MSE

AECVAEcAE

I We train the cAE on the octave endpoints and evaluateperformance as shown above

I Appending f0 does not improve performance It is infact worsethan AE

2736

Generation

I We are interested in lsquosynthesizingrsquo audio

I We see how well the network can generate an instance of adesired pitch(which the network has not been trained on)

Z

Dec

od

er

f0

Figure Sampling from the Network

I We follow the procedure in [Blaauw and Bonada 2016] - arandom walk to sample points coherently from the latent space

2836

Generation

I We train on instances across the entire octave sans MIDI 65and then generate MIDI 65

I On listening we can see that the generated audio still lacks thesoft noisy sound of the violin bowing

1 12 2 13 3 14

I For a more realistic synthesis we generate a violin note withvibrato 4 15

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton14)ocgs[i]state=false

2184

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton15)ocgs[i]state=false

2256

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton16)ocgs[i]state=false

2184

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton17)ocgs[i]state=false

2184

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3136

References I

[Blaauw and Bonada 2016] Blaauw M and Bonada J (2016)Modeling and transforming speech using variational autoencodersIn Interspeech pages 1770ndash1774

[Caetano and Rodet 2012] Caetano M and Rodet X (2012)A source-filter model for musical instrument sound transformationIn 2012 IEEE International Conference on Acoustics Speech and Signal Processing(ICASSP) pages 137ndash140 IEEE

[Doersch 2016] Doersch C (2016)Tutorial on variational autoencodersarXiv preprint arXiv160605908

[Engel et al 2017] Engel J Resnick C Roberts A Dieleman S Norouzi MEck D and Simonyan K (2017)Neural audio synthesis of musical notes with wavenet autoencodersIn Proceedings of the 34th International Conference on Machine Learning-Volume70 pages 1068ndash1077 JMLR org

[Esling et al 2018] Esling P Bitton A et al (2018)Generative timbre spaces regularizing variational auto-encoders with perceptualmetricsarXiv preprint arXiv180508501

3236

References II

[Griffin and Lim 1984] Griffin D and Lim J (1984)Signal estimation from modified short-time fourier transformIEEE Transactions on Acoustics Speech and Signal Processing 32(2)236ndash243

[Hinton and Salakhutdinov 2006] Hinton G E and Salakhutdinov R R (2006)Reducing the dimensionality of data with neural networksscience 313(5786)504ndash507

[IMAI 1979] IMAI S (1979)Spectral envelope extraction by improved cepstrumIEICE 62217ndash228

[Kingma and Ba 2014] Kingma D P and Ba J (2014)Adam A method for stochastic optimizationarXiv preprint arXiv14126980

[Kingma and Welling 2013] Kingma D P and Welling M (2013)Auto-encoding variational bayesarXiv preprint arXiv13126114

[Oord et al 2016] Oord A v d Dieleman S Zen H Simonyan K Vinyals OGraves A Kalchbrenner N Senior A and Kavukcuoglu K (2016)Wavenet A generative model for raw audioarXiv preprint arXiv160903499

3336

References III

[Roche et al 2018] Roche F Hueber T Limier S and Girin L (2018)Autoencoders for music sound modeling a comparison of linear shallow deeprecurrent and variational modelsarXiv preprint arXiv180604096

[Roebel and Rodet 2005] Roebel A and Rodet X (2005)Efficient Spectral Envelope Estimation and its application to pitch shifting andenvelope preservationIn International Conference on Digital Audio Effects pages 30ndash35 Madrid Spaincote interne IRCAM Roebel05b

[Romani Picas et al 2015] Romani Picas O Parra Rodriguez H Dabiri DTokuda H Hariya W Oishi K and Serra X (2015)A real-time system for measuring sound goodness in instrumental soundsIn Audio Engineering Society Convention 138 Audio Engineering Society

[Sarroff and Casey 2014] Sarroff A M and Casey M A (2014)Musical audio synthesis using autoencoding neural netsIn ICMC

[Serra et al 1997] Serra X et al (1997)Musical sound modeling with sinusoids plus noiseMusical signal processing pages 91ndash122

3436

References IV

[Slawson 1981] Slawson W (1981)The color of sound a theoretical study in musical timbreMusic Theory Spectrum 3132ndash141

[Sohn et al 2015] Sohn K Lee H and Yan X (2015)Learning structured output representation using deep conditional generative models

In Advances in neural information processing systems pages 3483ndash3491

[Wyse 2018] Wyse L (2018)Real-valued parametric conditioning of an rnn for interactive sound synthesisarXiv preprint arXiv180510808

3536

Audio examples description I

1 Input MIDI 63 to Spectral Model

2 Spectral Model Reconstruction(trained on MIDI63)

3 Spectral Model Reconstruction(not trained on MIDI63)

4 Input MIDI 60 note to Parametric Model

5 Parametric Reconstruction of input note

6 Input MIDI 63 Note

7 Parametric AE reconstruction of input

8 Parametric CVAE reconstruction of input

9 Input MIDI 65 note(endpoint trained model)

10 Parametric AE reconstruction of input(endpoint trained model)

11 Parametric CVAE reconstruction of input(endpoint trained model)

12 CVAE Generated MIDI 65 Violin note

13 Similar MIDI 65 Violin note from dataset

3636

Audio examples description II

14 CVAE Reconstruction of the MIDI 65 violin note

15 CVAE Generated MIDI 65 Violin note with vibrato

  • Notes
      1. fdrm17
      2. fdrm16
      3. fdrm15
      4. fdrm14
      5. fdrm13
      6. fdrm12
      7. fdrm11
      8. fdrm10
      9. fdrm9
      10. fdrm8
      11. fdrm7
      12. fdrm6
      13. fdrm5
      14. fdrm3
      15. fdrm1
Page 52: A Variational Parametric Model for Audio Synthesiskrishnasubramani/data/ddp_stage… · analysis-synthesis based reconstruction which fails to model temporal evolution I[Engel et

1836

Network Architecture

CCs

f0

En

cod

er

Z

Dec

od

er

CVAE

I Network input is CCs rarr MSE represents perceptually relevantdistance in terms of squared error between the input andreconstructed log magnitude spectral envelopes

I We train the network on frames from the train instances

I For evaluation MSE values calculated ahead is the averagereconstruction error across all the test instance frames

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

L prop MSE + βKLD

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

2036

Network Architecture

I Main hyperparameters -

2 Dimensionality of latent space - networks reconstruction ability

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

CVAEAE

Figure CVAE(β = 01) vs AE

I Steep fall initially flatter later Choose dimensionality = 32

2036

Network Architecture

I Main hyperparameters -

2 Dimensionality of latent space - networks reconstruction ability

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

CVAEAE

Figure CVAE(β = 01) vs AE

I Steep fall initially flatter later Choose dimensionality = 32

2036

Network Architecture

I Main hyperparameters -

2 Dimensionality of latent space - networks reconstruction ability

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

CVAEAE

Figure CVAE(β = 01) vs AE

I Steep fall initially flatter later Choose dimensionality = 32

2136

Network Architecture

I Network size [91 91 32 91 91]I 91 is the dimension of input CCs 32 is latent space

dimensionality

I Linear Fully Connected Layers + Leaky ReLU activations

I ADAM [Kingma and Ba 2014] with initial lr = 10minus3

I Training for 2000 epochs with batch size 512

2236

Experiments

I Two kinds of experiments to demonstrate networks capabilities

1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch

2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches

2236

Experiments

I Two kinds of experiments to demonstrate networks capabilities

1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch

2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches

2236

Experiments

I Two kinds of experiments to demonstrate networks capabilities

1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch

2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2436

Reconstruction

60 62 64 66 68 70 72

1

2

3

4

5middot10minus2

Pitch

MSE

AECVAE

I Both AE and CVAE reasonably reconstruct the target pitch1 6 2 7 3 8

I CVAE produces better reconstruction especially when thetarget pitch is far from the pitches available in the trainingdata(plot above) 4 9 5 10 6 11

I Conditioning helps to capture the pitch dependency of thespectral envelope more accurately

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton8)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton9)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton10)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton11)ocgs[i]state=false

1536

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton12)ocgs[i]state=false

1536

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton13)ocgs[i]state=false

1536

2536

Reconstruction

I To emulate the effect of pitch conditioning with an AE wetrain the AE by appending the pitch to the input CCs andreconstructing this appended input as shown below

f0

CCscAE

fprime

0

CCsprime

Figure lsquoConditionalrsquo AE(cAE)

I [Wyse 2018] followed a similar approach of appending theconditional variables to the input of his model

I the lsquocAErsquo is comparable to our proposed CVAE in that thenetwork might potentially learn something from f0

2636

Reconstruction

60 62 64 66 68 70 72

1

2

3

4

5middot10minus2

Pitch

MSE

AECVAEcAE

I We train the cAE on the octave endpoints and evaluateperformance as shown above

I Appending f0 does not improve performance It is infact worsethan AE

2736

Generation

I We are interested in lsquosynthesizingrsquo audio

I We see how well the network can generate an instance of adesired pitch(which the network has not been trained on)

Z

Dec

od

er

f0

Figure Sampling from the Network

I We follow the procedure in [Blaauw and Bonada 2016] - arandom walk to sample points coherently from the latent space

2836

Generation

I We train on instances across the entire octave sans MIDI 65and then generate MIDI 65

I On listening we can see that the generated audio still lacks thesoft noisy sound of the violin bowing

1 12 2 13 3 14

I For a more realistic synthesis we generate a violin note withvibrato 4 15

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton14)ocgs[i]state=false

2184

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton15)ocgs[i]state=false

2256

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton16)ocgs[i]state=false

2184

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton17)ocgs[i]state=false

2184

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

31/36

References I

[Blaauw and Bonada, 2016] Blaauw, M. and Bonada, J. (2016). Modeling and transforming speech using variational autoencoders. In Interspeech, pages 1770–1774.

[Caetano and Rodet, 2012] Caetano, M. and Rodet, X. (2012). A source-filter model for musical instrument sound transformation. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 137–140. IEEE.

[Doersch, 2016] Doersch, C. (2016). Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.

[Engel et al., 2017] Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D., and Simonyan, K. (2017). Neural audio synthesis of musical notes with WaveNet autoencoders. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1068–1077. JMLR.org.

[Esling et al., 2018] Esling, P., Bitton, A., et al. (2018). Generative timbre spaces: regularizing variational auto-encoders with perceptual metrics. arXiv preprint arXiv:1805.08501.

32/36

References II

[Griffin and Lim, 1984] Griffin, D. and Lim, J. (1984). Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236–243.

[Hinton and Salakhutdinov, 2006] Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507.

[Imai, 1979] Imai, S. (1979). Spectral envelope extraction by improved cepstrum. IEICE, 62:217–228.

[Kingma and Ba, 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

[Kingma and Welling, 2013] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

[Oord et al., 2016] Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.

33/36

References III

[Roche et al., 2018] Roche, F., Hueber, T., Limier, S., and Girin, L. (2018). Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models. arXiv preprint arXiv:1806.04096.

[Roebel and Rodet, 2005] Roebel, A. and Rodet, X. (2005). Efficient spectral envelope estimation and its application to pitch shifting and envelope preservation. In International Conference on Digital Audio Effects, pages 30–35, Madrid, Spain.

[Romani Picas et al., 2015] Romani Picas, O., Parra Rodriguez, H., Dabiri, D., Tokuda, H., Hariya, W., Oishi, K., and Serra, X. (2015). A real-time system for measuring sound goodness in instrumental sounds. In Audio Engineering Society Convention 138. Audio Engineering Society.

[Sarroff and Casey, 2014] Sarroff, A. M. and Casey, M. A. (2014). Musical audio synthesis using autoencoding neural nets. In ICMC.

[Serra et al., 1997] Serra, X. et al. (1997). Musical sound modeling with sinusoids plus noise. Musical Signal Processing, pages 91–122.

34/36

References IV

[Slawson, 1981] Slawson, W. (1981). The color of sound: a theoretical study in musical timbre. Music Theory Spectrum, 3:132–141.

[Sohn et al., 2015] Sohn, K., Lee, H., and Yan, X. (2015). Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483–3491.

[Wyse, 2018] Wyse, L. (2018). Real-valued parametric conditioning of an RNN for interactive sound synthesis. arXiv preprint arXiv:1805.10808.

35/36

Audio examples description I

1 Input MIDI 63 to Spectral Model
2 Spectral Model reconstruction (trained on MIDI 63)
3 Spectral Model reconstruction (not trained on MIDI 63)
4 Input MIDI 60 note to Parametric Model
5 Parametric reconstruction of input note
6 Input MIDI 63 note
7 Parametric AE reconstruction of input
8 Parametric CVAE reconstruction of input
9 Input MIDI 65 note (endpoint-trained model)
10 Parametric AE reconstruction of input (endpoint-trained model)
11 Parametric CVAE reconstruction of input (endpoint-trained model)
12 CVAE-generated MIDI 65 violin note
13 Similar MIDI 65 violin note from dataset

36/36

Audio examples description II

14 CVAE reconstruction of the MIDI 65 violin note
15 CVAE-generated MIDI 65 violin note with vibrato

Page 53: A Variational Parametric Model for Audio Synthesiskrishnasubramani/data/ddp_stage… · analysis-synthesis based reconstruction which fails to model temporal evolution I[Engel et

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

L prop MSE + βKLD

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

2036

Network Architecture

I Main hyperparameters -

2 Dimensionality of latent space - networks reconstruction ability

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

CVAEAE

Figure CVAE(β = 01) vs AE

I Steep fall initially flatter later Choose dimensionality = 32

2036

Network Architecture

I Main hyperparameters -

2 Dimensionality of latent space - networks reconstruction ability

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

CVAEAE

Figure CVAE(β = 01) vs AE

I Steep fall initially flatter later Choose dimensionality = 32

2036

Network Architecture

I Main hyperparameters -

2 Dimensionality of latent space - networks reconstruction ability

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

CVAEAE

Figure CVAE(β = 01) vs AE

I Steep fall initially flatter later Choose dimensionality = 32

2136

Network Architecture

I Network size [91 91 32 91 91]I 91 is the dimension of input CCs 32 is latent space

dimensionality

I Linear Fully Connected Layers + Leaky ReLU activations

I ADAM [Kingma and Ba 2014] with initial lr = 10minus3

I Training for 2000 epochs with batch size 512

2236

Experiments

I Two kinds of experiments to demonstrate networks capabilities

1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch

2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches

2236

Experiments

I Two kinds of experiments to demonstrate networks capabilities

1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch

2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches

2236

Experiments

I Two kinds of experiments to demonstrate networks capabilities

1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch

2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2436

Reconstruction

60 62 64 66 68 70 72

1

2

3

4

5middot10minus2

Pitch

MSE

AECVAE

I Both AE and CVAE reasonably reconstruct the target pitch1 6 2 7 3 8

I CVAE produces better reconstruction especially when thetarget pitch is far from the pitches available in the trainingdata(plot above) 4 9 5 10 6 11

I Conditioning helps to capture the pitch dependency of thespectral envelope more accurately

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton8)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton9)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton10)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton11)ocgs[i]state=false

1536

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton12)ocgs[i]state=false

1536

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton13)ocgs[i]state=false

1536

2536

Reconstruction

I To emulate the effect of pitch conditioning with an AE wetrain the AE by appending the pitch to the input CCs andreconstructing this appended input as shown below

f0

CCscAE

fprime

0

CCsprime

Figure lsquoConditionalrsquo AE(cAE)

I [Wyse 2018] followed a similar approach of appending theconditional variables to the input of his model

I the lsquocAErsquo is comparable to our proposed CVAE in that thenetwork might potentially learn something from f0

2636

Reconstruction

60 62 64 66 68 70 72

1

2

3

4

5middot10minus2

Pitch

MSE

AECVAEcAE

I We train the cAE on the octave endpoints and evaluateperformance as shown above

I Appending f0 does not improve performance It is infact worsethan AE

2736

Generation

I We are interested in lsquosynthesizingrsquo audio

I We see how well the network can generate an instance of adesired pitch(which the network has not been trained on)

Z

Dec

od

er

f0

Figure Sampling from the Network

I We follow the procedure in [Blaauw and Bonada 2016] - arandom walk to sample points coherently from the latent space

2836

Generation

I We train on instances across the entire octave sans MIDI 65and then generate MIDI 65

I On listening we can see that the generated audio still lacks thesoft noisy sound of the violin bowing

1 12 2 13 3 14

I For a more realistic synthesis we generate a violin note withvibrato 4 15

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton14)ocgs[i]state=false

2184

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton15)ocgs[i]state=false

2256

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton16)ocgs[i]state=false

2184

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton17)ocgs[i]state=false

2184

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3136

References I

[Blaauw and Bonada 2016] Blaauw M and Bonada J (2016)Modeling and transforming speech using variational autoencodersIn Interspeech pages 1770ndash1774

[Caetano and Rodet 2012] Caetano M and Rodet X (2012)A source-filter model for musical instrument sound transformationIn 2012 IEEE International Conference on Acoustics Speech and Signal Processing(ICASSP) pages 137ndash140 IEEE

[Doersch 2016] Doersch C (2016)Tutorial on variational autoencodersarXiv preprint arXiv160605908

[Engel et al 2017] Engel J Resnick C Roberts A Dieleman S Norouzi MEck D and Simonyan K (2017)Neural audio synthesis of musical notes with wavenet autoencodersIn Proceedings of the 34th International Conference on Machine Learning-Volume70 pages 1068ndash1077 JMLR org

[Esling et al 2018] Esling P Bitton A et al (2018)Generative timbre spaces regularizing variational auto-encoders with perceptualmetricsarXiv preprint arXiv180508501

3236

References II

[Griffin and Lim 1984] Griffin D and Lim J (1984)Signal estimation from modified short-time fourier transformIEEE Transactions on Acoustics Speech and Signal Processing 32(2)236ndash243

[Hinton and Salakhutdinov 2006] Hinton G E and Salakhutdinov R R (2006)Reducing the dimensionality of data with neural networksscience 313(5786)504ndash507

[IMAI 1979] IMAI S (1979)Spectral envelope extraction by improved cepstrumIEICE 62217ndash228

[Kingma and Ba 2014] Kingma D P and Ba J (2014)Adam A method for stochastic optimizationarXiv preprint arXiv14126980

[Kingma and Welling 2013] Kingma D P and Welling M (2013)Auto-encoding variational bayesarXiv preprint arXiv13126114

[Oord et al 2016] Oord A v d Dieleman S Zen H Simonyan K Vinyals OGraves A Kalchbrenner N Senior A and Kavukcuoglu K (2016)Wavenet A generative model for raw audioarXiv preprint arXiv160903499

3336

References III

[Roche et al 2018] Roche F Hueber T Limier S and Girin L (2018)Autoencoders for music sound modeling a comparison of linear shallow deeprecurrent and variational modelsarXiv preprint arXiv180604096

[Roebel and Rodet 2005] Roebel A and Rodet X (2005)Efficient Spectral Envelope Estimation and its application to pitch shifting andenvelope preservationIn International Conference on Digital Audio Effects pages 30ndash35 Madrid Spaincote interne IRCAM Roebel05b

[Romani Picas et al 2015] Romani Picas O Parra Rodriguez H Dabiri DTokuda H Hariya W Oishi K and Serra X (2015)A real-time system for measuring sound goodness in instrumental soundsIn Audio Engineering Society Convention 138 Audio Engineering Society

[Sarroff and Casey 2014] Sarroff A M and Casey M A (2014)Musical audio synthesis using autoencoding neural netsIn ICMC

[Serra et al 1997] Serra X et al (1997)Musical sound modeling with sinusoids plus noiseMusical signal processing pages 91ndash122

3436

References IV

[Slawson 1981] Slawson W (1981)The color of sound a theoretical study in musical timbreMusic Theory Spectrum 3132ndash141

[Sohn et al 2015] Sohn K Lee H and Yan X (2015)Learning structured output representation using deep conditional generative models

In Advances in neural information processing systems pages 3483ndash3491

[Wyse 2018] Wyse L (2018)Real-valued parametric conditioning of an rnn for interactive sound synthesisarXiv preprint arXiv180510808

3536

Audio examples description I

1 Input MIDI 63 to Spectral Model

2 Spectral Model Reconstruction(trained on MIDI63)

3 Spectral Model Reconstruction(not trained on MIDI63)

4 Input MIDI 60 note to Parametric Model

5 Parametric Reconstruction of input note

6 Input MIDI 63 Note

7 Parametric AE reconstruction of input

8 Parametric CVAE reconstruction of input

9 Input MIDI 65 note(endpoint trained model)

10 Parametric AE reconstruction of input(endpoint trained model)

11 Parametric CVAE reconstruction of input(endpoint trained model)

12 CVAE Generated MIDI 65 Violin note

13 Similar MIDI 65 Violin note from dataset

3636

Audio examples description II

14 CVAE Reconstruction of the MIDI 65 violin note

15 CVAE Generated MIDI 65 Violin note with vibrato

  • Notes
      1. fdrm17
      2. fdrm16
      3. fdrm15
      4. fdrm14
      5. fdrm13
      6. fdrm12
      7. fdrm11
      8. fdrm10
      9. fdrm9
      10. fdrm8
      11. fdrm7
      12. fdrm6
      13. fdrm5
      14. fdrm3
      15. fdrm1
Page 54: A Variational Parametric Model for Audio Synthesiskrishnasubramani/data/ddp_stage… · analysis-synthesis based reconstruction which fails to model temporal evolution I[Engel et

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

L prop MSE + βKLD

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

2036

Network Architecture

I Main hyperparameters -

2 Dimensionality of latent space - networks reconstruction ability

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

CVAEAE

Figure CVAE(β = 01) vs AE

I Steep fall initially flatter later Choose dimensionality = 32

2036

Network Architecture

I Main hyperparameters -

2 Dimensionality of latent space - networks reconstruction ability

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

CVAEAE

Figure CVAE(β = 01) vs AE

I Steep fall initially flatter later Choose dimensionality = 32

2036

Network Architecture

I Main hyperparameters -

2 Dimensionality of latent space - networks reconstruction ability

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

CVAEAE

Figure CVAE(β = 01) vs AE

I Steep fall initially flatter later Choose dimensionality = 32

2136

Network Architecture

I Network size [91 91 32 91 91]I 91 is the dimension of input CCs 32 is latent space

dimensionality

I Linear Fully Connected Layers + Leaky ReLU activations

I ADAM [Kingma and Ba 2014] with initial lr = 10minus3

I Training for 2000 epochs with batch size 512

2236

Experiments

I Two kinds of experiments to demonstrate networks capabilities

1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch

2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches

2236

Experiments

I Two kinds of experiments to demonstrate networks capabilities

1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch

2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches

2236

Experiments

I Two kinds of experiments to demonstrate networks capabilities

1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch

2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2436

Reconstruction

60 62 64 66 68 70 72

1

2

3

4

5middot10minus2

Pitch

MSE

AECVAE

I Both AE and CVAE reasonably reconstruct the target pitch1 6 2 7 3 8

I CVAE produces better reconstruction especially when thetarget pitch is far from the pitches available in the trainingdata(plot above) 4 9 5 10 6 11

I Conditioning helps to capture the pitch dependency of thespectral envelope more accurately

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton8)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton9)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton10)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton11)ocgs[i]state=false

1536

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton12)ocgs[i]state=false

1536

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton13)ocgs[i]state=false

1536

2536

Reconstruction

I To emulate the effect of pitch conditioning with an AE wetrain the AE by appending the pitch to the input CCs andreconstructing this appended input as shown below

f0

CCscAE

fprime

0

CCsprime

Figure lsquoConditionalrsquo AE(cAE)

I [Wyse 2018] followed a similar approach of appending theconditional variables to the input of his model

I the lsquocAErsquo is comparable to our proposed CVAE in that thenetwork might potentially learn something from f0

2636

Reconstruction

60 62 64 66 68 70 72

1

2

3

4

5middot10minus2

Pitch

MSE

AECVAEcAE

I We train the cAE on the octave endpoints and evaluateperformance as shown above

I Appending f0 does not improve performance It is infact worsethan AE

2736

Generation

I We are interested in lsquosynthesizingrsquo audio

I We see how well the network can generate an instance of adesired pitch(which the network has not been trained on)

Z

Dec

od

er

f0

Figure Sampling from the Network

I We follow the procedure in [Blaauw and Bonada 2016] - arandom walk to sample points coherently from the latent space

2836

Generation

I We train on instances across the entire octave sans MIDI 65and then generate MIDI 65

I On listening we can see that the generated audio still lacks thesoft noisy sound of the violin bowing

1 12 2 13 3 14

I For a more realistic synthesis we generate a violin note withvibrato 4 15

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton14)ocgs[i]state=false

2184

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton15)ocgs[i]state=false

2256

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton16)ocgs[i]state=false

2184

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton17)ocgs[i]state=false

2184

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3136

References I

[Blaauw and Bonada 2016] Blaauw M and Bonada J (2016)Modeling and transforming speech using variational autoencodersIn Interspeech pages 1770ndash1774

[Caetano and Rodet 2012] Caetano M and Rodet X (2012)A source-filter model for musical instrument sound transformationIn 2012 IEEE International Conference on Acoustics Speech and Signal Processing(ICASSP) pages 137ndash140 IEEE

[Doersch 2016] Doersch C (2016)Tutorial on variational autoencodersarXiv preprint arXiv160605908

[Engel et al 2017] Engel J Resnick C Roberts A Dieleman S Norouzi MEck D and Simonyan K (2017)Neural audio synthesis of musical notes with wavenet autoencodersIn Proceedings of the 34th International Conference on Machine Learning-Volume70 pages 1068ndash1077 JMLR org

[Esling et al 2018] Esling P Bitton A et al (2018)Generative timbre spaces regularizing variational auto-encoders with perceptualmetricsarXiv preprint arXiv180508501

3236

References II

[Griffin and Lim 1984] Griffin D and Lim J (1984)Signal estimation from modified short-time fourier transformIEEE Transactions on Acoustics Speech and Signal Processing 32(2)236ndash243

[Hinton and Salakhutdinov 2006] Hinton G E and Salakhutdinov R R (2006)Reducing the dimensionality of data with neural networksscience 313(5786)504ndash507

[IMAI 1979] IMAI S (1979)Spectral envelope extraction by improved cepstrumIEICE 62217ndash228

[Kingma and Ba 2014] Kingma D P and Ba J (2014)Adam A method for stochastic optimizationarXiv preprint arXiv14126980

[Kingma and Welling 2013] Kingma D P and Welling M (2013)Auto-encoding variational bayesarXiv preprint arXiv13126114

[Oord et al 2016] Oord A v d Dieleman S Zen H Simonyan K Vinyals OGraves A Kalchbrenner N Senior A and Kavukcuoglu K (2016)Wavenet A generative model for raw audioarXiv preprint arXiv160903499

3336

References III

[Roche et al 2018] Roche F Hueber T Limier S and Girin L (2018)Autoencoders for music sound modeling a comparison of linear shallow deeprecurrent and variational modelsarXiv preprint arXiv180604096

[Roebel and Rodet 2005] Roebel A and Rodet X (2005)Efficient Spectral Envelope Estimation and its application to pitch shifting andenvelope preservationIn International Conference on Digital Audio Effects pages 30ndash35 Madrid Spaincote interne IRCAM Roebel05b

[Romani Picas et al 2015] Romani Picas O Parra Rodriguez H Dabiri DTokuda H Hariya W Oishi K and Serra X (2015)A real-time system for measuring sound goodness in instrumental soundsIn Audio Engineering Society Convention 138 Audio Engineering Society

[Sarroff and Casey 2014] Sarroff A M and Casey M A (2014)Musical audio synthesis using autoencoding neural netsIn ICMC

[Serra et al 1997] Serra X et al (1997)Musical sound modeling with sinusoids plus noiseMusical signal processing pages 91ndash122

3436

References IV

[Slawson 1981] Slawson W (1981)The color of sound a theoretical study in musical timbreMusic Theory Spectrum 3132ndash141

[Sohn et al 2015] Sohn K Lee H and Yan X (2015)Learning structured output representation using deep conditional generative models

In Advances in neural information processing systems pages 3483ndash3491

[Wyse 2018] Wyse L (2018)Real-valued parametric conditioning of an rnn for interactive sound synthesisarXiv preprint arXiv180510808

3536

Audio examples description I

1 Input MIDI 63 to Spectral Model

2 Spectral Model Reconstruction(trained on MIDI63)

3 Spectral Model Reconstruction(not trained on MIDI63)

4 Input MIDI 60 note to Parametric Model

5 Parametric Reconstruction of input note

6 Input MIDI 63 Note

7 Parametric AE reconstruction of input

8 Parametric CVAE reconstruction of input

9 Input MIDI 65 note(endpoint trained model)

10 Parametric AE reconstruction of input(endpoint trained model)

11 Parametric CVAE reconstruction of input(endpoint trained model)

12 CVAE Generated MIDI 65 Violin note

13 Similar MIDI 65 Violin note from dataset

3636

Audio examples description II

14 CVAE Reconstruction of the MIDI 65 violin note

15 CVAE Generated MIDI 65 Violin note with vibrato

  • Notes
      1. fdrm17
      2. fdrm16
      3. fdrm15
      4. fdrm14
      5. fdrm13
      6. fdrm12
      7. fdrm11
      8. fdrm10
      9. fdrm9
      10. fdrm8
      11. fdrm7
      12. fdrm6
      13. fdrm5
      14. fdrm3
      15. fdrm1
Page 55: A Variational Parametric Model for Audio Synthesiskrishnasubramani/data/ddp_stage… · analysis-synthesis based reconstruction which fails to model temporal evolution I[Engel et

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

1936

Network ArchitectureI Main hyperparameters -

1 β - Controls relative weighting between reconstruction andprior enforcement

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

β = 001β = 01β = 1

Figure CVAE varying β

I Tradeoff between both terms choose β = 01

2036

Network Architecture

I Main hyperparameters -

2 Dimensionality of latent space - networks reconstruction ability

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

CVAEAE

Figure CVAE(β = 01) vs AE

I Steep fall initially flatter later Choose dimensionality = 32

2036

Network Architecture

I Main hyperparameters -

2 Dimensionality of latent space - networks reconstruction ability

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

CVAEAE

Figure CVAE(β = 01) vs AE

I Steep fall initially flatter later Choose dimensionality = 32

2036

Network Architecture

I Main hyperparameters -

2 Dimensionality of latent space - networks reconstruction ability

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

CVAEAE

Figure CVAE(β = 01) vs AE

I Steep fall initially flatter later Choose dimensionality = 32

2136

Network Architecture

I Network size [91 91 32 91 91]I 91 is the dimension of input CCs 32 is latent space

dimensionality

I Linear Fully Connected Layers + Leaky ReLU activations

I ADAM [Kingma and Ba 2014] with initial lr = 10minus3

I Training for 2000 epochs with batch size 512

2236

Experiments

I Two kinds of experiments to demonstrate networks capabilities

1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch

2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches

2236

Experiments

I Two kinds of experiments to demonstrate networks capabilities

1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch

2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches

2236

Experiments

I Two kinds of experiments to demonstrate networks capabilities

1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch

2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2436

Reconstruction

60 62 64 66 68 70 72

1

2

3

4

5middot10minus2

Pitch

MSE

AECVAE

I Both AE and CVAE reasonably reconstruct the target pitch1 6 2 7 3 8

I CVAE produces better reconstruction especially when thetarget pitch is far from the pitches available in the trainingdata(plot above) 4 9 5 10 6 11

I Conditioning helps to capture the pitch dependency of thespectral envelope more accurately

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton8)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton9)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton10)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton11)ocgs[i]state=false

1536

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton12)ocgs[i]state=false

1536

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton13)ocgs[i]state=false

1536

2536

Reconstruction

I To emulate the effect of pitch conditioning with an AE wetrain the AE by appending the pitch to the input CCs andreconstructing this appended input as shown below

f0

CCscAE

fprime

0

CCsprime

Figure lsquoConditionalrsquo AE(cAE)

I [Wyse 2018] followed a similar approach of appending theconditional variables to the input of his model

I the lsquocAErsquo is comparable to our proposed CVAE in that thenetwork might potentially learn something from f0

2636

Reconstruction

60 62 64 66 68 70 72

1

2

3

4

5middot10minus2

Pitch

MSE

AECVAEcAE

I We train the cAE on the octave endpoints and evaluateperformance as shown above

I Appending f0 does not improve performance It is infact worsethan AE

2736

Generation

I We are interested in lsquosynthesizingrsquo audio

I We see how well the network can generate an instance of adesired pitch(which the network has not been trained on)

Z

Dec

od

er

f0

Figure Sampling from the Network

I We follow the procedure in [Blaauw and Bonada 2016] - arandom walk to sample points coherently from the latent space

2836

Generation

I We train on instances across the entire octave sans MIDI 65and then generate MIDI 65

I On listening we can see that the generated audio still lacks thesoft noisy sound of the violin bowing

1 12 2 13 3 14

I For a more realistic synthesis we generate a violin note withvibrato 4 15

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton14)ocgs[i]state=false

2184

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton15)ocgs[i]state=false

2256

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton16)ocgs[i]state=false

2184

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton17)ocgs[i]state=false

2184

29/36

Putting it all together

I We explored autoencoder frameworks in generative models for audio synthesis of instrumental tones

I Our parametric representation decouples 'timbre' and 'pitch', thus relying on the network to model the inter-dependencies (a rough sketch of the envelope representation follows after this slide)

I Pitch conditioning allows us to generate the learnt spectral envelope for that pitch, thus enabling us to vary the pitch contour continuously
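As a rough sketch of the parametric (spectral envelope) representation behind these claims: the CCs can be obtained by low-quefrency cepstral liftering of a frame's log-magnitude spectrum. The full TAE method of [IMAI 1979] and [Roebel and Rodet 2005] iterates a similar step so that the envelope hugs the harmonic peaks; the version below is only a one-pass approximation, with the 91-coefficient count taken from the network input size quoted in the deck:

```python
import numpy as np

def cepstral_envelope(frame, n_coeffs=91, n_fft=1024):
    """One-pass cepstral approximation of a frame's spectral envelope.

    Returns the truncated cepstral coefficients (the CCs fed to the network)
    and the smooth log-magnitude envelope they encode.
    """
    spectrum = np.fft.fft(frame * np.hanning(len(frame)), n_fft)
    log_mag = np.log(np.abs(spectrum) + 1e-9)
    cepstrum = np.fft.ifft(log_mag).real          # real cepstrum (symmetric)
    ccs = cepstrum[:n_coeffs]
    lifter = np.zeros(n_fft)
    lifter[:n_coeffs] = 1.0                       # keep the low quefrencies...
    lifter[-(n_coeffs - 1):] = 1.0                # ...and their mirrored copies
    log_envelope = np.fft.fft(cepstrum * lifter).real[: n_fft // 2 + 1]
    return ccs, log_envelope
```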


30/36

Putting it all together

I Our model is still far from perfect, however, and we have to improve it further:

1 An immediate step will be to model the residual of the input audio; this will help in making the generated audio sound more natural

2 We have not taken dynamics into account; a more complete system will involve capturing spectral envelope (timbral) dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests, involving synthesis of larger pitch movements or melodic elements from Indian Classical music

I Our contributions:

1 To the best of our knowledge, we have not come across any work using a parametric model for musical tones in the neural synthesis framework, especially one exploiting the conditioning function of the CVAE

2 The TAE method has no known Python implementation, so we plan to make our code open-source to the MIR community for research


31/36

References I

[Blaauw and Bonada 2016] Blaauw, M. and Bonada, J. (2016). Modeling and transforming speech using variational autoencoders. In Interspeech, pages 1770-1774.

[Caetano and Rodet 2012] Caetano, M. and Rodet, X. (2012). A source-filter model for musical instrument sound transformation. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 137-140. IEEE.

[Doersch 2016] Doersch, C. (2016). Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.

[Engel et al 2017] Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D., and Simonyan, K. (2017). Neural audio synthesis of musical notes with WaveNet autoencoders. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1068-1077. JMLR.org.

[Esling et al 2018] Esling, P., Bitton, A., et al. (2018). Generative timbre spaces: regularizing variational auto-encoders with perceptual metrics. arXiv preprint arXiv:1805.08501.

32/36

References II

[Griffin and Lim 1984] Griffin, D. and Lim, J. (1984). Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236-243.

[Hinton and Salakhutdinov 2006] Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504-507.

[IMAI 1979] Imai, S. (1979). Spectral envelope extraction by improved cepstrum. IEICE, 62:217-228.

[Kingma and Ba 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

[Kingma and Welling 2013] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

[Oord et al 2016] Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.

33/36

References III

[Roche et al 2018] Roche, F., Hueber, T., Limier, S., and Girin, L. (2018). Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models. arXiv preprint arXiv:1806.04096.

[Roebel and Rodet 2005] Roebel, A. and Rodet, X. (2005). Efficient spectral envelope estimation and its application to pitch shifting and envelope preservation. In International Conference on Digital Audio Effects, pages 30-35, Madrid, Spain.

[Romani Picas et al 2015] Romani Picas, O., Parra Rodriguez, H., Dabiri, D., Tokuda, H., Hariya, W., Oishi, K., and Serra, X. (2015). A real-time system for measuring sound goodness in instrumental sounds. In Audio Engineering Society Convention 138. Audio Engineering Society.

[Sarroff and Casey 2014] Sarroff, A. M. and Casey, M. A. (2014). Musical audio synthesis using autoencoding neural nets. In ICMC.

[Serra et al 1997] Serra, X. et al. (1997). Musical sound modeling with sinusoids plus noise. Musical Signal Processing, pages 91-122.

34/36

References IV

[Slawson 1981] Slawson, W. (1981). The color of sound: a theoretical study in musical timbre. Music Theory Spectrum, 3:132-141.

[Sohn et al 2015] Sohn, K., Lee, H., and Yan, X. (2015). Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483-3491.

[Wyse 2018] Wyse, L. (2018). Real-valued parametric conditioning of an RNN for interactive sound synthesis. arXiv preprint arXiv:1805.10808.

35/36

Audio examples description I

1 Input MIDI 63 to Spectral Model

2 Spectral Model reconstruction (trained on MIDI 63)

3 Spectral Model reconstruction (not trained on MIDI 63)

4 Input MIDI 60 note to Parametric Model

5 Parametric reconstruction of input note

6 Input MIDI 63 note

7 Parametric AE reconstruction of input

8 Parametric CVAE reconstruction of input

9 Input MIDI 65 note (endpoint-trained model)

10 Parametric AE reconstruction of input (endpoint-trained model)

11 Parametric CVAE reconstruction of input (endpoint-trained model)

12 CVAE-generated MIDI 65 violin note

13 Similar MIDI 65 violin note from dataset

36/36

Audio examples description II

14 CVAE reconstruction of the MIDI 65 violin note

15 CVAE-generated MIDI 65 violin note with vibrato

2436

Reconstruction

60 62 64 66 68 70 72

1

2

3

4

5middot10minus2

Pitch

MSE

AECVAE

I Both AE and CVAE reasonably reconstruct the target pitch1 6 2 7 3 8

I CVAE produces better reconstruction especially when thetarget pitch is far from the pitches available in the trainingdata(plot above) 4 9 5 10 6 11

I Conditioning helps to capture the pitch dependency of thespectral envelope more accurately

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton8)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton9)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton10)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton11)ocgs[i]state=false

1536

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton12)ocgs[i]state=false

1536

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton13)ocgs[i]state=false

1536

2536

Reconstruction

I To emulate the effect of pitch conditioning with an AE wetrain the AE by appending the pitch to the input CCs andreconstructing this appended input as shown below

f0

CCscAE

fprime

0

CCsprime

Figure lsquoConditionalrsquo AE(cAE)

I [Wyse 2018] followed a similar approach of appending theconditional variables to the input of his model

I the lsquocAErsquo is comparable to our proposed CVAE in that thenetwork might potentially learn something from f0

2636

Reconstruction

60 62 64 66 68 70 72

1

2

3

4

5middot10minus2

Pitch

MSE

AECVAEcAE

I We train the cAE on the octave endpoints and evaluateperformance as shown above

I Appending f0 does not improve performance It is infact worsethan AE

2736

Generation

I We are interested in lsquosynthesizingrsquo audio

I We see how well the network can generate an instance of adesired pitch(which the network has not been trained on)

Z

Dec

od

er

f0

Figure Sampling from the Network

I We follow the procedure in [Blaauw and Bonada 2016] - arandom walk to sample points coherently from the latent space

2836

Generation

I We train on instances across the entire octave sans MIDI 65and then generate MIDI 65

I On listening we can see that the generated audio still lacks thesoft noisy sound of the violin bowing

1 12 2 13 3 14

I For a more realistic synthesis we generate a violin note withvibrato 4 15

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton14)ocgs[i]state=false

2184

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton15)ocgs[i]state=false

2256

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton16)ocgs[i]state=false

2184

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton17)ocgs[i]state=false

2184

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3136

References I

[Blaauw and Bonada 2016] Blaauw M and Bonada J (2016)Modeling and transforming speech using variational autoencodersIn Interspeech pages 1770ndash1774

[Caetano and Rodet 2012] Caetano M and Rodet X (2012)A source-filter model for musical instrument sound transformationIn 2012 IEEE International Conference on Acoustics Speech and Signal Processing(ICASSP) pages 137ndash140 IEEE

[Doersch 2016] Doersch C (2016)Tutorial on variational autoencodersarXiv preprint arXiv160605908

[Engel et al 2017] Engel J Resnick C Roberts A Dieleman S Norouzi MEck D and Simonyan K (2017)Neural audio synthesis of musical notes with wavenet autoencodersIn Proceedings of the 34th International Conference on Machine Learning-Volume70 pages 1068ndash1077 JMLR org

[Esling et al 2018] Esling P Bitton A et al (2018)Generative timbre spaces regularizing variational auto-encoders with perceptualmetricsarXiv preprint arXiv180508501

3236

References II

[Griffin and Lim 1984] Griffin D and Lim J (1984)Signal estimation from modified short-time fourier transformIEEE Transactions on Acoustics Speech and Signal Processing 32(2)236ndash243

[Hinton and Salakhutdinov 2006] Hinton G E and Salakhutdinov R R (2006)Reducing the dimensionality of data with neural networksscience 313(5786)504ndash507

[IMAI 1979] IMAI S (1979)Spectral envelope extraction by improved cepstrumIEICE 62217ndash228

[Kingma and Ba 2014] Kingma D P and Ba J (2014)Adam A method for stochastic optimizationarXiv preprint arXiv14126980

[Kingma and Welling 2013] Kingma D P and Welling M (2013)Auto-encoding variational bayesarXiv preprint arXiv13126114

[Oord et al 2016] Oord A v d Dieleman S Zen H Simonyan K Vinyals OGraves A Kalchbrenner N Senior A and Kavukcuoglu K (2016)Wavenet A generative model for raw audioarXiv preprint arXiv160903499

3336

References III

[Roche et al 2018] Roche F Hueber T Limier S and Girin L (2018)Autoencoders for music sound modeling a comparison of linear shallow deeprecurrent and variational modelsarXiv preprint arXiv180604096

[Roebel and Rodet 2005] Roebel A and Rodet X (2005)Efficient Spectral Envelope Estimation and its application to pitch shifting andenvelope preservationIn International Conference on Digital Audio Effects pages 30ndash35 Madrid Spaincote interne IRCAM Roebel05b

[Romani Picas et al 2015] Romani Picas O Parra Rodriguez H Dabiri DTokuda H Hariya W Oishi K and Serra X (2015)A real-time system for measuring sound goodness in instrumental soundsIn Audio Engineering Society Convention 138 Audio Engineering Society

[Sarroff and Casey 2014] Sarroff A M and Casey M A (2014)Musical audio synthesis using autoencoding neural netsIn ICMC

[Serra et al 1997] Serra X et al (1997)Musical sound modeling with sinusoids plus noiseMusical signal processing pages 91ndash122

3436

References IV

[Slawson 1981] Slawson W (1981)The color of sound a theoretical study in musical timbreMusic Theory Spectrum 3132ndash141

[Sohn et al 2015] Sohn K Lee H and Yan X (2015)Learning structured output representation using deep conditional generative models

In Advances in neural information processing systems pages 3483ndash3491

[Wyse 2018] Wyse L (2018)Real-valued parametric conditioning of an rnn for interactive sound synthesisarXiv preprint arXiv180510808

3536

Audio examples description I

1 Input MIDI 63 to Spectral Model

2 Spectral Model Reconstruction(trained on MIDI63)

3 Spectral Model Reconstruction(not trained on MIDI63)

4 Input MIDI 60 note to Parametric Model

5 Parametric Reconstruction of input note

6 Input MIDI 63 Note

7 Parametric AE reconstruction of input

8 Parametric CVAE reconstruction of input

9 Input MIDI 65 note(endpoint trained model)

10 Parametric AE reconstruction of input(endpoint trained model)

11 Parametric CVAE reconstruction of input(endpoint trained model)

12 CVAE Generated MIDI 65 Violin note

13 Similar MIDI 65 Violin note from dataset

3636

Audio examples description II

14 CVAE Reconstruction of the MIDI 65 violin note

15 CVAE Generated MIDI 65 Violin note with vibrato

  • Notes
      1. fdrm17
      2. fdrm16
      3. fdrm15
      4. fdrm14
      5. fdrm13
      6. fdrm12
      7. fdrm11
      8. fdrm10
      9. fdrm9
      10. fdrm8
      11. fdrm7
      12. fdrm6
      13. fdrm5
      14. fdrm3
      15. fdrm1
Page 57: A Variational Parametric Model for Audio Synthesiskrishnasubramani/data/ddp_stage… · analysis-synthesis based reconstruction which fails to model temporal evolution I[Engel et

2036

Network Architecture

I Main hyperparameters -

2 Dimensionality of latent space - networks reconstruction ability

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

CVAEAE

Figure CVAE(β = 01) vs AE

I Steep fall initially flatter later Choose dimensionality = 32

2036

Network Architecture

I Main hyperparameters -

2 Dimensionality of latent space - networks reconstruction ability

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

CVAEAE

Figure CVAE(β = 01) vs AE

I Steep fall initially flatter later Choose dimensionality = 32

2036

Network Architecture

I Main hyperparameters -

2 Dimensionality of latent space - networks reconstruction ability

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

CVAEAE

Figure CVAE(β = 01) vs AE

I Steep fall initially flatter later Choose dimensionality = 32

2136

Network Architecture

I Network size [91 91 32 91 91]I 91 is the dimension of input CCs 32 is latent space

dimensionality

I Linear Fully Connected Layers + Leaky ReLU activations

I ADAM [Kingma and Ba 2014] with initial lr = 10minus3

I Training for 2000 epochs with batch size 512

2236

Experiments

I Two kinds of experiments to demonstrate networks capabilities

1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch

2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches

2236

Experiments

I Two kinds of experiments to demonstrate networks capabilities

1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch

2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches

2236

Experiments

I Two kinds of experiments to demonstrate networks capabilities

1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch

2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2436

Reconstruction

60 62 64 66 68 70 72

1

2

3

4

5middot10minus2

Pitch

MSE

AECVAE

I Both AE and CVAE reasonably reconstruct the target pitch1 6 2 7 3 8

I CVAE produces better reconstruction especially when thetarget pitch is far from the pitches available in the trainingdata(plot above) 4 9 5 10 6 11

I Conditioning helps to capture the pitch dependency of thespectral envelope more accurately

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton8)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton9)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton10)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton11)ocgs[i]state=false

1536

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton12)ocgs[i]state=false

1536

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton13)ocgs[i]state=false

1536

2536

Reconstruction

I To emulate the effect of pitch conditioning with an AE wetrain the AE by appending the pitch to the input CCs andreconstructing this appended input as shown below

f0

CCscAE

fprime

0

CCsprime

Figure lsquoConditionalrsquo AE(cAE)

I [Wyse 2018] followed a similar approach of appending theconditional variables to the input of his model

I the lsquocAErsquo is comparable to our proposed CVAE in that thenetwork might potentially learn something from f0

2636

Reconstruction

60 62 64 66 68 70 72

1

2

3

4

5middot10minus2

Pitch

MSE

AECVAEcAE

I We train the cAE on the octave endpoints and evaluateperformance as shown above

I Appending f0 does not improve performance It is infact worsethan AE

2736

Generation

I We are interested in lsquosynthesizingrsquo audio

I We see how well the network can generate an instance of adesired pitch(which the network has not been trained on)

Z

Dec

od

er

f0

Figure Sampling from the Network

I We follow the procedure in [Blaauw and Bonada 2016] - arandom walk to sample points coherently from the latent space

2836

Generation

I We train on instances across the entire octave sans MIDI 65and then generate MIDI 65

I On listening we can see that the generated audio still lacks thesoft noisy sound of the violin bowing

1 12 2 13 3 14

I For a more realistic synthesis we generate a violin note withvibrato 4 15

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton14)ocgs[i]state=false

2184

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton15)ocgs[i]state=false

2256

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton16)ocgs[i]state=false

2184

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton17)ocgs[i]state=false

2184

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3136

References I

[Blaauw and Bonada 2016] Blaauw M and Bonada J (2016)Modeling and transforming speech using variational autoencodersIn Interspeech pages 1770ndash1774

[Caetano and Rodet 2012] Caetano M and Rodet X (2012)A source-filter model for musical instrument sound transformationIn 2012 IEEE International Conference on Acoustics Speech and Signal Processing(ICASSP) pages 137ndash140 IEEE

[Doersch 2016] Doersch C (2016)Tutorial on variational autoencodersarXiv preprint arXiv160605908

[Engel et al 2017] Engel J Resnick C Roberts A Dieleman S Norouzi MEck D and Simonyan K (2017)Neural audio synthesis of musical notes with wavenet autoencodersIn Proceedings of the 34th International Conference on Machine Learning-Volume70 pages 1068ndash1077 JMLR org

[Esling et al 2018] Esling P Bitton A et al (2018)Generative timbre spaces regularizing variational auto-encoders with perceptualmetricsarXiv preprint arXiv180508501

3236

References II

[Griffin and Lim 1984] Griffin D and Lim J (1984)Signal estimation from modified short-time fourier transformIEEE Transactions on Acoustics Speech and Signal Processing 32(2)236ndash243

[Hinton and Salakhutdinov 2006] Hinton G E and Salakhutdinov R R (2006)Reducing the dimensionality of data with neural networksscience 313(5786)504ndash507

[IMAI 1979] IMAI S (1979)Spectral envelope extraction by improved cepstrumIEICE 62217ndash228

[Kingma and Ba 2014] Kingma D P and Ba J (2014)Adam A method for stochastic optimizationarXiv preprint arXiv14126980

[Kingma and Welling 2013] Kingma D P and Welling M (2013)Auto-encoding variational bayesarXiv preprint arXiv13126114

[Oord et al 2016] Oord A v d Dieleman S Zen H Simonyan K Vinyals OGraves A Kalchbrenner N Senior A and Kavukcuoglu K (2016)Wavenet A generative model for raw audioarXiv preprint arXiv160903499

3336

References III

[Roche et al 2018] Roche F Hueber T Limier S and Girin L (2018)Autoencoders for music sound modeling a comparison of linear shallow deeprecurrent and variational modelsarXiv preprint arXiv180604096

[Roebel and Rodet 2005] Roebel A and Rodet X (2005)Efficient Spectral Envelope Estimation and its application to pitch shifting andenvelope preservationIn International Conference on Digital Audio Effects pages 30ndash35 Madrid Spaincote interne IRCAM Roebel05b

[Romani Picas et al 2015] Romani Picas O Parra Rodriguez H Dabiri DTokuda H Hariya W Oishi K and Serra X (2015)A real-time system for measuring sound goodness in instrumental soundsIn Audio Engineering Society Convention 138 Audio Engineering Society

[Sarroff and Casey 2014] Sarroff A M and Casey M A (2014)Musical audio synthesis using autoencoding neural netsIn ICMC

[Serra et al 1997] Serra X et al (1997)Musical sound modeling with sinusoids plus noiseMusical signal processing pages 91ndash122

3436

References IV

[Slawson 1981] Slawson W (1981)The color of sound a theoretical study in musical timbreMusic Theory Spectrum 3132ndash141

[Sohn et al 2015] Sohn K Lee H and Yan X (2015)Learning structured output representation using deep conditional generative models

In Advances in neural information processing systems pages 3483ndash3491

[Wyse 2018] Wyse L (2018)Real-valued parametric conditioning of an rnn for interactive sound synthesisarXiv preprint arXiv180510808

3536

Audio examples description I

1 Input MIDI 63 to Spectral Model

2 Spectral Model Reconstruction(trained on MIDI63)

3 Spectral Model Reconstruction(not trained on MIDI63)

4 Input MIDI 60 note to Parametric Model

5 Parametric Reconstruction of input note

6 Input MIDI 63 Note

7 Parametric AE reconstruction of input

8 Parametric CVAE reconstruction of input

9 Input MIDI 65 note(endpoint trained model)

10 Parametric AE reconstruction of input(endpoint trained model)

11 Parametric CVAE reconstruction of input(endpoint trained model)

12 CVAE Generated MIDI 65 Violin note

13 Similar MIDI 65 Violin note from dataset

3636

Audio examples description II

14 CVAE Reconstruction of the MIDI 65 violin note

15 CVAE Generated MIDI 65 Violin note with vibrato

  • Notes
      1. fdrm17
      2. fdrm16
      3. fdrm15
      4. fdrm14
      5. fdrm13
      6. fdrm12
      7. fdrm11
      8. fdrm10
      9. fdrm9
      10. fdrm8
      11. fdrm7
      12. fdrm6
      13. fdrm5
      14. fdrm3
      15. fdrm1
Page 58: A Variational Parametric Model for Audio Synthesiskrishnasubramani/data/ddp_stage… · analysis-synthesis based reconstruction which fails to model temporal evolution I[Engel et

2036

Network Architecture

I Main hyperparameters -

2 Dimensionality of latent space - networks reconstruction ability

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

CVAEAE

Figure CVAE(β = 01) vs AE

I Steep fall initially flatter later Choose dimensionality = 32

2036

Network Architecture

I Main hyperparameters -

2 Dimensionality of latent space - networks reconstruction ability

0 10 20 30 40 50 60 70

2

4

6

8middot10minus5

Latent space dimensionality

MSE

CVAEAE

Figure CVAE(β = 01) vs AE

I Steep fall initially flatter later Choose dimensionality = 32

2136

Network Architecture

I Network size [91 91 32 91 91]I 91 is the dimension of input CCs 32 is latent space

dimensionality

I Linear Fully Connected Layers + Leaky ReLU activations

I ADAM [Kingma and Ba 2014] with initial lr = 10minus3

I Training for 2000 epochs with batch size 512

2236

Experiments

I Two kinds of experiments to demonstrate networks capabilities

1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch

2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches

2236

Experiments

I Two kinds of experiments to demonstrate networks capabilities

1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch

2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches

2236

Experiments

I Two kinds of experiments to demonstrate networks capabilities

1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch

2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2436

Reconstruction

60 62 64 66 68 70 72

1

2

3

4

5middot10minus2

Pitch

MSE

AECVAE

I Both AE and CVAE reasonably reconstruct the target pitch1 6 2 7 3 8

I CVAE produces better reconstruction especially when thetarget pitch is far from the pitches available in the trainingdata(plot above) 4 9 5 10 6 11

I Conditioning helps to capture the pitch dependency of thespectral envelope more accurately

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton8)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton9)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton10)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton11)ocgs[i]state=false

1536

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton12)ocgs[i]state=false

1536

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton13)ocgs[i]state=false

1536

2536

Reconstruction

I To emulate the effect of pitch conditioning with an AE wetrain the AE by appending the pitch to the input CCs andreconstructing this appended input as shown below

f0

CCscAE

fprime

0

CCsprime

Figure lsquoConditionalrsquo AE(cAE)

I [Wyse 2018] followed a similar approach of appending theconditional variables to the input of his model

I the lsquocAErsquo is comparable to our proposed CVAE in that thenetwork might potentially learn something from f0

2636

Reconstruction

60 62 64 66 68 70 72

1

2

3

4

5middot10minus2

Pitch

MSE

AECVAEcAE

I We train the cAE on the octave endpoints and evaluateperformance as shown above

I Appending f0 does not improve performance It is infact worsethan AE

2736

Generation

I We are interested in lsquosynthesizingrsquo audio

I We see how well the network can generate an instance of adesired pitch(which the network has not been trained on)

Z

Dec

od

er

f0

Figure Sampling from the Network

I We follow the procedure in [Blaauw and Bonada 2016] - arandom walk to sample points coherently from the latent space

2836

Generation

I We train on instances across the entire octave sans MIDI 65and then generate MIDI 65

I On listening we can see that the generated audio still lacks thesoft noisy sound of the violin bowing

1 12 2 13 3 14

I For a more realistic synthesis we generate a violin note withvibrato 4 15

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton14)ocgs[i]state=false

2184

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton15)ocgs[i]state=false

2256

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton16)ocgs[i]state=false

2184

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton17)ocgs[i]state=false

2184

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

Putting it all together

I Our model is still far from perfect, however, and we have to improve it further:

1 An immediate next step is to model the residual of the input audio; this will help make the generated audio sound more natural

2 We have not taken dynamics into account; a more complete system will involve capturing spectral envelope (timbral) dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests, involving synthesis of larger pitch movements or melodic elements from Indian Classical music

I Our contributions

1 To the best of our knowledge, we have not come across any work using a parametric model for musical tones in the neural synthesis framework, especially one exploiting the conditioning function of the CVAE

2 The TAE method has no known Python implementation, so we plan to make our code open-source to the MIR community for research

References I

[Blaauw and Bonada 2016] Blaauw, M. and Bonada, J. (2016). Modeling and transforming speech using variational autoencoders. In Interspeech, pages 1770–1774.

[Caetano and Rodet 2012] Caetano, M. and Rodet, X. (2012). A source-filter model for musical instrument sound transformation. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 137–140. IEEE.

[Doersch 2016] Doersch, C. (2016). Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.

[Engel et al 2017] Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D., and Simonyan, K. (2017). Neural audio synthesis of musical notes with WaveNet autoencoders. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1068–1077. JMLR.org.

[Esling et al 2018] Esling, P., Bitton, A., et al. (2018). Generative timbre spaces: regularizing variational auto-encoders with perceptual metrics. arXiv preprint arXiv:1805.08501.


References II

[Griffin and Lim 1984] Griffin, D. and Lim, J. (1984). Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236–243.

[Hinton and Salakhutdinov 2006] Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507.

[IMAI 1979] Imai, S. (1979). Spectral envelope extraction by improved cepstrum. IEICE, 62:217–228.

[Kingma and Ba 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

[Kingma and Welling 2013] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

[Oord et al 2016] Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.


References III

[Roche et al 2018] Roche, F., Hueber, T., Limier, S., and Girin, L. (2018). Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models. arXiv preprint arXiv:1806.04096.

[Roebel and Rodet 2005] Roebel, A. and Rodet, X. (2005). Efficient spectral envelope estimation and its application to pitch shifting and envelope preservation. In International Conference on Digital Audio Effects, pages 30–35, Madrid, Spain.

[Romani Picas et al 2015] Romani Picas, O., Parra Rodriguez, H., Dabiri, D., Tokuda, H., Hariya, W., Oishi, K., and Serra, X. (2015). A real-time system for measuring sound goodness in instrumental sounds. In Audio Engineering Society Convention 138. Audio Engineering Society.

[Sarroff and Casey 2014] Sarroff, A. M. and Casey, M. A. (2014). Musical audio synthesis using autoencoding neural nets. In ICMC.

[Serra et al 1997] Serra, X. et al. (1997). Musical sound modeling with sinusoids plus noise. Musical Signal Processing, pages 91–122.


References IV

[Slawson 1981] Slawson, W. (1981). The color of sound: a theoretical study in musical timbre. Music Theory Spectrum, 3:132–141.

[Sohn et al 2015] Sohn, K., Lee, H., and Yan, X. (2015). Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483–3491.

[Wyse 2018] Wyse, L. (2018). Real-valued parametric conditioning of an RNN for interactive sound synthesis. arXiv preprint arXiv:1805.10808.


Audio examples description I

1 Input MIDI 63 to Spectral Model

2 Spectral Model Reconstruction (trained on MIDI 63)

3 Spectral Model Reconstruction (not trained on MIDI 63)

4 Input MIDI 60 note to Parametric Model

5 Parametric Reconstruction of input note

6 Input MIDI 63 Note

7 Parametric AE reconstruction of input

8 Parametric CVAE reconstruction of input

9 Input MIDI 65 note (endpoint trained model)

10 Parametric AE reconstruction of input (endpoint trained model)

11 Parametric CVAE reconstruction of input (endpoint trained model)

12 CVAE Generated MIDI 65 Violin note

13 Similar MIDI 65 Violin note from dataset


Audio examples description II

14 CVAE Reconstruction of the MIDI 65 violin note

15 CVAE Generated MIDI 65 Violin note with vibrato


[Caetano and Rodet 2012] Caetano M and Rodet X (2012)A source-filter model for musical instrument sound transformationIn 2012 IEEE International Conference on Acoustics Speech and Signal Processing(ICASSP) pages 137ndash140 IEEE

[Doersch 2016] Doersch C (2016)Tutorial on variational autoencodersarXiv preprint arXiv160605908

[Engel et al 2017] Engel J Resnick C Roberts A Dieleman S Norouzi MEck D and Simonyan K (2017)Neural audio synthesis of musical notes with wavenet autoencodersIn Proceedings of the 34th International Conference on Machine Learning-Volume70 pages 1068ndash1077 JMLR org

[Esling et al 2018] Esling P Bitton A et al (2018)Generative timbre spaces regularizing variational auto-encoders with perceptualmetricsarXiv preprint arXiv180508501

3236

References II

[Griffin and Lim 1984] Griffin D and Lim J (1984)Signal estimation from modified short-time fourier transformIEEE Transactions on Acoustics Speech and Signal Processing 32(2)236ndash243

[Hinton and Salakhutdinov 2006] Hinton G E and Salakhutdinov R R (2006)Reducing the dimensionality of data with neural networksscience 313(5786)504ndash507

[IMAI 1979] IMAI S (1979)Spectral envelope extraction by improved cepstrumIEICE 62217ndash228

[Kingma and Ba 2014] Kingma D P and Ba J (2014)Adam A method for stochastic optimizationarXiv preprint arXiv14126980

[Kingma and Welling 2013] Kingma D P and Welling M (2013)Auto-encoding variational bayesarXiv preprint arXiv13126114

[Oord et al 2016] Oord A v d Dieleman S Zen H Simonyan K Vinyals OGraves A Kalchbrenner N Senior A and Kavukcuoglu K (2016)Wavenet A generative model for raw audioarXiv preprint arXiv160903499

3336

References III

[Roche et al 2018] Roche F Hueber T Limier S and Girin L (2018)Autoencoders for music sound modeling a comparison of linear shallow deeprecurrent and variational modelsarXiv preprint arXiv180604096

[Roebel and Rodet 2005] Roebel A and Rodet X (2005)Efficient Spectral Envelope Estimation and its application to pitch shifting andenvelope preservationIn International Conference on Digital Audio Effects pages 30ndash35 Madrid Spaincote interne IRCAM Roebel05b

[Romani Picas et al 2015] Romani Picas O Parra Rodriguez H Dabiri DTokuda H Hariya W Oishi K and Serra X (2015)A real-time system for measuring sound goodness in instrumental soundsIn Audio Engineering Society Convention 138 Audio Engineering Society

[Sarroff and Casey 2014] Sarroff A M and Casey M A (2014)Musical audio synthesis using autoencoding neural netsIn ICMC

[Serra et al 1997] Serra X et al (1997)Musical sound modeling with sinusoids plus noiseMusical signal processing pages 91ndash122

3436

References IV

[Slawson 1981] Slawson W (1981)The color of sound a theoretical study in musical timbreMusic Theory Spectrum 3132ndash141

[Sohn et al 2015] Sohn K Lee H and Yan X (2015)Learning structured output representation using deep conditional generative models

In Advances in neural information processing systems pages 3483ndash3491

[Wyse 2018] Wyse L (2018)Real-valued parametric conditioning of an rnn for interactive sound synthesisarXiv preprint arXiv180510808

3536

Audio examples description I

1 Input MIDI 63 to Spectral Model

2 Spectral Model Reconstruction(trained on MIDI63)

3 Spectral Model Reconstruction(not trained on MIDI63)

4 Input MIDI 60 note to Parametric Model

5 Parametric Reconstruction of input note

6 Input MIDI 63 Note

7 Parametric AE reconstruction of input

8 Parametric CVAE reconstruction of input

9 Input MIDI 65 note(endpoint trained model)

10 Parametric AE reconstruction of input(endpoint trained model)

11 Parametric CVAE reconstruction of input(endpoint trained model)

12 CVAE Generated MIDI 65 Violin note

13 Similar MIDI 65 Violin note from dataset

3636

Audio examples description II

14 CVAE Reconstruction of the MIDI 65 violin note

15 CVAE Generated MIDI 65 Violin note with vibrato

  • Notes
      1. fdrm17
      2. fdrm16
      3. fdrm15
      4. fdrm14
      5. fdrm13
      6. fdrm12
      7. fdrm11
      8. fdrm10
      9. fdrm9
      10. fdrm8
      11. fdrm7
      12. fdrm6
      13. fdrm5
      14. fdrm3
      15. fdrm1
Page 62: A Variational Parametric Model for Audio Synthesiskrishnasubramani/data/ddp_stage… · analysis-synthesis based reconstruction which fails to model temporal evolution I[Engel et

22/36

Experiments

▶ Two kinds of experiments to demonstrate the network's capabilities:

1. Reconstruction - omit pitch instances during training and see how well the model reconstructs notes of the omitted target pitch
2. Generation - how well the model 'synthesizes' note instances with new, unseen pitches

23/36

Reconstruction

▶ Two training contexts:

1. T (✗) is the target pitch, ✓ marks the training instances:

   MIDI:  T−3  T−2  T−1   T   T+1  T+2  T+3
   Kept:   ✓    ✓    ✓    ✗    ✓    ✓    ✓

2. Octave endpoints (only the two endpoint pitches are kept for training):

   MIDI:  60  61  62  63  64  65  66  67  68  69  70  71
   Kept:   ✓   ✗   ✗   ✗   ✗   ✗   ✗   ✗   ✗   ✗   ✗   ✓

▶ In each of the above cases, we compute the MSE as the frame-wise spectral envelope match across all frames of all the target instances (a minimal sketch of this computation is given below)
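As an illustration of this evaluation, the sketch below averages the frame-wise squared envelope error over all frames of every target-pitch instance. It is a minimal sketch assuming each note is stored as a (frames × coefficients) array of spectral-envelope (cepstral) frames; the reconstruction callable is a placeholder for the trained AE/CVAE, not the thesis code.

```python
import numpy as np

def frame_wise_mse(true_frames: np.ndarray, pred_frames: np.ndarray) -> np.ndarray:
    """Per-frame squared error between two (n_frames, n_coeffs) envelope arrays."""
    return ((true_frames - pred_frames) ** 2).mean(axis=1)

def target_pitch_mse(target_notes, reconstruct):
    """MSE across all frames of all target-pitch instances.

    target_notes: list of (n_frames, n_coeffs) arrays for the held-out pitch.
    reconstruct:  placeholder callable applying the trained AE/CVAE frame-wise.
    """
    per_frame_errors = [frame_wise_mse(note, reconstruct(note)) for note in target_notes]
    # pool every frame of every target note before averaging, so longer notes
    # contribute proportionally more frames to the reported MSE
    return float(np.concatenate(per_frame_errors).mean())
```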

24/36

Reconstruction

Figure: reconstruction MSE (·10⁻², y-axis) versus target pitch (MIDI 60–72, x-axis) for the AE and the CVAE.

▶ Both the AE and the CVAE reasonably reconstruct the target pitch (audio examples 6–8)

▶ The CVAE produces better reconstructions, especially when the target pitch is far from the pitches available in the training data (plot above; audio examples 9–11)

▶ Conditioning helps to capture the pitch dependency of the spectral envelope more accurately

25/36

Reconstruction

▶ To emulate the effect of pitch conditioning with an AE, we train the AE by appending the pitch to the input CCs and reconstructing this appended input, as shown below

Figure: 'Conditional' AE (cAE), where the input (f0, CCs) is encoded and reconstructed as (f0′, CCs′).

▶ [Wyse, 2018] followed a similar approach of appending the conditional variables to the input of his model

▶ The 'cAE' is comparable to our proposed CVAE in that the network might potentially learn something from f0 (a hedged sketch of such a cAE follows below)
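For concreteness, a minimal sketch of such a cAE is given below (assuming PyTorch, which the slides do not specify): the normalized pitch is simply concatenated to the frame's cepstral coefficients and a plain autoencoder reconstructs the concatenated vector. Layer sizes, the cepstral order, and the pitch normalization are illustrative assumptions, not the actual settings.

```python
import torch
import torch.nn as nn

class ConditionalAE(nn.Module):
    """Plain AE over the appended input [f0, CCs]; it outputs [f0', CCs']."""

    def __init__(self, n_ccs: int = 60, latent_dim: int = 32):
        super().__init__()
        d_in = n_ccs + 1  # cepstral coefficients plus the appended pitch value
        self.encoder = nn.Sequential(nn.Linear(d_in, 256), nn.ReLU(),
                                     nn.Linear(256, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                     nn.Linear(256, d_in))

    def forward(self, ccs: torch.Tensor, f0_midi: torch.Tensor) -> torch.Tensor:
        f0 = (f0_midi - 60.0) / 12.0               # crude per-octave normalization
        x = torch.cat([f0.unsqueeze(-1), ccs], dim=-1)
        return self.decoder(self.encoder(x))       # reconstructs the appended input
```

Training then simply minimizes the reconstruction error of the appended vector, exactly as for the unconditioned AE.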

26/36

Reconstruction

Figure: reconstruction MSE (·10⁻²) versus pitch (MIDI 60–72) for the AE, CVAE, and cAE.

▶ We train the cAE on the octave endpoints and evaluate performance as shown above

▶ Appending f0 does not improve performance; it is in fact worse than the AE

27/36

Generation

▶ We are interested in 'synthesizing' audio

▶ We see how well the network can generate an instance of a desired pitch (which the network has not been trained on)

Figure: sampling from the network: a latent vector Z and the pitch f0 are fed to the decoder.

▶ We follow the procedure in [Blaauw and Bonada, 2016]: a random walk to sample points coherently from the latent space (a minimal sketch follows below)
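A minimal sketch of such a random walk is given below; it assumes the trained decoder maps one latent vector plus a pitch value to one frame of the spectral envelope, and the step size and latent dimensionality are illustrative guesses rather than values from the thesis.

```python
import numpy as np

def random_walk_latents(n_frames: int, latent_dim: int = 32,
                        step: float = 0.05, seed: int = 0) -> np.ndarray:
    """Temporally coherent latent trajectory: start at N(0, I) and take small
    Gaussian steps, so consecutive frames decode to smoothly varying envelopes."""
    rng = np.random.default_rng(seed)
    z = np.empty((n_frames, latent_dim))
    z[0] = rng.standard_normal(latent_dim)
    for t in range(1, n_frames):
        z[t] = z[t - 1] + step * rng.standard_normal(latent_dim)
    return z

# usage with a (placeholder) trained decoder that is conditioned on pitch:
# trajectory = random_walk_latents(n_frames=200)
# envelope_frames = [decoder(z, f0_midi=65) for z in trajectory]
```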

28/36

Generation

▶ We train on instances across the entire octave sans MIDI 65, and then generate MIDI 65

▶ On listening, we find that the generated audio still lacks the soft, noisy sound of the violin bowing (audio examples 12–14)

▶ For a more realistic synthesis, we generate a violin note with vibrato (audio example 15); a sketch of such a vibrato pitch contour follows below
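One way to realize such a vibrato note is to vary the conditioning pitch from frame to frame with a slow sinusoid around the target MIDI note; the rate and depth below are plausible placeholder values, not the ones used for the audio example.

```python
import numpy as np

def vibrato_contour(midi_pitch: float = 65.0, n_frames: int = 200,
                    frame_rate_hz: float = 100.0, vib_rate_hz: float = 5.0,
                    depth_semitones: float = 0.3) -> np.ndarray:
    """Frame-wise MIDI pitch contour with sinusoidal vibrato."""
    t = np.arange(n_frames) / frame_rate_hz
    return midi_pitch + depth_semitones * np.sin(2.0 * np.pi * vib_rate_hz * t)

# each value of the contour conditions the decoder for one frame, so the generated
# spectral envelopes (and the resynthesized harmonics) follow the vibrato movement
```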

29/36

Putting it all together

▶ We explored autoencoder frameworks in generative modeling for audio synthesis of instrumental tones

▶ Our parametric representation decouples 'timbre' and 'pitch', thus relying on the network to model their inter-dependencies

▶ Pitch conditioning allows us to generate the learnt spectral envelope for that pitch, thus enabling us to vary the pitch contour continuously (a sketch of such a conditioned decoder follows below)
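To make the conditioning mechanism concrete, the sketch below shows a pitch-conditioned decoder of the kind a CVAE uses: the pitch is concatenated to the latent code, so the same latent point decodes to the envelope learnt for whichever pitch is supplied. This is a hedged PyTorch sketch with illustrative layer sizes, not the exact architecture from the thesis.

```python
import torch
import torch.nn as nn

class PitchConditionedDecoder(nn.Module):
    """CVAE-style decoder: latent code z concatenated with the pitch condition."""

    def __init__(self, latent_dim: int = 32, n_ccs: int = 60):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim + 1, 256), nn.ReLU(),
                                 nn.Linear(256, n_ccs))

    def forward(self, z: torch.Tensor, f0_midi: torch.Tensor) -> torch.Tensor:
        f0 = (f0_midi - 60.0) / 12.0                 # normalized pitch condition
        return self.net(torch.cat([z, f0.unsqueeze(-1)], dim=-1))

# sweeping f0_midi frame by frame while holding or slowly varying z is what lets
# the pitch contour be varied continuously at synthesis time
```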

30/36

Putting it all together

▶ Our model is still far from perfect, however, and we have to improve it further:

1. An immediate step will be to model the residual of the input audio; this will help make the generated audio sound more natural
2. We have not taken dynamics into account; a more complete system will capture the spectral envelope (timbral) dependencies on both pitch and loudness dynamics
3. Conducting more formal listening tests, involving synthesis of larger pitch movements or melodic elements from Indian Classical music

▶ Our contributions:

1. To the best of our knowledge, we have not come across any work using a parametric model for musical tones in the neural synthesis framework, especially one exploiting the conditioning function of the CVAE
2. The TAE method has no known Python implementation, so we plan to make our code open-source to the MIR community for research (a rough sketch of the TAE iteration is given below)
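Since the parametric front end rests on the true amplitude envelope (TAE) of [IMAI, 1979] and [Roebel and Rodet, 2005], a rough numpy sketch of that iterative cepstral-smoothing idea follows; it is only an assumed reading of the published algorithm (cepstral order, tolerance, and iteration cap are illustrative), not the code we will release.

```python
import numpy as np

def true_amplitude_envelope(log_mag: np.ndarray, n_ceps: int = 60,
                            tol: float = 1e-2, max_iter: int = 200) -> np.ndarray:
    """Iterative cepstral smoothing ('true envelope') of one log-magnitude frame.

    log_mag: log-magnitude spectrum over all FFT bins of a single frame.
    n_ceps:  order of the cepstral (liftering) low-pass used for smoothing.
    """
    def cepstral_smooth(spectrum: np.ndarray) -> np.ndarray:
        ceps = np.fft.ifft(spectrum).real       # real cepstrum of the log spectrum
        lifter = np.zeros_like(ceps)
        lifter[:n_ceps] = 1.0                   # keep low quefrencies ...
        lifter[-(n_ceps - 1):] = 1.0            # ... and their symmetric mirror
        return np.fft.fft(ceps * lifter).real

    env = cepstral_smooth(log_mag)
    for _ in range(max_iter):
        target = np.maximum(log_mag, env)       # push the envelope up onto the peaks
        env = cepstral_smooth(target)
        if np.all(log_mag <= env + tol):        # stop once it lies above the spectrum
            break
    return env
```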

31/36

References I

[Blaauw and Bonada, 2016] Blaauw, M. and Bonada, J. (2016). Modeling and transforming speech using variational autoencoders. In Interspeech, pages 1770–1774.

[Caetano and Rodet, 2012] Caetano, M. and Rodet, X. (2012). A source-filter model for musical instrument sound transformation. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 137–140. IEEE.

[Doersch, 2016] Doersch, C. (2016). Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.

[Engel et al., 2017] Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D., and Simonyan, K. (2017). Neural audio synthesis of musical notes with WaveNet autoencoders. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1068–1077. JMLR.org.

[Esling et al., 2018] Esling, P., Bitton, A., et al. (2018). Generative timbre spaces: regularizing variational auto-encoders with perceptual metrics. arXiv preprint arXiv:1805.08501.

32/36

References II

[Griffin and Lim, 1984] Griffin, D. and Lim, J. (1984). Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236–243.

[Hinton and Salakhutdinov, 2006] Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507.

[IMAI, 1979] Imai, S. (1979). Spectral envelope extraction by improved cepstrum. IEICE, 62:217–228.

[Kingma and Ba, 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

[Kingma and Welling, 2013] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

[Oord et al., 2016] Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.

33/36

References III

[Roche et al., 2018] Roche, F., Hueber, T., Limier, S., and Girin, L. (2018). Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models. arXiv preprint arXiv:1806.04096.

[Roebel and Rodet, 2005] Roebel, A. and Rodet, X. (2005). Efficient spectral envelope estimation and its application to pitch shifting and envelope preservation. In International Conference on Digital Audio Effects, pages 30–35, Madrid, Spain.

[Romani Picas et al., 2015] Romani Picas, O., Parra Rodriguez, H., Dabiri, D., Tokuda, H., Hariya, W., Oishi, K., and Serra, X. (2015). A real-time system for measuring sound goodness in instrumental sounds. In Audio Engineering Society Convention 138. Audio Engineering Society.

[Sarroff and Casey, 2014] Sarroff, A. M. and Casey, M. A. (2014). Musical audio synthesis using autoencoding neural nets. In ICMC.

[Serra et al., 1997] Serra, X. et al. (1997). Musical sound modeling with sinusoids plus noise. Musical Signal Processing, pages 91–122.

34/36

References IV

[Slawson, 1981] Slawson, W. (1981). The color of sound: a theoretical study in musical timbre. Music Theory Spectrum, 3:132–141.

[Sohn et al., 2015] Sohn, K., Lee, H., and Yan, X. (2015). Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483–3491.

[Wyse, 2018] Wyse, L. (2018). Real-valued parametric conditioning of an RNN for interactive sound synthesis. arXiv preprint arXiv:1805.10808.

35/36

Audio examples description I

1. Input MIDI 63 to Spectral Model
2. Spectral Model reconstruction (trained on MIDI 63)
3. Spectral Model reconstruction (not trained on MIDI 63)
4. Input MIDI 60 note to Parametric Model
5. Parametric reconstruction of input note
6. Input MIDI 63 note
7. Parametric AE reconstruction of input
8. Parametric CVAE reconstruction of input
9. Input MIDI 65 note (endpoint-trained model)
10. Parametric AE reconstruction of input (endpoint-trained model)
11. Parametric CVAE reconstruction of input (endpoint-trained model)
12. CVAE-generated MIDI 65 violin note
13. Similar MIDI 65 violin note from dataset

36/36

Audio examples description II

14. CVAE reconstruction of the MIDI 65 violin note
15. CVAE-generated MIDI 65 violin note with vibrato

Page 63: A Variational Parametric Model for Audio Synthesiskrishnasubramani/data/ddp_stage… · analysis-synthesis based reconstruction which fails to model temporal evolution I[Engel et

2236

Experiments

I Two kinds of experiments to demonstrate networks capabilities

1 Reconstruction - Omit pitch instances during training and seehow well model reconstructs notes of omitted target pitch

2 Generation - How well model lsquosynthesizesrsquo note instances withnew unseen pitches

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2436

Reconstruction

60 62 64 66 68 70 72

1

2

3

4

5middot10minus2

Pitch

MSE

AECVAE

I Both AE and CVAE reasonably reconstruct the target pitch1 6 2 7 3 8

I CVAE produces better reconstruction especially when thetarget pitch is far from the pitches available in the trainingdata(plot above) 4 9 5 10 6 11

I Conditioning helps to capture the pitch dependency of thespectral envelope more accurately

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton8)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton9)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton10)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton11)ocgs[i]state=false

1536

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton12)ocgs[i]state=false

1536

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton13)ocgs[i]state=false

1536

2536

Reconstruction

I To emulate the effect of pitch conditioning with an AE wetrain the AE by appending the pitch to the input CCs andreconstructing this appended input as shown below

f0

CCscAE

fprime

0

CCsprime

Figure lsquoConditionalrsquo AE(cAE)

I [Wyse 2018] followed a similar approach of appending theconditional variables to the input of his model

I the lsquocAErsquo is comparable to our proposed CVAE in that thenetwork might potentially learn something from f0

2636

Reconstruction

60 62 64 66 68 70 72

1

2

3

4

5middot10minus2

Pitch

MSE

AECVAEcAE

I We train the cAE on the octave endpoints and evaluateperformance as shown above

I Appending f0 does not improve performance It is infact worsethan AE

2736

Generation

I We are interested in lsquosynthesizingrsquo audio

I We see how well the network can generate an instance of adesired pitch(which the network has not been trained on)

Z

Dec

od

er

f0

Figure Sampling from the Network

I We follow the procedure in [Blaauw and Bonada 2016] - arandom walk to sample points coherently from the latent space

2836

Generation

I We train on instances across the entire octave sans MIDI 65and then generate MIDI 65

I On listening we can see that the generated audio still lacks thesoft noisy sound of the violin bowing

1 12 2 13 3 14

I For a more realistic synthesis we generate a violin note withvibrato 4 15

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton14)ocgs[i]state=false

2184

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton15)ocgs[i]state=false

2256

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton16)ocgs[i]state=false

2184

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton17)ocgs[i]state=false

2184

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3136

References I

[Blaauw and Bonada 2016] Blaauw M and Bonada J (2016)Modeling and transforming speech using variational autoencodersIn Interspeech pages 1770ndash1774

[Caetano and Rodet 2012] Caetano M and Rodet X (2012)A source-filter model for musical instrument sound transformationIn 2012 IEEE International Conference on Acoustics Speech and Signal Processing(ICASSP) pages 137ndash140 IEEE

[Doersch 2016] Doersch C (2016)Tutorial on variational autoencodersarXiv preprint arXiv160605908

[Engel et al 2017] Engel J Resnick C Roberts A Dieleman S Norouzi MEck D and Simonyan K (2017)Neural audio synthesis of musical notes with wavenet autoencodersIn Proceedings of the 34th International Conference on Machine Learning-Volume70 pages 1068ndash1077 JMLR org

[Esling et al 2018] Esling P Bitton A et al (2018)Generative timbre spaces regularizing variational auto-encoders with perceptualmetricsarXiv preprint arXiv180508501

3236

References II

[Griffin and Lim 1984] Griffin D and Lim J (1984)Signal estimation from modified short-time fourier transformIEEE Transactions on Acoustics Speech and Signal Processing 32(2)236ndash243

[Hinton and Salakhutdinov 2006] Hinton G E and Salakhutdinov R R (2006)Reducing the dimensionality of data with neural networksscience 313(5786)504ndash507

[IMAI 1979] IMAI S (1979)Spectral envelope extraction by improved cepstrumIEICE 62217ndash228

[Kingma and Ba 2014] Kingma D P and Ba J (2014)Adam A method for stochastic optimizationarXiv preprint arXiv14126980

[Kingma and Welling 2013] Kingma D P and Welling M (2013)Auto-encoding variational bayesarXiv preprint arXiv13126114

[Oord et al 2016] Oord A v d Dieleman S Zen H Simonyan K Vinyals OGraves A Kalchbrenner N Senior A and Kavukcuoglu K (2016)Wavenet A generative model for raw audioarXiv preprint arXiv160903499

3336

References III

[Roche et al 2018] Roche F Hueber T Limier S and Girin L (2018)Autoencoders for music sound modeling a comparison of linear shallow deeprecurrent and variational modelsarXiv preprint arXiv180604096

[Roebel and Rodet 2005] Roebel A and Rodet X (2005)Efficient Spectral Envelope Estimation and its application to pitch shifting andenvelope preservationIn International Conference on Digital Audio Effects pages 30ndash35 Madrid Spaincote interne IRCAM Roebel05b

[Romani Picas et al 2015] Romani Picas O Parra Rodriguez H Dabiri DTokuda H Hariya W Oishi K and Serra X (2015)A real-time system for measuring sound goodness in instrumental soundsIn Audio Engineering Society Convention 138 Audio Engineering Society

[Sarroff and Casey 2014] Sarroff A M and Casey M A (2014)Musical audio synthesis using autoencoding neural netsIn ICMC

[Serra et al 1997] Serra X et al (1997)Musical sound modeling with sinusoids plus noiseMusical signal processing pages 91ndash122

3436

References IV

[Slawson 1981] Slawson W (1981)The color of sound a theoretical study in musical timbreMusic Theory Spectrum 3132ndash141

[Sohn et al 2015] Sohn K Lee H and Yan X (2015)Learning structured output representation using deep conditional generative models

In Advances in neural information processing systems pages 3483ndash3491

[Wyse 2018] Wyse L (2018)Real-valued parametric conditioning of an rnn for interactive sound synthesisarXiv preprint arXiv180510808

3536

Audio examples description I

1 Input MIDI 63 to Spectral Model

2 Spectral Model Reconstruction(trained on MIDI63)

3 Spectral Model Reconstruction(not trained on MIDI63)

4 Input MIDI 60 note to Parametric Model

5 Parametric Reconstruction of input note

6 Input MIDI 63 Note

7 Parametric AE reconstruction of input

8 Parametric CVAE reconstruction of input

9 Input MIDI 65 note(endpoint trained model)

10 Parametric AE reconstruction of input(endpoint trained model)

11 Parametric CVAE reconstruction of input(endpoint trained model)

12 CVAE Generated MIDI 65 Violin note

13 Similar MIDI 65 Violin note from dataset

3636

Audio examples description II

14 CVAE Reconstruction of the MIDI 65 violin note

15 CVAE Generated MIDI 65 Violin note with vibrato

  • Notes
      1. fdrm17
      2. fdrm16
      3. fdrm15
      4. fdrm14
      5. fdrm13
      6. fdrm12
      7. fdrm11
      8. fdrm10
      9. fdrm9
      10. fdrm8
      11. fdrm7
      12. fdrm6
      13. fdrm5
      14. fdrm3
      15. fdrm1
Page 64: A Variational Parametric Model for Audio Synthesiskrishnasubramani/data/ddp_stage… · analysis-synthesis based reconstruction which fails to model temporal evolution I[Engel et

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2436

Reconstruction

60 62 64 66 68 70 72

1

2

3

4

5middot10minus2

Pitch

MSE

AECVAE

I Both AE and CVAE reasonably reconstruct the target pitch1 6 2 7 3 8

I CVAE produces better reconstruction especially when thetarget pitch is far from the pitches available in the trainingdata(plot above) 4 9 5 10 6 11

I Conditioning helps to capture the pitch dependency of thespectral envelope more accurately

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton8)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton9)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton10)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton11)ocgs[i]state=false

1536

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton12)ocgs[i]state=false

1536

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton13)ocgs[i]state=false

1536

2536

Reconstruction

I To emulate the effect of pitch conditioning with an AE wetrain the AE by appending the pitch to the input CCs andreconstructing this appended input as shown below

f0

CCscAE

fprime

0

CCsprime

Figure lsquoConditionalrsquo AE(cAE)

I [Wyse 2018] followed a similar approach of appending theconditional variables to the input of his model

I the lsquocAErsquo is comparable to our proposed CVAE in that thenetwork might potentially learn something from f0

2636

Reconstruction

60 62 64 66 68 70 72

1

2

3

4

5middot10minus2

Pitch

MSE

AECVAEcAE

I We train the cAE on the octave endpoints and evaluateperformance as shown above

I Appending f0 does not improve performance It is infact worsethan AE

2736

Generation

I We are interested in lsquosynthesizingrsquo audio

I We see how well the network can generate an instance of adesired pitch(which the network has not been trained on)

Z

Dec

od

er

f0

Figure Sampling from the Network

I We follow the procedure in [Blaauw and Bonada 2016] - arandom walk to sample points coherently from the latent space

2836

Generation

I We train on instances across the entire octave sans MIDI 65and then generate MIDI 65

I On listening we can see that the generated audio still lacks thesoft noisy sound of the violin bowing

1 12 2 13 3 14

I For a more realistic synthesis we generate a violin note withvibrato 4 15

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton14)ocgs[i]state=false

2184

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton15)ocgs[i]state=false

2256

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton16)ocgs[i]state=false

2184

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton17)ocgs[i]state=false

2184

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3136

References I

[Blaauw and Bonada 2016] Blaauw M and Bonada J (2016)Modeling and transforming speech using variational autoencodersIn Interspeech pages 1770ndash1774

[Caetano and Rodet 2012] Caetano M and Rodet X (2012)A source-filter model for musical instrument sound transformationIn 2012 IEEE International Conference on Acoustics Speech and Signal Processing(ICASSP) pages 137ndash140 IEEE

[Doersch 2016] Doersch C (2016)Tutorial on variational autoencodersarXiv preprint arXiv160605908

[Engel et al 2017] Engel J Resnick C Roberts A Dieleman S Norouzi MEck D and Simonyan K (2017)Neural audio synthesis of musical notes with wavenet autoencodersIn Proceedings of the 34th International Conference on Machine Learning-Volume70 pages 1068ndash1077 JMLR org

[Esling et al 2018] Esling P Bitton A et al (2018)Generative timbre spaces regularizing variational auto-encoders with perceptualmetricsarXiv preprint arXiv180508501

3236

References II

[Griffin and Lim 1984] Griffin D and Lim J (1984)Signal estimation from modified short-time fourier transformIEEE Transactions on Acoustics Speech and Signal Processing 32(2)236ndash243

[Hinton and Salakhutdinov 2006] Hinton G E and Salakhutdinov R R (2006)Reducing the dimensionality of data with neural networksscience 313(5786)504ndash507

[IMAI 1979] IMAI S (1979)Spectral envelope extraction by improved cepstrumIEICE 62217ndash228

[Kingma and Ba 2014] Kingma D P and Ba J (2014)Adam A method for stochastic optimizationarXiv preprint arXiv14126980

[Kingma and Welling 2013] Kingma D P and Welling M (2013)Auto-encoding variational bayesarXiv preprint arXiv13126114

[Oord et al 2016] Oord A v d Dieleman S Zen H Simonyan K Vinyals OGraves A Kalchbrenner N Senior A and Kavukcuoglu K (2016)Wavenet A generative model for raw audioarXiv preprint arXiv160903499

3336

References III

[Roche et al., 2018] Roche, F., Hueber, T., Limier, S., and Girin, L. (2018). Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models. arXiv preprint arXiv:1806.04096.

[Roebel and Rodet, 2005] Roebel, A. and Rodet, X. (2005). Efficient spectral envelope estimation and its application to pitch shifting and envelope preservation. In International Conference on Digital Audio Effects, pages 30–35, Madrid, Spain.

[Romani Picas et al., 2015] Romani Picas, O., Parra Rodriguez, H., Dabiri, D., Tokuda, H., Hariya, W., Oishi, K., and Serra, X. (2015). A real-time system for measuring sound goodness in instrumental sounds. In Audio Engineering Society Convention 138. Audio Engineering Society.

[Sarroff and Casey, 2014] Sarroff, A. M. and Casey, M. A. (2014). Musical audio synthesis using autoencoding neural nets. In ICMC.

[Serra et al., 1997] Serra, X. et al. (1997). Musical sound modeling with sinusoids plus noise. Musical Signal Processing, pages 91–122.

34/36

References IV

[Slawson, 1981] Slawson, W. (1981). The color of sound: a theoretical study in musical timbre. Music Theory Spectrum, 3:132–141.

[Sohn et al., 2015] Sohn, K., Lee, H., and Yan, X. (2015). Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483–3491.

[Wyse, 2018] Wyse, L. (2018). Real-valued parametric conditioning of an RNN for interactive sound synthesis. arXiv preprint arXiv:1805.10808.

35/36

Audio examples description I

1 Input MIDI 63 note to the Spectral Model
2 Spectral Model reconstruction (trained on MIDI 63)
3 Spectral Model reconstruction (not trained on MIDI 63)
4 Input MIDI 60 note to the Parametric Model
5 Parametric reconstruction of the input note
6 Input MIDI 63 note
7 Parametric AE reconstruction of the input
8 Parametric CVAE reconstruction of the input
9 Input MIDI 65 note (endpoint-trained model)
10 Parametric AE reconstruction of the input (endpoint-trained model)
11 Parametric CVAE reconstruction of the input (endpoint-trained model)
12 CVAE-generated MIDI 65 violin note
13 Similar MIDI 65 violin note from the dataset

36/36

Audio examples description II

14 CVAE reconstruction of the MIDI 65 violin note
15 CVAE-generated MIDI 65 violin note with vibrato

Page 65: A Variational Parametric Model for Audio Synthesiskrishnasubramani/data/ddp_stage… · analysis-synthesis based reconstruction which fails to model temporal evolution I[Engel et

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2436

Reconstruction

60 62 64 66 68 70 72

1

2

3

4

5middot10minus2

Pitch

MSE

AECVAE

I Both AE and CVAE reasonably reconstruct the target pitch1 6 2 7 3 8

I CVAE produces better reconstruction especially when thetarget pitch is far from the pitches available in the trainingdata(plot above) 4 9 5 10 6 11

I Conditioning helps to capture the pitch dependency of thespectral envelope more accurately

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton8)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton9)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton10)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton11)ocgs[i]state=false

1536

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton12)ocgs[i]state=false

1536

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton13)ocgs[i]state=false

1536

2536

Reconstruction

I To emulate the effect of pitch conditioning with an AE wetrain the AE by appending the pitch to the input CCs andreconstructing this appended input as shown below

f0

CCscAE

fprime

0

CCsprime

Figure lsquoConditionalrsquo AE(cAE)

I [Wyse 2018] followed a similar approach of appending theconditional variables to the input of his model

I the lsquocAErsquo is comparable to our proposed CVAE in that thenetwork might potentially learn something from f0

2636

Reconstruction

60 62 64 66 68 70 72

1

2

3

4

5middot10minus2

Pitch

MSE

AECVAEcAE

I We train the cAE on the octave endpoints and evaluateperformance as shown above

I Appending f0 does not improve performance It is infact worsethan AE

2736

Generation

I We are interested in lsquosynthesizingrsquo audio

I We see how well the network can generate an instance of adesired pitch(which the network has not been trained on)

Z

Dec

od

er

f0

Figure Sampling from the Network

I We follow the procedure in [Blaauw and Bonada 2016] - arandom walk to sample points coherently from the latent space

2836

Generation

I We train on instances across the entire octave sans MIDI 65and then generate MIDI 65

I On listening we can see that the generated audio still lacks thesoft noisy sound of the violin bowing

1 12 2 13 3 14

I For a more realistic synthesis we generate a violin note withvibrato 4 15

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton14)ocgs[i]state=false

2184

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton15)ocgs[i]state=false

2256

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton16)ocgs[i]state=false

2184

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton17)ocgs[i]state=false

2184

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3136

References I

[Blaauw and Bonada 2016] Blaauw M and Bonada J (2016)Modeling and transforming speech using variational autoencodersIn Interspeech pages 1770ndash1774

[Caetano and Rodet 2012] Caetano M and Rodet X (2012)A source-filter model for musical instrument sound transformationIn 2012 IEEE International Conference on Acoustics Speech and Signal Processing(ICASSP) pages 137ndash140 IEEE

[Doersch 2016] Doersch C (2016)Tutorial on variational autoencodersarXiv preprint arXiv160605908

[Engel et al 2017] Engel J Resnick C Roberts A Dieleman S Norouzi MEck D and Simonyan K (2017)Neural audio synthesis of musical notes with wavenet autoencodersIn Proceedings of the 34th International Conference on Machine Learning-Volume70 pages 1068ndash1077 JMLR org

[Esling et al 2018] Esling P Bitton A et al (2018)Generative timbre spaces regularizing variational auto-encoders with perceptualmetricsarXiv preprint arXiv180508501

3236

References II

[Griffin and Lim 1984] Griffin D and Lim J (1984)Signal estimation from modified short-time fourier transformIEEE Transactions on Acoustics Speech and Signal Processing 32(2)236ndash243

[Hinton and Salakhutdinov 2006] Hinton G E and Salakhutdinov R R (2006)Reducing the dimensionality of data with neural networksscience 313(5786)504ndash507

[IMAI 1979] IMAI S (1979)Spectral envelope extraction by improved cepstrumIEICE 62217ndash228

[Kingma and Ba 2014] Kingma D P and Ba J (2014)Adam A method for stochastic optimizationarXiv preprint arXiv14126980

[Kingma and Welling 2013] Kingma D P and Welling M (2013)Auto-encoding variational bayesarXiv preprint arXiv13126114

[Oord et al 2016] Oord A v d Dieleman S Zen H Simonyan K Vinyals OGraves A Kalchbrenner N Senior A and Kavukcuoglu K (2016)Wavenet A generative model for raw audioarXiv preprint arXiv160903499

3336

References III

[Roche et al 2018] Roche F Hueber T Limier S and Girin L (2018)Autoencoders for music sound modeling a comparison of linear shallow deeprecurrent and variational modelsarXiv preprint arXiv180604096

[Roebel and Rodet 2005] Roebel A and Rodet X (2005)Efficient Spectral Envelope Estimation and its application to pitch shifting andenvelope preservationIn International Conference on Digital Audio Effects pages 30ndash35 Madrid Spaincote interne IRCAM Roebel05b

[Romani Picas et al 2015] Romani Picas O Parra Rodriguez H Dabiri DTokuda H Hariya W Oishi K and Serra X (2015)A real-time system for measuring sound goodness in instrumental soundsIn Audio Engineering Society Convention 138 Audio Engineering Society

[Sarroff and Casey 2014] Sarroff A M and Casey M A (2014)Musical audio synthesis using autoencoding neural netsIn ICMC

[Serra et al 1997] Serra X et al (1997)Musical sound modeling with sinusoids plus noiseMusical signal processing pages 91ndash122

3436

References IV

[Slawson 1981] Slawson W (1981)The color of sound a theoretical study in musical timbreMusic Theory Spectrum 3132ndash141

[Sohn et al 2015] Sohn K Lee H and Yan X (2015)Learning structured output representation using deep conditional generative models

In Advances in neural information processing systems pages 3483ndash3491

[Wyse 2018] Wyse L (2018)Real-valued parametric conditioning of an rnn for interactive sound synthesisarXiv preprint arXiv180510808

3536

Audio examples description I

1 Input MIDI 63 to Spectral Model

2 Spectral Model Reconstruction(trained on MIDI63)

3 Spectral Model Reconstruction(not trained on MIDI63)

4 Input MIDI 60 note to Parametric Model

5 Parametric Reconstruction of input note

6 Input MIDI 63 Note

7 Parametric AE reconstruction of input

8 Parametric CVAE reconstruction of input

9 Input MIDI 65 note(endpoint trained model)

10 Parametric AE reconstruction of input(endpoint trained model)

11 Parametric CVAE reconstruction of input(endpoint trained model)

12 CVAE Generated MIDI 65 Violin note

13 Similar MIDI 65 Violin note from dataset

3636

Audio examples description II

14 CVAE Reconstruction of the MIDI 65 violin note

15 CVAE Generated MIDI 65 Violin note with vibrato

  • Notes
      1. fdrm17
      2. fdrm16
      3. fdrm15
      4. fdrm14
      5. fdrm13
      6. fdrm12
      7. fdrm11
      8. fdrm10
      9. fdrm9
      10. fdrm8
      11. fdrm7
      12. fdrm6
      13. fdrm5
      14. fdrm3
      15. fdrm1
Page 66: A Variational Parametric Model for Audio Synthesiskrishnasubramani/data/ddp_stage… · analysis-synthesis based reconstruction which fails to model temporal evolution I[Engel et

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2436

Reconstruction

60 62 64 66 68 70 72

1

2

3

4

5middot10minus2

Pitch

MSE

AECVAE

I Both AE and CVAE reasonably reconstruct the target pitch1 6 2 7 3 8

I CVAE produces better reconstruction especially when thetarget pitch is far from the pitches available in the trainingdata(plot above) 4 9 5 10 6 11

I Conditioning helps to capture the pitch dependency of thespectral envelope more accurately

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton8)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton9)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton10)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton11)ocgs[i]state=false

1536

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton12)ocgs[i]state=false

1536

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton13)ocgs[i]state=false

1536

2536

Reconstruction

I To emulate the effect of pitch conditioning with an AE wetrain the AE by appending the pitch to the input CCs andreconstructing this appended input as shown below

f0

CCscAE

fprime

0

CCsprime

Figure lsquoConditionalrsquo AE(cAE)

I [Wyse 2018] followed a similar approach of appending theconditional variables to the input of his model

I the lsquocAErsquo is comparable to our proposed CVAE in that thenetwork might potentially learn something from f0

2636

Reconstruction

60 62 64 66 68 70 72

1

2

3

4

5middot10minus2

Pitch

MSE

AECVAEcAE

I We train the cAE on the octave endpoints and evaluateperformance as shown above

I Appending f0 does not improve performance It is infact worsethan AE

2736

Generation

I We are interested in lsquosynthesizingrsquo audio

I We see how well the network can generate an instance of adesired pitch(which the network has not been trained on)

Z

Dec

od

er

f0

Figure Sampling from the Network

I We follow the procedure in [Blaauw and Bonada 2016] - arandom walk to sample points coherently from the latent space

2836

Generation

I We train on instances across the entire octave sans MIDI 65and then generate MIDI 65

I On listening we can see that the generated audio still lacks thesoft noisy sound of the violin bowing

1 12 2 13 3 14

I For a more realistic synthesis we generate a violin note withvibrato 4 15

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton14)ocgs[i]state=false

2184

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton15)ocgs[i]state=false

2256

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton16)ocgs[i]state=false

2184

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton17)ocgs[i]state=false

2184

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3136

References I

[Blaauw and Bonada 2016] Blaauw M and Bonada J (2016)Modeling and transforming speech using variational autoencodersIn Interspeech pages 1770ndash1774

[Caetano and Rodet 2012] Caetano M and Rodet X (2012)A source-filter model for musical instrument sound transformationIn 2012 IEEE International Conference on Acoustics Speech and Signal Processing(ICASSP) pages 137ndash140 IEEE

[Doersch 2016] Doersch C (2016)Tutorial on variational autoencodersarXiv preprint arXiv160605908

[Engel et al 2017] Engel J Resnick C Roberts A Dieleman S Norouzi MEck D and Simonyan K (2017)Neural audio synthesis of musical notes with wavenet autoencodersIn Proceedings of the 34th International Conference on Machine Learning-Volume70 pages 1068ndash1077 JMLR org

[Esling et al 2018] Esling P Bitton A et al (2018)Generative timbre spaces regularizing variational auto-encoders with perceptualmetricsarXiv preprint arXiv180508501

3236

References II

[Griffin and Lim 1984] Griffin D and Lim J (1984)Signal estimation from modified short-time fourier transformIEEE Transactions on Acoustics Speech and Signal Processing 32(2)236ndash243

[Hinton and Salakhutdinov 2006] Hinton G E and Salakhutdinov R R (2006)Reducing the dimensionality of data with neural networksscience 313(5786)504ndash507

[IMAI 1979] IMAI S (1979)Spectral envelope extraction by improved cepstrumIEICE 62217ndash228

[Kingma and Ba 2014] Kingma D P and Ba J (2014)Adam A method for stochastic optimizationarXiv preprint arXiv14126980

[Kingma and Welling 2013] Kingma D P and Welling M (2013)Auto-encoding variational bayesarXiv preprint arXiv13126114

[Oord et al 2016] Oord A v d Dieleman S Zen H Simonyan K Vinyals OGraves A Kalchbrenner N Senior A and Kavukcuoglu K (2016)Wavenet A generative model for raw audioarXiv preprint arXiv160903499

3336

References III

[Roche et al 2018] Roche F Hueber T Limier S and Girin L (2018)Autoencoders for music sound modeling a comparison of linear shallow deeprecurrent and variational modelsarXiv preprint arXiv180604096

[Roebel and Rodet 2005] Roebel A and Rodet X (2005)Efficient Spectral Envelope Estimation and its application to pitch shifting andenvelope preservationIn International Conference on Digital Audio Effects pages 30ndash35 Madrid Spaincote interne IRCAM Roebel05b

[Romani Picas et al 2015] Romani Picas O Parra Rodriguez H Dabiri DTokuda H Hariya W Oishi K and Serra X (2015)A real-time system for measuring sound goodness in instrumental soundsIn Audio Engineering Society Convention 138 Audio Engineering Society

[Sarroff and Casey 2014] Sarroff A M and Casey M A (2014)Musical audio synthesis using autoencoding neural netsIn ICMC

[Serra et al 1997] Serra X et al (1997)Musical sound modeling with sinusoids plus noiseMusical signal processing pages 91ndash122

3436

References IV

[Slawson 1981] Slawson W (1981)The color of sound a theoretical study in musical timbreMusic Theory Spectrum 3132ndash141

[Sohn et al 2015] Sohn K Lee H and Yan X (2015)Learning structured output representation using deep conditional generative models

In Advances in neural information processing systems pages 3483ndash3491

[Wyse 2018] Wyse L (2018)Real-valued parametric conditioning of an rnn for interactive sound synthesisarXiv preprint arXiv180510808

3536

Audio examples description I

1 Input MIDI 63 to Spectral Model

2 Spectral Model Reconstruction(trained on MIDI63)

3 Spectral Model Reconstruction(not trained on MIDI63)

4 Input MIDI 60 note to Parametric Model

5 Parametric Reconstruction of input note

6 Input MIDI 63 Note

7 Parametric AE reconstruction of input

8 Parametric CVAE reconstruction of input

9 Input MIDI 65 note(endpoint trained model)

10 Parametric AE reconstruction of input(endpoint trained model)

11 Parametric CVAE reconstruction of input(endpoint trained model)

12 CVAE Generated MIDI 65 Violin note

13 Similar MIDI 65 Violin note from dataset

3636

Audio examples description II

14 CVAE Reconstruction of the MIDI 65 violin note

15 CVAE Generated MIDI 65 Violin note with vibrato

  • Notes
      1. fdrm17
      2. fdrm16
      3. fdrm15
      4. fdrm14
      5. fdrm13
      6. fdrm12
      7. fdrm11
      8. fdrm10
      9. fdrm9
      10. fdrm8
      11. fdrm7
      12. fdrm6
      13. fdrm5
      14. fdrm3
      15. fdrm1
Page 67: A Variational Parametric Model for Audio Synthesiskrishnasubramani/data/ddp_stage… · analysis-synthesis based reconstruction which fails to model temporal evolution I[Engel et

2336

Reconstruction

I Two training contexts -

1 T(times) is target X is training instancesMIDI T - 3 T - 2 T - 1 T T + 1 T + 2 T + 3Kept X X X times X X X

2 Octave endpointsMIDI 60 61 62 63 64 65Kept X times times times times timesMIDI 66 67 68 69 70 71Kept times times times times times X

I In each of the above cases we compute the MSE as theframe-wise spectral envelope match across all frames of all thetarget instances

2436

Reconstruction

60 62 64 66 68 70 72

1

2

3

4

5middot10minus2

Pitch

MSE

AECVAE

I Both AE and CVAE reasonably reconstruct the target pitch1 6 2 7 3 8

I CVAE produces better reconstruction especially when thetarget pitch is far from the pitches available in the trainingdata(plot above) 4 9 5 10 6 11

I Conditioning helps to capture the pitch dependency of thespectral envelope more accurately

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton8)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton9)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton10)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton11)ocgs[i]state=false

1536

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton12)ocgs[i]state=false

1536

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton13)ocgs[i]state=false

1536

2536

Reconstruction

I To emulate the effect of pitch conditioning with an AE wetrain the AE by appending the pitch to the input CCs andreconstructing this appended input as shown below

f0

CCscAE

fprime

0

CCsprime

Figure lsquoConditionalrsquo AE(cAE)

I [Wyse 2018] followed a similar approach of appending theconditional variables to the input of his model

I the lsquocAErsquo is comparable to our proposed CVAE in that thenetwork might potentially learn something from f0

2636

Reconstruction

60 62 64 66 68 70 72

1

2

3

4

5middot10minus2

Pitch

MSE

AECVAEcAE

I We train the cAE on the octave endpoints and evaluateperformance as shown above

I Appending f0 does not improve performance It is infact worsethan AE

2736

Generation

I We are interested in lsquosynthesizingrsquo audio

I We see how well the network can generate an instance of adesired pitch(which the network has not been trained on)

Z

Dec

od

er

f0

Figure Sampling from the Network

I We follow the procedure in [Blaauw and Bonada 2016] - arandom walk to sample points coherently from the latent space

2836

Generation

I We train on instances across the entire octave sans MIDI 65and then generate MIDI 65

I On listening we can see that the generated audio still lacks thesoft noisy sound of the violin bowing

1 12 2 13 3 14

I For a more realistic synthesis we generate a violin note withvibrato 4 15

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton14)ocgs[i]state=false

2184

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton15)ocgs[i]state=false

2256

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton16)ocgs[i]state=false

2184

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton17)ocgs[i]state=false

2184

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3136

References I

[Blaauw and Bonada 2016] Blaauw M and Bonada J (2016)Modeling and transforming speech using variational autoencodersIn Interspeech pages 1770ndash1774

[Caetano and Rodet 2012] Caetano M and Rodet X (2012)A source-filter model for musical instrument sound transformationIn 2012 IEEE International Conference on Acoustics Speech and Signal Processing(ICASSP) pages 137ndash140 IEEE

[Doersch 2016] Doersch C (2016)Tutorial on variational autoencodersarXiv preprint arXiv160605908

[Engel et al 2017] Engel J Resnick C Roberts A Dieleman S Norouzi MEck D and Simonyan K (2017)Neural audio synthesis of musical notes with wavenet autoencodersIn Proceedings of the 34th International Conference on Machine Learning-Volume70 pages 1068ndash1077 JMLR org

[Esling et al 2018] Esling P Bitton A et al (2018)Generative timbre spaces regularizing variational auto-encoders with perceptualmetricsarXiv preprint arXiv180508501

3236

References II

[Griffin and Lim 1984] Griffin D and Lim J (1984)Signal estimation from modified short-time fourier transformIEEE Transactions on Acoustics Speech and Signal Processing 32(2)236ndash243

[Hinton and Salakhutdinov 2006] Hinton G E and Salakhutdinov R R (2006)Reducing the dimensionality of data with neural networksscience 313(5786)504ndash507

[IMAI 1979] IMAI S (1979)Spectral envelope extraction by improved cepstrumIEICE 62217ndash228

[Kingma and Ba 2014] Kingma D P and Ba J (2014)Adam A method for stochastic optimizationarXiv preprint arXiv14126980

[Kingma and Welling 2013] Kingma D P and Welling M (2013)Auto-encoding variational bayesarXiv preprint arXiv13126114

[Oord et al 2016] Oord A v d Dieleman S Zen H Simonyan K Vinyals OGraves A Kalchbrenner N Senior A and Kavukcuoglu K (2016)Wavenet A generative model for raw audioarXiv preprint arXiv160903499

3336

References III

[Roche et al 2018] Roche F Hueber T Limier S and Girin L (2018)Autoencoders for music sound modeling a comparison of linear shallow deeprecurrent and variational modelsarXiv preprint arXiv180604096

[Roebel and Rodet 2005] Roebel A and Rodet X (2005)Efficient Spectral Envelope Estimation and its application to pitch shifting andenvelope preservationIn International Conference on Digital Audio Effects pages 30ndash35 Madrid Spaincote interne IRCAM Roebel05b

[Romani Picas et al 2015] Romani Picas O Parra Rodriguez H Dabiri DTokuda H Hariya W Oishi K and Serra X (2015)A real-time system for measuring sound goodness in instrumental soundsIn Audio Engineering Society Convention 138 Audio Engineering Society

[Sarroff and Casey 2014] Sarroff A M and Casey M A (2014)Musical audio synthesis using autoencoding neural netsIn ICMC

[Serra et al 1997] Serra X et al (1997)Musical sound modeling with sinusoids plus noiseMusical signal processing pages 91ndash122

3436

References IV

[Slawson 1981] Slawson W (1981)The color of sound a theoretical study in musical timbreMusic Theory Spectrum 3132ndash141

[Sohn et al 2015] Sohn K Lee H and Yan X (2015)Learning structured output representation using deep conditional generative models

In Advances in neural information processing systems pages 3483ndash3491

[Wyse 2018] Wyse L (2018)Real-valued parametric conditioning of an rnn for interactive sound synthesisarXiv preprint arXiv180510808

3536

Audio examples description I

1 Input MIDI 63 to Spectral Model

2 Spectral Model Reconstruction(trained on MIDI63)

3 Spectral Model Reconstruction(not trained on MIDI63)

4 Input MIDI 60 note to Parametric Model

5 Parametric Reconstruction of input note

6 Input MIDI 63 Note

7 Parametric AE reconstruction of input

8 Parametric CVAE reconstruction of input

9 Input MIDI 65 note(endpoint trained model)

10 Parametric AE reconstruction of input(endpoint trained model)

11 Parametric CVAE reconstruction of input(endpoint trained model)

12 CVAE Generated MIDI 65 Violin note

13 Similar MIDI 65 Violin note from dataset

3636

Audio examples description II

14 CVAE Reconstruction of the MIDI 65 violin note

15 CVAE Generated MIDI 65 Violin note with vibrato

  • Notes
      1. fdrm17
      2. fdrm16
      3. fdrm15
      4. fdrm14
      5. fdrm13
      6. fdrm12
      7. fdrm11
      8. fdrm10
      9. fdrm9
      10. fdrm8
      11. fdrm7
      12. fdrm6
      13. fdrm5
      14. fdrm3
      15. fdrm1
Page 68: A Variational Parametric Model for Audio Synthesiskrishnasubramani/data/ddp_stage… · analysis-synthesis based reconstruction which fails to model temporal evolution I[Engel et

2436

Reconstruction

60 62 64 66 68 70 72

1

2

3

4

5middot10minus2

Pitch

MSE

AECVAE

I Both AE and CVAE reasonably reconstruct the target pitch1 6 2 7 3 8

I CVAE produces better reconstruction especially when thetarget pitch is far from the pitches available in the trainingdata(plot above) 4 9 5 10 6 11

I Conditioning helps to capture the pitch dependency of thespectral envelope more accurately

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton8)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton9)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton10)ocgs[i]state=false

1704

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton11)ocgs[i]state=false

1536

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton12)ocgs[i]state=false

1536

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton13)ocgs[i]state=false

1536

2536

Reconstruction

I To emulate the effect of pitch conditioning with an AE wetrain the AE by appending the pitch to the input CCs andreconstructing this appended input as shown below

f0

CCscAE

fprime

0

CCsprime

Figure lsquoConditionalrsquo AE(cAE)

I [Wyse 2018] followed a similar approach of appending theconditional variables to the input of his model

I the lsquocAErsquo is comparable to our proposed CVAE in that thenetwork might potentially learn something from f0

2636

Reconstruction

60 62 64 66 68 70 72

1

2

3

4

5middot10minus2

Pitch

MSE

AECVAEcAE

I We train the cAE on the octave endpoints and evaluateperformance as shown above

I Appending f0 does not improve performance It is infact worsethan AE

2736

Generation

I We are interested in lsquosynthesizingrsquo audio

I We see how well the network can generate an instance of adesired pitch(which the network has not been trained on)

Z

Dec

od

er

f0

Figure Sampling from the Network

I We follow the procedure in [Blaauw and Bonada 2016] - arandom walk to sample points coherently from the latent space

2836

Generation

I We train on instances across the entire octave sans MIDI 65and then generate MIDI 65

I On listening we can see that the generated audio still lacks thesoft noisy sound of the violin bowing

1 12 2 13 3 14

I For a more realistic synthesis we generate a violin note withvibrato 4 15

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton14)ocgs[i]state=false

2184

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton15)ocgs[i]state=false

2256

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton16)ocgs[i]state=false

2184

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton17)ocgs[i]state=false

2184

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

31/36

References I

[Blaauw and Bonada, 2016] Blaauw, M. and Bonada, J. (2016). Modeling and transforming speech using variational autoencoders. In Interspeech, pages 1770-1774.

[Caetano and Rodet, 2012] Caetano, M. and Rodet, X. (2012). A source-filter model for musical instrument sound transformation. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 137-140. IEEE.

[Doersch, 2016] Doersch, C. (2016). Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.

[Engel et al., 2017] Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D., and Simonyan, K. (2017). Neural audio synthesis of musical notes with WaveNet autoencoders. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1068-1077. JMLR.org.

[Esling et al., 2018] Esling, P., Bitton, A., et al. (2018). Generative timbre spaces: regularizing variational auto-encoders with perceptual metrics. arXiv preprint arXiv:1805.08501.

32/36

References II

[Griffin and Lim, 1984] Griffin, D. and Lim, J. (1984). Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236-243.

[Hinton and Salakhutdinov, 2006] Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504-507.

[Imai, 1979] Imai, S. (1979). Spectral envelope extraction by improved cepstrum. IEICE, 62:217-228.

[Kingma and Ba, 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

[Kingma and Welling, 2013] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

[Oord et al., 2016] Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.

33/36

References III

[Roche et al., 2018] Roche, F., Hueber, T., Limier, S., and Girin, L. (2018). Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models. arXiv preprint arXiv:1806.04096.

[Roebel and Rodet, 2005] Roebel, A. and Rodet, X. (2005). Efficient Spectral Envelope Estimation and its application to pitch shifting and envelope preservation. In International Conference on Digital Audio Effects, pages 30-35, Madrid, Spain.

[Romani Picas et al., 2015] Romani Picas, O., Parra Rodriguez, H., Dabiri, D., Tokuda, H., Hariya, W., Oishi, K., and Serra, X. (2015). A real-time system for measuring sound goodness in instrumental sounds. In Audio Engineering Society Convention 138. Audio Engineering Society.

[Sarroff and Casey, 2014] Sarroff, A. M. and Casey, M. A. (2014). Musical audio synthesis using autoencoding neural nets. In ICMC.

[Serra et al., 1997] Serra, X. et al. (1997). Musical sound modeling with sinusoids plus noise. Musical Signal Processing, pages 91-122.

34/36

References IV

[Slawson, 1981] Slawson, W. (1981). The color of sound: a theoretical study in musical timbre. Music Theory Spectrum, 3:132-141.

[Sohn et al., 2015] Sohn, K., Lee, H., and Yan, X. (2015). Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483-3491.

[Wyse, 2018] Wyse, L. (2018). Real-valued parametric conditioning of an RNN for interactive sound synthesis. arXiv preprint arXiv:1805.10808.

35/36

Audio examples description I

1 Input MIDI 63 to Spectral Model

2 Spectral Model Reconstruction (trained on MIDI 63)

3 Spectral Model Reconstruction (not trained on MIDI 63)

4 Input MIDI 60 note to Parametric Model

5 Parametric Reconstruction of input note

6 Input MIDI 63 Note

7 Parametric AE reconstruction of input

8 Parametric CVAE reconstruction of input

9 Input MIDI 65 note (endpoint-trained model)

10 Parametric AE reconstruction of input (endpoint-trained model)

11 Parametric CVAE reconstruction of input (endpoint-trained model)

12 CVAE Generated MIDI 65 Violin note

13 Similar MIDI 65 Violin note from dataset

36/36

Audio examples description II

14 CVAE Reconstruction of the MIDI 65 violin note

15 CVAE Generated MIDI 65 Violin note with vibrato


2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3136

References I

[Blaauw and Bonada 2016] Blaauw M and Bonada J (2016)Modeling and transforming speech using variational autoencodersIn Interspeech pages 1770ndash1774

[Caetano and Rodet 2012] Caetano M and Rodet X (2012)A source-filter model for musical instrument sound transformationIn 2012 IEEE International Conference on Acoustics Speech and Signal Processing(ICASSP) pages 137ndash140 IEEE

[Doersch 2016] Doersch C (2016)Tutorial on variational autoencodersarXiv preprint arXiv160605908

[Engel et al 2017] Engel J Resnick C Roberts A Dieleman S Norouzi MEck D and Simonyan K (2017)Neural audio synthesis of musical notes with wavenet autoencodersIn Proceedings of the 34th International Conference on Machine Learning-Volume70 pages 1068ndash1077 JMLR org

[Esling et al 2018] Esling P Bitton A et al (2018)Generative timbre spaces regularizing variational auto-encoders with perceptualmetricsarXiv preprint arXiv180508501

3236

References II

[Griffin and Lim 1984] Griffin D and Lim J (1984)Signal estimation from modified short-time fourier transformIEEE Transactions on Acoustics Speech and Signal Processing 32(2)236ndash243

[Hinton and Salakhutdinov 2006] Hinton G E and Salakhutdinov R R (2006)Reducing the dimensionality of data with neural networksscience 313(5786)504ndash507

[IMAI 1979] IMAI S (1979)Spectral envelope extraction by improved cepstrumIEICE 62217ndash228

[Kingma and Ba 2014] Kingma D P and Ba J (2014)Adam A method for stochastic optimizationarXiv preprint arXiv14126980

[Kingma and Welling 2013] Kingma D P and Welling M (2013)Auto-encoding variational bayesarXiv preprint arXiv13126114

[Oord et al 2016] Oord A v d Dieleman S Zen H Simonyan K Vinyals OGraves A Kalchbrenner N Senior A and Kavukcuoglu K (2016)Wavenet A generative model for raw audioarXiv preprint arXiv160903499

3336

References III

[Roche et al 2018] Roche F Hueber T Limier S and Girin L (2018)Autoencoders for music sound modeling a comparison of linear shallow deeprecurrent and variational modelsarXiv preprint arXiv180604096

[Roebel and Rodet 2005] Roebel A and Rodet X (2005)Efficient Spectral Envelope Estimation and its application to pitch shifting andenvelope preservationIn International Conference on Digital Audio Effects pages 30ndash35 Madrid Spaincote interne IRCAM Roebel05b

[Romani Picas et al 2015] Romani Picas O Parra Rodriguez H Dabiri DTokuda H Hariya W Oishi K and Serra X (2015)A real-time system for measuring sound goodness in instrumental soundsIn Audio Engineering Society Convention 138 Audio Engineering Society

[Sarroff and Casey 2014] Sarroff A M and Casey M A (2014)Musical audio synthesis using autoencoding neural netsIn ICMC

[Serra et al 1997] Serra X et al (1997)Musical sound modeling with sinusoids plus noiseMusical signal processing pages 91ndash122

3436

References IV

[Slawson 1981] Slawson W (1981)The color of sound a theoretical study in musical timbreMusic Theory Spectrum 3132ndash141

[Sohn et al 2015] Sohn K Lee H and Yan X (2015)Learning structured output representation using deep conditional generative models

In Advances in neural information processing systems pages 3483ndash3491

[Wyse 2018] Wyse L (2018)Real-valued parametric conditioning of an rnn for interactive sound synthesisarXiv preprint arXiv180510808

3536

Audio examples description I

1 Input MIDI 63 to Spectral Model

2 Spectral Model Reconstruction(trained on MIDI63)

3 Spectral Model Reconstruction(not trained on MIDI63)

4 Input MIDI 60 note to Parametric Model

5 Parametric Reconstruction of input note

6 Input MIDI 63 Note

7 Parametric AE reconstruction of input

8 Parametric CVAE reconstruction of input

9 Input MIDI 65 note(endpoint trained model)

10 Parametric AE reconstruction of input(endpoint trained model)

11 Parametric CVAE reconstruction of input(endpoint trained model)

12 CVAE Generated MIDI 65 Violin note

13 Similar MIDI 65 Violin note from dataset

3636

Audio examples description II

14 CVAE Reconstruction of the MIDI 65 violin note

15 CVAE Generated MIDI 65 Violin note with vibrato

  • Notes
      1. fdrm17
      2. fdrm16
      3. fdrm15
      4. fdrm14
      5. fdrm13
      6. fdrm12
      7. fdrm11
      8. fdrm10
      9. fdrm9
      10. fdrm8
      11. fdrm7
      12. fdrm6
      13. fdrm5
      14. fdrm3
      15. fdrm1
Page 70: A Variational Parametric Model for Audio Synthesiskrishnasubramani/data/ddp_stage… · analysis-synthesis based reconstruction which fails to model temporal evolution I[Engel et

2636

Reconstruction

60 62 64 66 68 70 72

1

2

3

4

5middot10minus2

Pitch

MSE

AECVAEcAE

I We train the cAE on the octave endpoints and evaluateperformance as shown above

I Appending f0 does not improve performance It is infact worsethan AE

2736

Generation

I We are interested in lsquosynthesizingrsquo audio

I We see how well the network can generate an instance of adesired pitch(which the network has not been trained on)

Z

Dec

od

er

f0

Figure Sampling from the Network

I We follow the procedure in [Blaauw and Bonada 2016] - arandom walk to sample points coherently from the latent space

2836

Generation

I We train on instances across the entire octave sans MIDI 65and then generate MIDI 65

I On listening we can see that the generated audio still lacks thesoft noisy sound of the violin bowing

1 12 2 13 3 14

I For a more realistic synthesis we generate a violin note withvibrato 4 15

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton14)ocgs[i]state=false

2184

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton15)ocgs[i]state=false

2256

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton16)ocgs[i]state=false

2184

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton17)ocgs[i]state=false

2184

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3136

References I

[Blaauw and Bonada 2016] Blaauw M and Bonada J (2016)Modeling and transforming speech using variational autoencodersIn Interspeech pages 1770ndash1774

[Caetano and Rodet 2012] Caetano M and Rodet X (2012)A source-filter model for musical instrument sound transformationIn 2012 IEEE International Conference on Acoustics Speech and Signal Processing(ICASSP) pages 137ndash140 IEEE

[Doersch 2016] Doersch C (2016)Tutorial on variational autoencodersarXiv preprint arXiv160605908

[Engel et al 2017] Engel J Resnick C Roberts A Dieleman S Norouzi MEck D and Simonyan K (2017)Neural audio synthesis of musical notes with wavenet autoencodersIn Proceedings of the 34th International Conference on Machine Learning-Volume70 pages 1068ndash1077 JMLR org

[Esling et al 2018] Esling P Bitton A et al (2018)Generative timbre spaces regularizing variational auto-encoders with perceptualmetricsarXiv preprint arXiv180508501

3236

References II

[Griffin and Lim 1984] Griffin D and Lim J (1984)Signal estimation from modified short-time fourier transformIEEE Transactions on Acoustics Speech and Signal Processing 32(2)236ndash243

[Hinton and Salakhutdinov 2006] Hinton G E and Salakhutdinov R R (2006)Reducing the dimensionality of data with neural networksscience 313(5786)504ndash507

[IMAI 1979] IMAI S (1979)Spectral envelope extraction by improved cepstrumIEICE 62217ndash228

[Kingma and Ba 2014] Kingma D P and Ba J (2014)Adam A method for stochastic optimizationarXiv preprint arXiv14126980

[Kingma and Welling 2013] Kingma D P and Welling M (2013)Auto-encoding variational bayesarXiv preprint arXiv13126114

[Oord et al 2016] Oord A v d Dieleman S Zen H Simonyan K Vinyals OGraves A Kalchbrenner N Senior A and Kavukcuoglu K (2016)Wavenet A generative model for raw audioarXiv preprint arXiv160903499

3336

References III

[Roche et al 2018] Roche F Hueber T Limier S and Girin L (2018)Autoencoders for music sound modeling a comparison of linear shallow deeprecurrent and variational modelsarXiv preprint arXiv180604096

[Roebel and Rodet 2005] Roebel A and Rodet X (2005)Efficient Spectral Envelope Estimation and its application to pitch shifting andenvelope preservationIn International Conference on Digital Audio Effects pages 30ndash35 Madrid Spaincote interne IRCAM Roebel05b

[Romani Picas et al 2015] Romani Picas O Parra Rodriguez H Dabiri DTokuda H Hariya W Oishi K and Serra X (2015)A real-time system for measuring sound goodness in instrumental soundsIn Audio Engineering Society Convention 138 Audio Engineering Society

[Sarroff and Casey 2014] Sarroff A M and Casey M A (2014)Musical audio synthesis using autoencoding neural netsIn ICMC

[Serra et al 1997] Serra X et al (1997)Musical sound modeling with sinusoids plus noiseMusical signal processing pages 91ndash122

3436

References IV

[Slawson 1981] Slawson W (1981)The color of sound a theoretical study in musical timbreMusic Theory Spectrum 3132ndash141

[Sohn et al 2015] Sohn K Lee H and Yan X (2015)Learning structured output representation using deep conditional generative models

In Advances in neural information processing systems pages 3483ndash3491

[Wyse 2018] Wyse L (2018)Real-valued parametric conditioning of an rnn for interactive sound synthesisarXiv preprint arXiv180510808

3536

Audio examples description I

1 Input MIDI 63 to Spectral Model

2 Spectral Model Reconstruction(trained on MIDI63)

3 Spectral Model Reconstruction(not trained on MIDI63)

4 Input MIDI 60 note to Parametric Model

5 Parametric Reconstruction of input note

6 Input MIDI 63 Note

7 Parametric AE reconstruction of input

8 Parametric CVAE reconstruction of input

9 Input MIDI 65 note(endpoint trained model)

10 Parametric AE reconstruction of input(endpoint trained model)

11 Parametric CVAE reconstruction of input(endpoint trained model)

12 CVAE Generated MIDI 65 Violin note

13 Similar MIDI 65 Violin note from dataset

3636

Audio examples description II

14 CVAE Reconstruction of the MIDI 65 violin note

15 CVAE Generated MIDI 65 Violin note with vibrato

  • Notes
      1. fdrm17
      2. fdrm16
      3. fdrm15
      4. fdrm14
      5. fdrm13
      6. fdrm12
      7. fdrm11
      8. fdrm10
      9. fdrm9
      10. fdrm8
      11. fdrm7
      12. fdrm6
      13. fdrm5
      14. fdrm3
      15. fdrm1
Page 71: A Variational Parametric Model for Audio Synthesiskrishnasubramani/data/ddp_stage… · analysis-synthesis based reconstruction which fails to model temporal evolution I[Engel et

2736

Generation

I We are interested in lsquosynthesizingrsquo audio

I We see how well the network can generate an instance of adesired pitch(which the network has not been trained on)

Z

Dec

od

er

f0

Figure Sampling from the Network

I We follow the procedure in [Blaauw and Bonada 2016] - arandom walk to sample points coherently from the latent space

2836

Generation

I We train on instances across the entire octave sans MIDI 65and then generate MIDI 65

I On listening we can see that the generated audio still lacks thesoft noisy sound of the violin bowing

1 12 2 13 3 14

I For a more realistic synthesis we generate a violin note withvibrato 4 15

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton14)ocgs[i]state=false

2184

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton15)ocgs[i]state=false

2256

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton16)ocgs[i]state=false

2184

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton17)ocgs[i]state=false

2184

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3136

References I

[Blaauw and Bonada 2016] Blaauw M and Bonada J (2016)Modeling and transforming speech using variational autoencodersIn Interspeech pages 1770ndash1774

[Caetano and Rodet 2012] Caetano M and Rodet X (2012)A source-filter model for musical instrument sound transformationIn 2012 IEEE International Conference on Acoustics Speech and Signal Processing(ICASSP) pages 137ndash140 IEEE

[Doersch 2016] Doersch C (2016)Tutorial on variational autoencodersarXiv preprint arXiv160605908

[Engel et al 2017] Engel J Resnick C Roberts A Dieleman S Norouzi MEck D and Simonyan K (2017)Neural audio synthesis of musical notes with wavenet autoencodersIn Proceedings of the 34th International Conference on Machine Learning-Volume70 pages 1068ndash1077 JMLR org

[Esling et al 2018] Esling P Bitton A et al (2018)Generative timbre spaces regularizing variational auto-encoders with perceptualmetricsarXiv preprint arXiv180508501

3236

References II

[Griffin and Lim 1984] Griffin D and Lim J (1984)Signal estimation from modified short-time fourier transformIEEE Transactions on Acoustics Speech and Signal Processing 32(2)236ndash243

[Hinton and Salakhutdinov 2006] Hinton G E and Salakhutdinov R R (2006)Reducing the dimensionality of data with neural networksscience 313(5786)504ndash507

[IMAI 1979] IMAI S (1979)Spectral envelope extraction by improved cepstrumIEICE 62217ndash228

[Kingma and Ba 2014] Kingma D P and Ba J (2014)Adam A method for stochastic optimizationarXiv preprint arXiv14126980

[Kingma and Welling 2013] Kingma D P and Welling M (2013)Auto-encoding variational bayesarXiv preprint arXiv13126114

[Oord et al 2016] Oord A v d Dieleman S Zen H Simonyan K Vinyals OGraves A Kalchbrenner N Senior A and Kavukcuoglu K (2016)Wavenet A generative model for raw audioarXiv preprint arXiv160903499

3336

References III

[Roche et al 2018] Roche F Hueber T Limier S and Girin L (2018)Autoencoders for music sound modeling a comparison of linear shallow deeprecurrent and variational modelsarXiv preprint arXiv180604096

[Roebel and Rodet 2005] Roebel A and Rodet X (2005)Efficient Spectral Envelope Estimation and its application to pitch shifting andenvelope preservationIn International Conference on Digital Audio Effects pages 30ndash35 Madrid Spaincote interne IRCAM Roebel05b

[Romani Picas et al 2015] Romani Picas O Parra Rodriguez H Dabiri DTokuda H Hariya W Oishi K and Serra X (2015)A real-time system for measuring sound goodness in instrumental soundsIn Audio Engineering Society Convention 138 Audio Engineering Society

[Sarroff and Casey 2014] Sarroff A M and Casey M A (2014)Musical audio synthesis using autoencoding neural netsIn ICMC

[Serra et al 1997] Serra X et al (1997)Musical sound modeling with sinusoids plus noiseMusical signal processing pages 91ndash122

3436

References IV

[Slawson 1981] Slawson W (1981)The color of sound a theoretical study in musical timbreMusic Theory Spectrum 3132ndash141

[Sohn et al 2015] Sohn K Lee H and Yan X (2015)Learning structured output representation using deep conditional generative models

In Advances in neural information processing systems pages 3483ndash3491

[Wyse 2018] Wyse L (2018)Real-valued parametric conditioning of an rnn for interactive sound synthesisarXiv preprint arXiv180510808

3536

Audio examples description I

1 Input MIDI 63 to Spectral Model

2 Spectral Model Reconstruction(trained on MIDI63)

3 Spectral Model Reconstruction(not trained on MIDI63)

4 Input MIDI 60 note to Parametric Model

5 Parametric Reconstruction of input note

6 Input MIDI 63 Note

7 Parametric AE reconstruction of input

8 Parametric CVAE reconstruction of input

9 Input MIDI 65 note(endpoint trained model)

10 Parametric AE reconstruction of input(endpoint trained model)

11 Parametric CVAE reconstruction of input(endpoint trained model)

12 CVAE Generated MIDI 65 Violin note

13 Similar MIDI 65 Violin note from dataset

3636

Audio examples description II

14 CVAE Reconstruction of the MIDI 65 violin note

15 CVAE Generated MIDI 65 Violin note with vibrato

  • Notes
      1. fdrm17
      2. fdrm16
      3. fdrm15
      4. fdrm14
      5. fdrm13
      6. fdrm12
      7. fdrm11
      8. fdrm10
      9. fdrm9
      10. fdrm8
      11. fdrm7
      12. fdrm6
      13. fdrm5
      14. fdrm3
      15. fdrm1
Page 72: A Variational Parametric Model for Audio Synthesiskrishnasubramani/data/ddp_stage… · analysis-synthesis based reconstruction which fails to model temporal evolution I[Engel et

2836

Generation

I We train on instances across the entire octave sans MIDI 65and then generate MIDI 65

I On listening we can see that the generated audio still lacks thesoft noisy sound of the violin bowing

1 12 2 13 3 14

I For a more realistic synthesis we generate a violin note withvibrato 4 15

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton14)ocgs[i]state=false

2184

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton15)ocgs[i]state=false

2256

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton16)ocgs[i]state=false

2184

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton17)ocgs[i]state=false

2184

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

2936

Putting it all together

I We explored autoencoder frameworks as generative models for audio synthesis of instrumental tones

I Our parametric representation decouples 'timbre' and 'pitch', thus relying on the network to model their inter-dependencies

I Pitch conditioning allows us to generate the learnt spectral envelope for a given pitch, thus enabling us to vary the pitch contour continuously (a minimal conditioning sketch follows below)
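As an illustration of how such pitch conditioning can be wired up, here is a small hypothetical PyTorch sketch; the class name, layer sizes and the normalized-MIDI conditioning input are assumptions made for this summary, not the exact thesis architecture. The decoder receives the latent code concatenated with a pitch value, so decoding the same latent code while sweeping the pitch input yields a continuously varying envelope.

import torch
import torch.nn as nn

class PitchConditionedDecoder(nn.Module):
    # CVAE-style decoder: latent code z is concatenated with a pitch condition.
    def __init__(self, latent_dim=32, cond_dim=1, n_env=60):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + cond_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, n_env),  # spectral-envelope (e.g. cepstral) coefficients
        )

    def forward(self, z, pitch):
        # pitch: normalized MIDI value, shape (batch, 1)
        return self.net(torch.cat([z, pitch], dim=-1))

decoder = PitchConditionedDecoder()
z = torch.randn(1, 32)                 # one sample from the latent prior
for midi in (60.0, 62.5, 65.0):        # continuous pitch sweep
    envelope = decoder(z, torch.tensor([[midi / 127.0]]))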

3136

References I

[Blaauw and Bonada, 2016] Blaauw, M. and Bonada, J. (2016). Modeling and transforming speech using variational autoencoders. In Interspeech, pages 1770–1774.

[Caetano and Rodet, 2012] Caetano, M. and Rodet, X. (2012). A source-filter model for musical instrument sound transformation. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 137–140. IEEE.

[Doersch, 2016] Doersch, C. (2016). Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.

[Engel et al., 2017] Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D., and Simonyan, K. (2017). Neural audio synthesis of musical notes with WaveNet autoencoders. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1068–1077. JMLR.org.

[Esling et al., 2018] Esling, P., Bitton, A., et al. (2018). Generative timbre spaces: regularizing variational auto-encoders with perceptual metrics. arXiv preprint arXiv:1805.08501.

3236

References II

[Griffin and Lim, 1984] Griffin, D. and Lim, J. (1984). Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236–243.

[Hinton and Salakhutdinov, 2006] Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507.

[Imai, 1979] Imai, S. (1979). Spectral envelope extraction by improved cepstrum. IEICE, 62:217–228.

[Kingma and Ba, 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

[Kingma and Welling, 2013] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

[Oord et al., 2016] Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.

3336

References III

[Roche et al., 2018] Roche, F., Hueber, T., Limier, S., and Girin, L. (2018). Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models. arXiv preprint arXiv:1806.04096.

[Roebel and Rodet, 2005] Roebel, A. and Rodet, X. (2005). Efficient spectral envelope estimation and its application to pitch shifting and envelope preservation. In International Conference on Digital Audio Effects, pages 30–35, Madrid, Spain.

[Romani Picas et al., 2015] Romani Picas, O., Parra Rodriguez, H., Dabiri, D., Tokuda, H., Hariya, W., Oishi, K., and Serra, X. (2015). A real-time system for measuring sound goodness in instrumental sounds. In Audio Engineering Society Convention 138. Audio Engineering Society.

[Sarroff and Casey, 2014] Sarroff, A. M. and Casey, M. A. (2014). Musical audio synthesis using autoencoding neural nets. In ICMC.

[Serra et al., 1997] Serra, X. et al. (1997). Musical sound modeling with sinusoids plus noise. Musical Signal Processing, pages 91–122.

3436

References IV

[Slawson, 1981] Slawson, W. (1981). The color of sound: a theoretical study in musical timbre. Music Theory Spectrum, 3:132–141.

[Sohn et al., 2015] Sohn, K., Lee, H., and Yan, X. (2015). Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483–3491.

[Wyse, 2018] Wyse, L. (2018). Real-valued parametric conditioning of an RNN for interactive sound synthesis. arXiv preprint arXiv:1805.10808.

3536

Audio examples description I

1 Input MIDI 63 to Spectral Model

2 Spectral Model Reconstruction (trained on MIDI 63)

3 Spectral Model Reconstruction (not trained on MIDI 63)

4 Input MIDI 60 note to Parametric Model

5 Parametric Reconstruction of input note

6 Input MIDI 63 Note

7 Parametric AE reconstruction of input

8 Parametric CVAE reconstruction of input

9 Input MIDI 65 note (endpoint-trained model)

10 Parametric AE reconstruction of input (endpoint-trained model)

11 Parametric CVAE reconstruction of input (endpoint-trained model)

12 CVAE Generated MIDI 65 Violin note

13 Similar MIDI 65 Violin note from dataset

3636

Audio examples description II

14 CVAE Reconstruction of the MIDI 65 violin note

15 CVAE Generated MIDI 65 Violin note with vibrato

  • Notes
      1. fdrm17
      2. fdrm16
      3. fdrm15
      4. fdrm14
      5. fdrm13
      6. fdrm12
      7. fdrm11
      8. fdrm10
      9. fdrm9
      10. fdrm8
      11. fdrm7
      12. fdrm6
      13. fdrm5
      14. fdrm3
      15. fdrm1
Page 74: A Variational Parametric Model for Audio Synthesiskrishnasubramani/data/ddp_stage… · analysis-synthesis based reconstruction which fails to model temporal evolution I[Engel et

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3136

References I

[Blaauw and Bonada 2016] Blaauw M and Bonada J (2016)Modeling and transforming speech using variational autoencodersIn Interspeech pages 1770ndash1774

[Caetano and Rodet 2012] Caetano M and Rodet X (2012)A source-filter model for musical instrument sound transformationIn 2012 IEEE International Conference on Acoustics Speech and Signal Processing(ICASSP) pages 137ndash140 IEEE

[Doersch 2016] Doersch C (2016)Tutorial on variational autoencodersarXiv preprint arXiv160605908

[Engel et al 2017] Engel J Resnick C Roberts A Dieleman S Norouzi MEck D and Simonyan K (2017)Neural audio synthesis of musical notes with wavenet autoencodersIn Proceedings of the 34th International Conference on Machine Learning-Volume70 pages 1068ndash1077 JMLR org

[Esling et al 2018] Esling P Bitton A et al (2018)Generative timbre spaces regularizing variational auto-encoders with perceptualmetricsarXiv preprint arXiv180508501

3236

References II

[Griffin and Lim 1984] Griffin D and Lim J (1984)Signal estimation from modified short-time fourier transformIEEE Transactions on Acoustics Speech and Signal Processing 32(2)236ndash243

[Hinton and Salakhutdinov 2006] Hinton G E and Salakhutdinov R R (2006)Reducing the dimensionality of data with neural networksscience 313(5786)504ndash507

[IMAI 1979] IMAI S (1979)Spectral envelope extraction by improved cepstrumIEICE 62217ndash228

[Kingma and Ba 2014] Kingma D P and Ba J (2014)Adam A method for stochastic optimizationarXiv preprint arXiv14126980

[Kingma and Welling 2013] Kingma D P and Welling M (2013)Auto-encoding variational bayesarXiv preprint arXiv13126114

[Oord et al 2016] Oord A v d Dieleman S Zen H Simonyan K Vinyals OGraves A Kalchbrenner N Senior A and Kavukcuoglu K (2016)Wavenet A generative model for raw audioarXiv preprint arXiv160903499

3336

References III

[Roche et al 2018] Roche F Hueber T Limier S and Girin L (2018)Autoencoders for music sound modeling a comparison of linear shallow deeprecurrent and variational modelsarXiv preprint arXiv180604096

[Roebel and Rodet 2005] Roebel A and Rodet X (2005)Efficient Spectral Envelope Estimation and its application to pitch shifting andenvelope preservationIn International Conference on Digital Audio Effects pages 30ndash35 Madrid Spaincote interne IRCAM Roebel05b

[Romani Picas et al 2015] Romani Picas O Parra Rodriguez H Dabiri DTokuda H Hariya W Oishi K and Serra X (2015)A real-time system for measuring sound goodness in instrumental soundsIn Audio Engineering Society Convention 138 Audio Engineering Society

[Sarroff and Casey 2014] Sarroff A M and Casey M A (2014)Musical audio synthesis using autoencoding neural netsIn ICMC

[Serra et al 1997] Serra X et al (1997)Musical sound modeling with sinusoids plus noiseMusical signal processing pages 91ndash122

3436

References IV

[Slawson 1981] Slawson W (1981)The color of sound a theoretical study in musical timbreMusic Theory Spectrum 3132ndash141

[Sohn et al 2015] Sohn K Lee H and Yan X (2015)Learning structured output representation using deep conditional generative models

In Advances in neural information processing systems pages 3483ndash3491

[Wyse 2018] Wyse L (2018)Real-valued parametric conditioning of an rnn for interactive sound synthesisarXiv preprint arXiv180510808

3536

Audio examples description I

1 Input MIDI 63 to Spectral Model

2 Spectral Model Reconstruction(trained on MIDI63)

3 Spectral Model Reconstruction(not trained on MIDI63)

4 Input MIDI 60 note to Parametric Model

5 Parametric Reconstruction of input note

6 Input MIDI 63 Note

7 Parametric AE reconstruction of input

8 Parametric CVAE reconstruction of input

9 Input MIDI 65 note(endpoint trained model)

10 Parametric AE reconstruction of input(endpoint trained model)

11 Parametric CVAE reconstruction of input(endpoint trained model)

12 CVAE Generated MIDI 65 Violin note

13 Similar MIDI 65 Violin note from dataset

3636

Audio examples description II

14 CVAE Reconstruction of the MIDI 65 violin note

15 CVAE Generated MIDI 65 Violin note with vibrato

  • Notes
      1. fdrm17
      2. fdrm16
      3. fdrm15
      4. fdrm14
      5. fdrm13
      6. fdrm12
      7. fdrm11
      8. fdrm10
      9. fdrm9
      10. fdrm8
      11. fdrm7
      12. fdrm6
      13. fdrm5
      14. fdrm3
      15. fdrm1
Page 75: A Variational Parametric Model for Audio Synthesiskrishnasubramani/data/ddp_stage… · analysis-synthesis based reconstruction which fails to model temporal evolution I[Engel et

2936

Putting it all together

I We explored autoencoder frameworks in generative models foraudio synthesis of instrumental tones

I Our parametric representation decouples lsquotimbrersquo and lsquopitchrsquothus relying on the network to model the inter-dependencies

I Pitch conditioning allows to generate the learnt spectralenvelope for that pitch thus enabling us to vary the pitchcontour continuously

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3136

References I

[Blaauw and Bonada 2016] Blaauw M and Bonada J (2016)Modeling and transforming speech using variational autoencodersIn Interspeech pages 1770ndash1774

[Caetano and Rodet 2012] Caetano M and Rodet X (2012)A source-filter model for musical instrument sound transformationIn 2012 IEEE International Conference on Acoustics Speech and Signal Processing(ICASSP) pages 137ndash140 IEEE

[Doersch 2016] Doersch C (2016)Tutorial on variational autoencodersarXiv preprint arXiv160605908

[Engel et al 2017] Engel J Resnick C Roberts A Dieleman S Norouzi MEck D and Simonyan K (2017)Neural audio synthesis of musical notes with wavenet autoencodersIn Proceedings of the 34th International Conference on Machine Learning-Volume70 pages 1068ndash1077 JMLR org

[Esling et al 2018] Esling P Bitton A et al (2018)Generative timbre spaces regularizing variational auto-encoders with perceptualmetricsarXiv preprint arXiv180508501

3236

References II

[Griffin and Lim 1984] Griffin D and Lim J (1984)Signal estimation from modified short-time fourier transformIEEE Transactions on Acoustics Speech and Signal Processing 32(2)236ndash243

[Hinton and Salakhutdinov 2006] Hinton G E and Salakhutdinov R R (2006)Reducing the dimensionality of data with neural networksscience 313(5786)504ndash507

[IMAI 1979] IMAI S (1979)Spectral envelope extraction by improved cepstrumIEICE 62217ndash228

[Kingma and Ba 2014] Kingma D P and Ba J (2014)Adam A method for stochastic optimizationarXiv preprint arXiv14126980

[Kingma and Welling 2013] Kingma D P and Welling M (2013)Auto-encoding variational bayesarXiv preprint arXiv13126114

[Oord et al 2016] Oord A v d Dieleman S Zen H Simonyan K Vinyals OGraves A Kalchbrenner N Senior A and Kavukcuoglu K (2016)Wavenet A generative model for raw audioarXiv preprint arXiv160903499

3336

References III

[Roche et al 2018] Roche F Hueber T Limier S and Girin L (2018)Autoencoders for music sound modeling a comparison of linear shallow deeprecurrent and variational modelsarXiv preprint arXiv180604096

[Roebel and Rodet 2005] Roebel A and Rodet X (2005)Efficient Spectral Envelope Estimation and its application to pitch shifting andenvelope preservationIn International Conference on Digital Audio Effects pages 30ndash35 Madrid Spaincote interne IRCAM Roebel05b

[Romani Picas et al 2015] Romani Picas O Parra Rodriguez H Dabiri DTokuda H Hariya W Oishi K and Serra X (2015)A real-time system for measuring sound goodness in instrumental soundsIn Audio Engineering Society Convention 138 Audio Engineering Society

[Sarroff and Casey 2014] Sarroff A M and Casey M A (2014)Musical audio synthesis using autoencoding neural netsIn ICMC

[Serra et al 1997] Serra X et al (1997)Musical sound modeling with sinusoids plus noiseMusical signal processing pages 91ndash122

3436

References IV

[Slawson 1981] Slawson W (1981)The color of sound a theoretical study in musical timbreMusic Theory Spectrum 3132ndash141

[Sohn et al 2015] Sohn K Lee H and Yan X (2015)Learning structured output representation using deep conditional generative models

In Advances in neural information processing systems pages 3483ndash3491

[Wyse 2018] Wyse L (2018)Real-valued parametric conditioning of an rnn for interactive sound synthesisarXiv preprint arXiv180510808

3536

Audio examples description I

1 Input MIDI 63 to Spectral Model

2 Spectral Model Reconstruction(trained on MIDI63)

3 Spectral Model Reconstruction(not trained on MIDI63)

4 Input MIDI 60 note to Parametric Model

5 Parametric Reconstruction of input note

6 Input MIDI 63 Note

7 Parametric AE reconstruction of input

8 Parametric CVAE reconstruction of input

9 Input MIDI 65 note(endpoint trained model)

10 Parametric AE reconstruction of input(endpoint trained model)

11 Parametric CVAE reconstruction of input(endpoint trained model)

12 CVAE Generated MIDI 65 Violin note

13 Similar MIDI 65 Violin note from dataset

3636

Audio examples description II

14 CVAE Reconstruction of the MIDI 65 violin note

15 CVAE Generated MIDI 65 Violin note with vibrato

  • Notes
      1. fdrm17
      2. fdrm16
      3. fdrm15
      4. fdrm14
      5. fdrm13
      6. fdrm12
      7. fdrm11
      8. fdrm10
      9. fdrm9
      10. fdrm8
      11. fdrm7
      12. fdrm6
      13. fdrm5
      14. fdrm3
      15. fdrm1
Page 76: A Variational Parametric Model for Audio Synthesiskrishnasubramani/data/ddp_stage… · analysis-synthesis based reconstruction which fails to model temporal evolution I[Engel et

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3136

References I

[Blaauw and Bonada 2016] Blaauw M and Bonada J (2016)Modeling and transforming speech using variational autoencodersIn Interspeech pages 1770ndash1774

[Caetano and Rodet 2012] Caetano M and Rodet X (2012)A source-filter model for musical instrument sound transformationIn 2012 IEEE International Conference on Acoustics Speech and Signal Processing(ICASSP) pages 137ndash140 IEEE

[Doersch 2016] Doersch C (2016)Tutorial on variational autoencodersarXiv preprint arXiv160605908

[Engel et al 2017] Engel J Resnick C Roberts A Dieleman S Norouzi MEck D and Simonyan K (2017)Neural audio synthesis of musical notes with wavenet autoencodersIn Proceedings of the 34th International Conference on Machine Learning-Volume70 pages 1068ndash1077 JMLR org

[Esling et al 2018] Esling P Bitton A et al (2018)Generative timbre spaces regularizing variational auto-encoders with perceptualmetricsarXiv preprint arXiv180508501

3236

References II

[Griffin and Lim 1984] Griffin D and Lim J (1984)Signal estimation from modified short-time fourier transformIEEE Transactions on Acoustics Speech and Signal Processing 32(2)236ndash243

[Hinton and Salakhutdinov 2006] Hinton G E and Salakhutdinov R R (2006)Reducing the dimensionality of data with neural networksscience 313(5786)504ndash507

[IMAI 1979] IMAI S (1979)Spectral envelope extraction by improved cepstrumIEICE 62217ndash228

[Kingma and Ba 2014] Kingma D P and Ba J (2014)Adam A method for stochastic optimizationarXiv preprint arXiv14126980

[Kingma and Welling 2013] Kingma D P and Welling M (2013)Auto-encoding variational bayesarXiv preprint arXiv13126114

[Oord et al 2016] Oord A v d Dieleman S Zen H Simonyan K Vinyals OGraves A Kalchbrenner N Senior A and Kavukcuoglu K (2016)Wavenet A generative model for raw audioarXiv preprint arXiv160903499

3336

References III

[Roche et al 2018] Roche F Hueber T Limier S and Girin L (2018)Autoencoders for music sound modeling a comparison of linear shallow deeprecurrent and variational modelsarXiv preprint arXiv180604096

[Roebel and Rodet 2005] Roebel A and Rodet X (2005)Efficient Spectral Envelope Estimation and its application to pitch shifting andenvelope preservationIn International Conference on Digital Audio Effects pages 30ndash35 Madrid Spaincote interne IRCAM Roebel05b

[Romani Picas et al 2015] Romani Picas O Parra Rodriguez H Dabiri DTokuda H Hariya W Oishi K and Serra X (2015)A real-time system for measuring sound goodness in instrumental soundsIn Audio Engineering Society Convention 138 Audio Engineering Society

[Sarroff and Casey 2014] Sarroff A M and Casey M A (2014)Musical audio synthesis using autoencoding neural netsIn ICMC

[Serra et al 1997] Serra X et al (1997)Musical sound modeling with sinusoids plus noiseMusical signal processing pages 91ndash122

3436

References IV

[Slawson 1981] Slawson W (1981)The color of sound a theoretical study in musical timbreMusic Theory Spectrum 3132ndash141

[Sohn et al 2015] Sohn K Lee H and Yan X (2015)Learning structured output representation using deep conditional generative models

In Advances in neural information processing systems pages 3483ndash3491

[Wyse 2018] Wyse L (2018)Real-valued parametric conditioning of an rnn for interactive sound synthesisarXiv preprint arXiv180510808

3536

Audio examples description I

1 Input MIDI 63 to Spectral Model

2 Spectral Model Reconstruction(trained on MIDI63)

3 Spectral Model Reconstruction(not trained on MIDI63)

4 Input MIDI 60 note to Parametric Model

5 Parametric Reconstruction of input note

6 Input MIDI 63 Note

7 Parametric AE reconstruction of input

8 Parametric CVAE reconstruction of input

9 Input MIDI 65 note(endpoint trained model)

10 Parametric AE reconstruction of input(endpoint trained model)

11 Parametric CVAE reconstruction of input(endpoint trained model)

12 CVAE Generated MIDI 65 Violin note

13 Similar MIDI 65 Violin note from dataset

3636

Audio examples description II

14 CVAE Reconstruction of the MIDI 65 violin note

15 CVAE Generated MIDI 65 Violin note with vibrato

  • Notes
      1. fdrm17
      2. fdrm16
      3. fdrm15
      4. fdrm14
      5. fdrm13
      6. fdrm12
      7. fdrm11
      8. fdrm10
      9. fdrm9
      10. fdrm8
      11. fdrm7
      12. fdrm6
      13. fdrm5
      14. fdrm3
      15. fdrm1
Page 77: A Variational Parametric Model for Audio Synthesiskrishnasubramani/data/ddp_stage… · analysis-synthesis based reconstruction which fails to model temporal evolution I[Engel et

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3036

Putting it all togetherI Our model is still far from perfect however and we have to

improve it further

1 An immediate thing to do will be to model the residual of theinput audio This will help in making the generated audio soundmore natural

2 We have not taken into account dynamics A more completesystem will involve capturing spectral envelope(timbral)dependencies on both pitch and loudness dynamics

3 Conducting more formal listening tests involving synthesis oflarger pitch movements or melodic elements from IndianClassical music

I Our contributions

1 To the best of our knowledge we have not come across anywork using a parametric model for musical tones in the neuralsynthesis framework especially exploiting the conditioningfunction of the CVAE

2 The TAE method has no known PYTHON implementation sowe plan to make our code open-source to the MIR communityfor research

3136

References I

[Blaauw and Bonada 2016] Blaauw M and Bonada J (2016)Modeling and transforming speech using variational autoencodersIn Interspeech pages 1770ndash1774

[Caetano and Rodet 2012] Caetano M and Rodet X (2012)A source-filter model for musical instrument sound transformationIn 2012 IEEE International Conference on Acoustics Speech and Signal Processing(ICASSP) pages 137ndash140 IEEE

[Doersch 2016] Doersch C (2016)Tutorial on variational autoencodersarXiv preprint arXiv160605908

[Engel et al 2017] Engel J Resnick C Roberts A Dieleman S Norouzi MEck D and Simonyan K (2017)Neural audio synthesis of musical notes with wavenet autoencodersIn Proceedings of the 34th International Conference on Machine Learning-Volume70 pages 1068ndash1077 JMLR org

[Esling et al 2018] Esling P Bitton A et al (2018)Generative timbre spaces regularizing variational auto-encoders with perceptualmetricsarXiv preprint arXiv180508501

3236

References II

[Griffin and Lim 1984] Griffin D and Lim J (1984)Signal estimation from modified short-time fourier transformIEEE Transactions on Acoustics Speech and Signal Processing 32(2)236ndash243

[Hinton and Salakhutdinov 2006] Hinton G E and Salakhutdinov R R (2006)Reducing the dimensionality of data with neural networksscience 313(5786)504ndash507

[IMAI 1979] IMAI S (1979)Spectral envelope extraction by improved cepstrumIEICE 62217ndash228

[Kingma and Ba 2014] Kingma D P and Ba J (2014)Adam A method for stochastic optimizationarXiv preprint arXiv14126980

[Kingma and Welling 2013] Kingma D P and Welling M (2013)Auto-encoding variational bayesarXiv preprint arXiv13126114

[Oord et al 2016] Oord A v d Dieleman S Zen H Simonyan K Vinyals OGraves A Kalchbrenner N Senior A and Kavukcuoglu K (2016)Wavenet A generative model for raw audioarXiv preprint arXiv160903499

3336

References III

[Roche et al., 2018] Roche, F., Hueber, T., Limier, S., and Girin, L. (2018). Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models. arXiv preprint arXiv:1806.04096.

[Roebel and Rodet, 2005] Roebel, A. and Rodet, X. (2005). Efficient spectral envelope estimation and its application to pitch shifting and envelope preservation. In International Conference on Digital Audio Effects, pages 30–35, Madrid, Spain.

[Romani Picas et al., 2015] Romani Picas, O., Parra Rodriguez, H., Dabiri, D., Tokuda, H., Hariya, W., Oishi, K., and Serra, X. (2015). A real-time system for measuring sound goodness in instrumental sounds. In Audio Engineering Society Convention 138. Audio Engineering Society.

[Sarroff and Casey, 2014] Sarroff, A. M. and Casey, M. A. (2014). Musical audio synthesis using autoencoding neural nets. In ICMC.

[Serra et al., 1997] Serra, X. et al. (1997). Musical sound modeling with sinusoids plus noise. Musical Signal Processing, pages 91–122.

34/36

References IV

[Slawson, 1981] Slawson, W. (1981). The color of sound: a theoretical study in musical timbre. Music Theory Spectrum, 3:132–141.

[Sohn et al., 2015] Sohn, K., Lee, H., and Yan, X. (2015). Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483–3491.

[Wyse, 2018] Wyse, L. (2018). Real-valued parametric conditioning of an RNN for interactive sound synthesis. arXiv preprint arXiv:1805.10808.

35/36

Audio examples description I

1 Input MIDI 63 to Spectral Model

2 Spectral Model Reconstruction (trained on MIDI 63)

3 Spectral Model Reconstruction (not trained on MIDI 63)

4 Input MIDI 60 note to Parametric Model

5 Parametric Reconstruction of input note

6 Input MIDI 63 Note

7 Parametric AE reconstruction of input

8 Parametric CVAE reconstruction of input

9 Input MIDI 65 note (endpoint-trained model)

10 Parametric AE reconstruction of input (endpoint-trained model)

11 Parametric CVAE reconstruction of input (endpoint-trained model)

12 CVAE Generated MIDI 65 Violin note

13 Similar MIDI 65 Violin note from dataset

36/36

Audio examples description II

14 CVAE Reconstruction of the MIDI 65 violin note

15 CVAE Generated MIDI 65 Violin note with vibrato
