A Neural Network Alternative to Non-negative Audio Models
Paris Smaragdis#*, Shrikant Venkataramani*
*University of Illinois at Urbana-Champaign, #Adobe Research
ICASSP 2017
Motivation
• Supervised single-channel source separation
  • Using models trained from clean sounds
• Two dominant approaches
  • Non-negative Matrix Factorization (NMF): reusable and interpretable models
  • Deep learning: state-of-the-art results, but non-transferable models
• Neural network formulation of NMF models
  • Maintaining reusability while taking advantage of deep-learning structures
Transferable Models
• Being able to plug in trained models
[Figure: comparing model reuse in NMF and a DNN (labels: H, NMF, DNN)]
Learning an NMF model
• Learning spectral bases from spectrograms
[Figure: magnitude spectrogram X factored into spectral bases W and activations H]

X = W · H,  where X, W, H ∈ ℝ₊
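The factorization above is typically learned with multiplicative updates. A minimal NumPy sketch of the standard Lee-Seung updates for the generalized KL divergence (not the authors' code; `nmf_kl` and its parameters are illustrative):

```python
import numpy as np

def nmf_kl(X, rank, n_iter=200, eps=1e-9, seed=0):
    """Lee-Seung multiplicative updates minimizing KL(X || W·H).

    X: non-negative matrix (freq x time), e.g. a magnitude spectrogram.
    Returns non-negative factors W (freq x rank) and H (rank x time).
    """
    rng = np.random.default_rng(seed)
    F, T = X.shape
    W = rng.random((F, rank)) + eps
    H = rng.random((rank, T)) + eps
    ones = np.ones_like(X)
    for _ in range(n_iter):
        V = W @ H + eps                       # current reconstruction
        W *= ((X / V) @ H.T) / (ones @ H.T + eps)
        V = W @ H + eps
        H *= (W.T @ (X / V)) / (W.T @ ones + eps)
    return W, H
```

Because the updates are multiplicative and the initialization is positive, W and H stay non-negative throughout.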
NMF in action
• Analyzing piano notes
NMF as an Auto-encoder
[Figure: auto-encoder with analysis weights W∗ mapping X to H, and synthesis weights W mapping H back to X̂]

NMF:  X = W · H
NAE:  H = W∗ · X  such that H ≥ 0
      X̂ = W · H  such that W ≥ 0
Removing Non-negativity constraints
• Enforcing non-negativity is cumbersome
• Non-negative layer outputs
  • Result in a magnitude spectrogram at the output
  • Result in a non-negative activation at the hidden layer
  • Can be enforced with an activation function

g(x) = max(x, 0)  or  |x|  or  ln(1 + eˣ)

H = g(W∗ · X),  X̂ = g(W · H)
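The forward pass above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation; `nae_forward`, the weight names, and the choice of softplus for g are assumptions:

```python
import numpy as np

def softplus(x):
    # smooth non-negative activation: ln(1 + e^x)
    return np.log1p(np.exp(x))

def nae_forward(X, W_enc, W_dec):
    """One-layer non-negative auto-encoder forward pass.

    W_enc plays the role of W* (analysis), W_dec of W (synthesis).
    The softplus keeps H and X_hat non-negative even though the
    weights themselves are unconstrained.
    """
    H = softplus(W_enc @ X)        # non-negative activations
    X_hat = softplus(W_dec @ H)    # non-negative reconstruction
    return H, X_hat
```

Unlike NMF, nothing constrains the weights here; only the layer outputs are forced non-negative by the activation function.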
NAE in action
• Bases are not guaranteed to be non-negative
  • Results in dense activations
• Adding sparsity to the hidden-layer output
  • Results in sparse activations
  • Intuitive bases and activations

Without sparsity: KL(X ‖ g(W · H))
With sparsity:    KL(X ‖ g(W · H)) + λ‖H‖₁
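The two training costs above (generalized KL divergence, optionally with an L1 sparsity penalty on H) can be written directly. A sketch under the assumption that `X_hat` is the network output g(W·H); `kl_cost`, `lam`, and `eps` are illustrative names:

```python
import numpy as np

def kl_cost(X, X_hat, H=None, lam=0.0, eps=1e-9):
    """Generalized KL divergence KL(X || X_hat), optionally adding an
    L1 sparsity penalty lam * ||H||_1 on the hidden activations."""
    kl = np.sum(X * np.log((X + eps) / (X_hat + eps)) - X + X_hat)
    if H is not None:
        kl += lam * np.abs(H).sum()
    return kl
```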
Advantages of NAE
• Difficult to extend NMF models
• Easy to do with neural nets
  • LSTMs, CNNs, multi-layer networks, etc.
[Figures: a Recurrent NAE (X, W∗, H, W, X̂) and a Multi-layer NAE (X, W∗1, W∗2, H1, H2, W2, W1, X̂)]
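A multi-layer NAE is just a stack of these non-negative layers. A minimal sketch (illustrative; `multilayer_nae` and the softplus choice are assumptions, not the authors' code):

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))

def multilayer_nae(X, enc_weights, dec_weights):
    """Multi-layer NAE: a stack of softplus layers for the encoder
    (W*_1, W*_2, ...) and decoder (W_2, W_1, ...).  Returns the deepest
    hidden activation and the reconstruction, both non-negative."""
    H = X
    for W in enc_weights:           # encoder: H_k = g(W*_k · H_{k-1})
        H = softplus(W @ H)
    X_hat = H
    for W in dec_weights:           # decoder mirrors the encoder
        X_hat = softplus(W @ X_hat)
    return H, X_hat
```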
NMF source separation
• Spectrogram of the mixture is the sum of the source spectrograms

Estimate models for each source (W1 from training data for source 1, W2 from training data for source 2), then explain the contribution of each source in an unknown mixture using the trained models:

Estimate Hx1, Hx2 such that
  X = W1 · Hx1 + W2 · Hx2
  X1 = W1 · Hx1
  X2 = W2 · Hx2
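The mixture-fitting step can be sketched by reusing the KL multiplicative update for H alone, with the trained bases held fixed. An illustrative NumPy sketch (`fit_activations` and `separate` are assumed names, not the authors' code):

```python
import numpy as np

def fit_activations(X, W, n_iter=200, eps=1e-9, seed=0):
    """With bases W fixed, estimate non-negative activations H that
    explain X under KL(X || W·H) (multiplicative H-updates only)."""
    rng = np.random.default_rng(seed)
    H = rng.random((W.shape[1], X.shape[1])) + eps
    ones = np.ones_like(X)
    for _ in range(n_iter):
        V = W @ H + eps
        H *= (W.T @ (X / V)) / (W.T @ ones + eps)
    return H

def separate(X_mix, W1, W2, **kw):
    """Supervised NMF separation: stack per-source bases, fit
    activations on the mixture, then split the reconstruction."""
    W = np.hstack([W1, W2])
    H = fit_activations(X_mix, W, **kw)
    H1, H2 = H[:W1.shape[1]], H[W1.shape[1]:]
    return W1 @ H1, W2 @ H2        # estimates of X1 and X2
```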
NAE source separation
• Estimate models for each source (from training data for source 1 and source 2), then explain the contribution of each source in an unknown mixture using the trained models
NAE source separation
• Goal: estimating network inputs instead of the weights
• Gradient descent / back-propagation to train the network
• Separated spectrograms:

X = g(W1 · Hx1) + g(W2 · Hx2)
X1 = g(W1 · Hx1)
X2 = g(W2 · Hx2)
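The "estimate the inputs" idea can be sketched with plain gradient descent on the hidden activations, holding the trained decoder weights fixed. An illustrative sketch only: it uses squared error instead of the paper's KL cost to keep the hand-written gradients short, and `separate_nae`, `lr`, and the softplus choice are assumptions:

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))

def sigmoid(x):
    # derivative of softplus
    return 1.0 / (1.0 + np.exp(-x))

def separate_nae(X, W1, W2, n_iter=1000, lr=0.005, seed=0):
    """Separate a mixture X by gradient descent on the hidden
    activations H1, H2 of two trained NAE decoders (W1, W2 fixed),
    fitting X ≈ g(W1·H1) + g(W2·H2)."""
    rng = np.random.default_rng(seed)
    H1 = rng.standard_normal((W1.shape[1], X.shape[1])) * 0.1
    H2 = rng.standard_normal((W2.shape[1], X.shape[1])) * 0.1
    for _ in range(n_iter):
        P1, P2 = W1 @ H1, W2 @ H2
        X_hat = softplus(P1) + softplus(P2)
        G = 2.0 * (X_hat - X)                 # d loss / d X_hat
        H1 -= lr * (W1.T @ (G * sigmoid(P1))) # back-prop through g
        H2 -= lr * (W2.T @ (G * sigmoid(P2)))
    return softplus(W1 @ H1), softplus(W2 @ H2)  # X1, X2 estimates
```

Note that only H1 and H2 are updated; the decoders learned from clean training data are reused untouched, which is exactly the transferability argument of the talk.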
Evaluation
• Two-speaker mixtures
  • Training data: ~20-25 seconds
  • Test data: a single sentence from known speakers
• Evaluation metrics
  • BSS_eval metrics (SDR, SIR, SAR)
  • STOI (intelligibility measure)
• Compared multi-layer and shallow versions
  • With multiple ranks (numbers of hidden units)
NMF vs NAE
• Number of NAE layers = 2
• Number of hidden units = 20
NMF vs NAE
• Number of NAE layers = 4
• Number of hidden units = 20
NMF vs NAE
• Number of NAE layers = 2
• Number of hidden units = 100
NMF vs NAE
• Number of NAE layers = 4
• Number of hidden units = 100
Shallow vs Multi-layer NAE
• Shallow NAEs give comparable performance over all ranks
• Multi-layer NAEs require higher ranks
Conclusions
• NAE models can replace NMF
  • This allows us to generalize to complex structures
• NAE models are superior to NMF models
  • Shallow NAEs are comparable to NMF
  • Multi-layer NAEs outperform NMF significantly
• Future directions
  • Incorporating exotic neural models (LSTMs, CNNs, etc.)
THANK YOU