
Neural Architectures for Music Representation Learning

Sanghyuk Chun, Clova AI Research

Contents


- Understanding audio signals
- Front-end and back-end framework for audio architectures
- Powerful front-end with Harmonic filter banks
  - [ISMIR 2019 Late Break Demo] Automatic Music Tagging with Harmonic CNN.
  - [ICASSP 2020] Data-driven Harmonic Filters for Audio Representation Learning.
  - [SMC 2020] Evaluation of CNN-based Automatic Music Tagging Models.
- Interpretable back-end with self-attention mechanism
  - [ICML 2019 Workshop] Visualizing and Understanding Self-attention based Music Tagging.
  - [ArXiv 2019] Toward Interpretable Music Tagging with Self-attention.
- Conclusion

Understanding audio signals: raw audio, spectrogram, Mel filter banks


Understanding audio signals in the time domain.


[0.001, -0.002, -0.005, -0.004, -0.003, -0.003, -0.003, -0.002, -0.001, …]

A “waveform” shows the amplitude of the input signal across time. How can we capture “frequency” information?

Understanding audio signals in the frequency domain.


Time-amplitude => frequency-amplitude, via the Fourier transform.
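As a concrete illustration, here is a minimal NumPy sketch of that move from the time-amplitude view to the frequency-amplitude view; the 440 Hz test tone and the 11025 Hz sampling rate are simply the values that appear elsewhere in these slides.

```python
import numpy as np

sr = 11025                        # sampling rate, as used later in the slides
t = np.arange(sr) / sr            # one second of sample times
y = np.sin(2 * np.pi * 440 * t)   # a pure 440 Hz tone in the time-amplitude view

spectrum = np.abs(np.fft.rfft(y))            # frequency-amplitude view
freqs = np.fft.rfftfreq(len(y), d=1.0 / sr)  # frequency axis in Hz

print(freqs[np.argmax(spectrum)])  # 440.0: the spectral peak sits at the tone's frequency
```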

Understanding audio signals in the time-frequency domain.


Types of audio inputs:
- Raw audio waveform
- Linear spectrogram
- Log-scale spectrogram
- Mel spectrogram
- Constant-Q transform (CQT)

Human perception of audio: log-scale.


Octaves are perceived as equal steps, e.g., 220 Hz → 440 Hz → 880 Hz (A3 → A4 → A5) and 146.83 Hz → 293.66 Hz → 587.33 Hz (D3 → D4 → D5).

Mel filter banks: Log-scale filter bank.
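As a rough sketch of what a “log-scale filter bank” means in practice: the mel scale maps frequency onto a perceptually motivated axis, and a mel filter bank is a set of triangular filters spaced evenly on that axis. The HTK-style formula below is one common variant (librosa's default filter bank uses the slightly different Slaney formulation); the sampling rate and FFT size are the ones used in the shape comparison that follows.

```python
import numpy as np
import librosa

# One common mel mapping (the HTK formula): mel(f) = 2595 * log10(1 + f / 700)
def hz_to_mel(f_hz):
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

# Doubling the frequency adds fewer and fewer mels relative to the added Hz:
for f in (500, 1000, 2000, 4000):
    print(f, round(hz_to_mel(f)))   # roughly 607, 1000, 1521, 2146 mel

# Triangular mel filter bank matrix: (n_mels, n_fft // 2 + 1) = (128, 1025)
mel_fb = librosa.filters.mel(sr=11025, n_fft=2048, n_mels=128)
print(mel_fb.shape)
```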


Mel-spectrogram.


Raw audio
- Input shape: 1D, sampling rate × audio length = 11025 × 30 ≈ 330K samples
- Information density: very sparse along the time axis (needs a very large receptive field); 1 second = "sampling rate" samples
- Loss: no loss (if SR > Nyquist rate)

Linear spectrogram
- Input shape: 2D, (# FFT bins / 2 + 1) × # frames = (2048 / 2 + 1) × 1255 = [1025, 1255]; related to many hyperparameters (hop size, window size, …)
- Information density: sparse along the frequency axis (needs a large receptive field); each time bin has 1025 dimensions
- Loss: “resolution”

Mel-spectrogram
- Input shape: 2D, (# mel bins) × # frames = [128, 1255]
- Information density: less sparse along the frequency axis; each time bin has 128 dimensions
- Loss: “resolution” + Mel filter
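A minimal librosa sketch that reproduces the shapes in this comparison. The hop length of 256 samples is an assumption made for illustration (the slides only note that the frame count depends on hop and window size), so the exact number of frames need not be 1255.

```python
import numpy as np
import librosa

sr = 11025
y = np.random.randn(sr * 30).astype(np.float32)   # stand-in for a 30 s clip: 330,750 samples ("330K")

# Linear (magnitude) spectrogram: (n_fft // 2 + 1) x n_frames
linear = np.abs(librosa.stft(y, n_fft=2048, hop_length=256))
print(linear.shape)    # (1025, ...) -- frame count depends on the assumed hop length

# Mel-spectrogram: n_mels x n_frames
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048, hop_length=256, n_mels=128)
print(mel.shape)       # (128, ...)
```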

Bonus: MFCC (Mel-Frequency Cepstral Coefficient).


Mel-spectrogram, 2D: (# mel bins) × # frames = [128, 1255]
→ DCT (Discrete Cosine Transform) →
MFCC, 2D: (# MFCCs) × # frames = [20, 1255]

Frequently used in the speech domain. It is a very lossy representation for high-level music representation learning (cf. SIFT in computer vision).
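For completeness, a small sketch of the MFCC pipeline above with librosa (which internally takes a DCT of the log mel-spectrogram); as before, the hop length is an assumed value.

```python
import numpy as np
import librosa

sr = 11025
y = np.random.randn(sr * 30).astype(np.float32)   # stand-in clip, as in the earlier sketch

# 20 MFCCs per frame, i.e. a DCT-compressed summary of the 128 mel bins
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_fft=2048, hop_length=256, n_mels=128, n_mfcc=20)
print(mfcc.shape)      # (20, n_frames): far coarser than the mel-spectrogram it came from
```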

Front-end and back-end framework: fully convolutional neural network baseline; rethinking the convolutional neural network; front-end and back-end framework


Fully convolutional neural network baseline for automatic music tagging.

[ISMIR 2016] "Automatic tagging using deep convolutional neural networks.", Keunwoo Choi, George Fazekas, Mark Sandler

Rethinking the CNN as a feature extractor plus a non-linear classifier. [ISMIR 2016] "Automatic tagging using deep convolutional neural networks.", Keunwoo Choi, George Fazekas, Mark Sandler

Early convolutional layers capture low-level features (timbre, pitch), deeper layers capture high-level features (rhythm, tempo), and the final layers act as a non-linear classifier.
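To make the "feature extractor + non-linear classifier" reading concrete, here is a hedged PyTorch sketch of a fully convolutional tagger over a mel-spectrogram. The layer widths and pooling sizes are placeholders, not the exact FCN configuration of Choi et al.

```python
import torch
import torch.nn as nn

class SimpleFCNTagger(nn.Module):
    """Illustrative fully convolutional tagger over a (1, n_mels, n_frames) mel-spectrogram."""
    def __init__(self, n_tags=50):
        super().__init__()
        def block(c_in, c_out, pool):
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                nn.BatchNorm2d(c_out), nn.ReLU(), nn.MaxPool2d(pool))
        # Early blocks: local, low-level patterns (timbre, pitch);
        # later blocks see wider receptive fields (rhythm, tempo).
        self.features = nn.Sequential(
            block(1, 32, (2, 4)), block(32, 64, (2, 4)),
            block(64, 128, (2, 4)), block(128, 128, (2, 4)))
        # Non-linear classifier on top of the extracted features.
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, n_tags), nn.Sigmoid())

    def forward(self, x):
        return self.classifier(self.features(x))

probs = SimpleFCNTagger()(torch.randn(2, 1, 128, 1255))  # two mel-spectrograms in a batch
print(probs.shape)                                        # torch.Size([2, 50]) tag probabilities
```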

Front-end and back-end framework

Input → Filter banks (time-frequency feature extraction) → Front-end (local feature extraction) → Back-end (temporal summarization & classification) → Output

Representative choices for each stage (filter banks / front-end / back-end):

- Mel-filter banks / CNN / RNN: [ICASSP 2017] "Convolutional recurrent neural networks for music classification.", Choi, et al. (sketched in code below)
- Mel-filter banks / CNN / CNN: [ISMIR 2018] "End-to-end learning for music audio tagging at scale.", Pons, et al.
- Fully-learnable / CNN / CNN: [SMC 2017] "Sample-level Deep Convolutional Neural Networks for Music Auto-tagging Using Raw Waveforms.", Lee, et al.
- Partially-learnable / CNN / CNN: [ICASSP 2020] "Data-driven Harmonic Filters for Audio Representation Learning.", Won, et al.
- Mel-filter banks / CNN / Self-attention: [ArXiv 2019] "Toward Interpretable Music Tagging with Self-attention.", Won, et al.
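As a concrete (and heavily simplified) example of the first row above, the mel-filter banks / CNN / RNN combination can be sketched in PyTorch as follows; all sizes are illustrative placeholders rather than any published configuration.

```python
import torch
import torch.nn as nn

# Front-end: local feature extraction over a (1, n_mels, n_frames) mel-spectrogram
frontend = nn.Sequential(
    nn.Conv2d(1, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d((128, 4)),           # collapse frequency, downsample time
)

backend = nn.GRU(input_size=64, hidden_size=64, batch_first=True)   # temporal summarization
classifier = nn.Sequential(nn.Linear(64, 50), nn.Sigmoid())         # multi-label tag output

mel = torch.randn(2, 1, 128, 1255)        # output of the (mel) filter-bank stage
feat = frontend(mel)                      # (2, 64, 1, 313): local features over time
seq = feat.squeeze(2).transpose(1, 2)     # (2, 313, 64): a sequence for the back-end
_, h = backend(seq)                       # summarize the whole sequence
print(classifier(h[-1]).shape)            # torch.Size([2, 50])
```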

Powerful front-end with Harmonic filter banks:
- [ISMIR 2019 Late Break Demo] Automatic Music Tagging with Harmonic CNN.
- [ICASSP 2020] Data-driven Harmonic Filters for Audio Representation Learning.
- [SMC 2020] Evaluation of CNN-based Automatic Music Tagging Models.


Motivation: Data-driven, but human-guided

- Traditional methods: Mel-filter banks / MFCC / SVM (hand-crafted features with strong human priors for the classification pipeline).
- Recent methods: fully-learnable filters / CNN / CNN (a data-driven approach without any human prior).

DATA-DRIVEN HARMONIC FILTERS

Data-driven filter banks


In a mel filter bank, the center frequencies f(m) are pre-defined values that depend on the sampling rate, FFT size, number of mel bins, and so on.

The proposed data-driven filter is instead parameterized by:
- fc: center frequency
- BW: bandwidth

The bandwidth is derived from the equivalent rectangular bandwidth (ERB) scale, with a trainable Q factor.
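For reference, the ERB scale of Glasberg and Moore and the kind of Q-scaled bandwidth the slide is pointing at can be written as below; this is a reconstruction from the ERB definition, so the paper's exact parameterization may differ in its details.

```latex
% Equivalent rectangular bandwidth (Glasberg & Moore), f in Hz:
%   ERB(f) = 24.7 \left( 4.37 \tfrac{f}{1000} + 1 \right) = 0.1079\, f + 24.7
% A bandwidth with a single trainable quality factor Q (a sketch of the idea):
\[
  \mathrm{BW}(f_c) = \frac{\alpha f_c + \beta}{Q},
  \qquad \alpha = 0.1079, \quad \beta = 24.7,
\]
% so training Q scales the ERB-shaped bandwidth curve up or down.
```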

Harmonic filters (n = 1, 2, 3, 4)

Output of harmonic filters: the n-th filter (n = 1, 2, 3, 4) is centered at the n-th harmonic of the fundamental frequency (n = 1 is the fundamental itself).

Harmonic tensors
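A minimal PyTorch sketch of how such a harmonic tensor could be assembled: triangular band-pass filters centered at n·fc for each harmonic n, with an ERB-shaped bandwidth scaled by a single trainable Q, applied to a linear spectrogram. This is an assumption-laden illustration (fixed center frequencies, one global Q, simple triangular responses), not the authors' implementation.

```python
import torch
import torch.nn as nn

class HarmonicFilterbank(nn.Module):
    """Sketch: triangular filters at harmonics n * fc with ERB-style, Q-scaled bandwidths."""
    def __init__(self, center_freqs_hz, n_harmonics=4, sr=11025, n_fft=2048):
        super().__init__()
        fc = torch.tensor(center_freqs_hz, dtype=torch.float32)        # (n_filters,)
        n = torch.arange(1, n_harmonics + 1, dtype=torch.float32)      # harmonic indices
        self.register_buffer("centers", n[:, None] * fc[None, :])      # (n_harmonics, n_filters)
        self.register_buffer("fft_freqs", torch.linspace(0.0, sr / 2, n_fft // 2 + 1))
        self.Q = nn.Parameter(torch.tensor(1.0))                       # the trainable Q

    def forward(self, spec):  # spec: (batch, n_fft // 2 + 1, n_frames)
        bw = (0.1079 * self.centers + 24.7) / self.Q.clamp(min=1e-3)   # ERB-shaped bandwidth
        dist = (self.fft_freqs[None, None, :] - self.centers[..., None]).abs()
        fb = (1.0 - dist / bw[..., None]).clamp(min=0.0)               # (n_harm, n_filt, n_bins)
        # Harmonic tensor: one channel per harmonic, filters along the frequency axis
        return torch.einsum("hfb,nbt->nhft", fb, spec)

# 128 center frequencies (an assumed spacing) applied to a (1025 x 1255) linear spectrogram
fbank = HarmonicFilterbank(torch.linspace(40.0, 1300.0, 128).tolist())
harmonic_tensor = fbank(torch.rand(2, 1025, 1255))
print(harmonic_tensor.shape)   # torch.Size([2, 4, 128, 1255])
```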


Harmonic CNN


Back-end


Experiments

             Music tagging     Keyword spotting    Sound event detection
# data       21k audio clips   106k audio clips    53k audio clips
# classes    50                35                  17
Task         Multi-labeled     Single-labeled      Multi-labeled

- Music tagging: tags such as “techno”, “beat”, “no voice”, “fast”, “dance”, … Many tags are highly related to harmonic structure, e.g., timbre, genre, instruments, mood, …
- Keyword spotting: labels such as “wow” (others: “yes”, “no”, “one”, “four”, …). Harmonic characteristics are a well-known important feature for speech recognition (cf. MFCC).
- Sound event detection: labels such as “Ambulance (siren)” and “Civil defense siren” (others: “train horn”, “Car”, “Fire truck”, …). Non-music, non-verbal audio signals are expected to have “inharmonic” features.

Experiments

Compared models (Filters / Front-end / Back-end):
- Mel-spectrogram / CNN / CNN
- Fully-learnable / CNN / CNN
- Partially-learnable / CNN / CNN
- Mel-spectrogram / CNN / Attention RNN
- Linear spectrogram, Mel spectrogram, MFCC / Gated-CRNN

Effect of harmonics


Harmonic CNN is an efficient and effective architecture for music representation learning.


All models can be reproduced with the following repository: https://github.com/minzwon/sota-music-tagging-models

Harmonic CNN also generalizes better to realistic noise than the other methods.


Interpretable back-end with self-attention mechanism:
- [ICML 2019 Workshop] Visualizing and Understanding Self-attention based Music Tagging.
- [ArXiv 2019] Toward Interpretable Music Tagging with Self-attention.


Recall: Front-end and back-end framework

Input → Filter banks (time-frequency feature extraction) → Front-end (local feature extraction) → Back-end (temporal summarization & classification) → Output

What’s happening in the back-end?

A CNN-based back-end cannot capture the “temporal property” of music: repeated convolution and pooling loosens locality. [ISMIR 2016] "Automatic tagging using deep convolutional neural networks.", Keunwoo Choi, George Fazekas, Mark Sandler

A self-attention back-end can instead capture long-term relationships.
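A hedged sketch of what a self-attention back-end looks like in PyTorch: a generic Transformer encoder over the front-end's feature sequence, mean-pooled for the tag output. The actual model's configuration differs, so treat the sizes and the pooling here as placeholders.

```python
import torch
import torch.nn as nn

d_model, n_tags = 64, 50
frontend_features = torch.randn(2, 313, d_model)   # (batch, time steps, channels) from a CNN front-end

# Back-end: every time step can attend to every other one (long-term relationships)
encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
backend = nn.TransformerEncoder(encoder_layer, num_layers=2)

attended = backend(frontend_features)               # (2, 313, 64)
pooled = attended.mean(dim=1)                       # simple temporal summarization
tags = torch.sigmoid(nn.Linear(d_model, n_tags)(pooled))
print(tags.shape)                                   # torch.Size([2, 50])
```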


Experiments (music tagging)


Self-attention back-end for better interpretability

Attention score visualization.

My network says this clip has the “quiet” tag. But why?

Back-end as a “sound” detector: positive-tag case, e.g., “Piano”, “Drum”.

Back-end as a “sound” detector: negative-tag case, e.g., “No vocal”, “Quiet”.

Tag-wise heatmap contribution


Set the selected attention weight to 1 and all the others to 0, then observe how each tag's confidence changes.
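A rough sketch of this probe, assuming direct access to an attention weight matrix and a linear tag classifier (both hypothetical stand-ins here): force all attention onto one chosen time step and read off how the tag confidences respond.

```python
import torch

def probe_time_step(attn_weights, values, classifier, step):
    """attn_weights: (T, T) attention matrix; values: (T, D) front-end features."""
    forced = torch.zeros_like(attn_weights)
    forced[:, step] = 1.0                       # selected attention weight -> 1, all others -> 0
    summary = (forced @ values).mean(dim=0)     # re-pool the sequence under the forced attention
    return torch.sigmoid(classifier(summary))   # tag confidences under this intervention

# Hypothetical usage with placeholder tensors and a linear head over 50 tags
T, D, n_tags = 313, 64, 50
confidences = probe_time_step(torch.rand(T, T), torch.randn(T, D), torch.nn.Linear(D, n_tags), step=42)
print(confidences.shape)    # torch.Size([50]); e.g. compare the “Quiet” vs. “Loud” confidences
```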

Examples: confidence of “Quiet” vs. “Loud”, and confidence of “Piano” vs. “Flute”.

Conclusion

- Powerful front-end with Harmonic CNN
- Interpretable back-end with self-attention

References

- [ISMIR 2019 Late Break Demo] Automatic Music Tagging with Harmonic CNN. Minz Won, Sanghyuk Chun, Oriol Nieto, Xavier Serra

- [ICASSP 2020] Data-driven Harmonic Filters for Audio Representation Learning. Minz Won, Sanghyuk Chun, Oriol Nieto, Xavier Serra

- [SMC 2020] Evaluation of CNN-based Automatic Music Tagging Models. Minz Won, Andres Ferraro, Dmitry Bogdanov, Xavier Serra

- [ICML 2019 Workshop] Visualizing and Understanding Self-attention based Music Tagging. Minz Won, Sanghyuk Chun, Xavier Serra

- [ArXiv 2019] Toward Interpretable Music Tagging with Self-attention. Minz Won, Sanghyuk Chun, Xavier Serra