Transcript of “Neural Architectures for Music Representation Learning”, Sanghyuk Chun, Clova AI Research

Page 1: Neural Architectures for Music Representation Learning · [ISMIR 2016] "Automatic tagging using deep convolutional neural networks.”, Keunwoo Choi, George Fazekas, Mark Sandler

Neural Architectures for Music Representation Learning

Sanghyuk Chun, Clova AI Research

Page 2: Contents

- Understanding audio signals

- Front-end and back-end framework for audio architectures

- Powerful front-end with Harmonic filter banks

- [ISMIR 2019 Late Break Demo] Automatic Music Tagging with Harmonic CNN.

- [ICASSP 2020] Data-driven Harmonic Filters for Audio Representation Learning.

- [SMC 2020] Evaluation of CNN-based Automatic Music Tagging Models.

- Interpretable back-end with self-attention mechanism

- [ICML 2019 Workshop] Visualizing and Understanding Self-attention based Music Tagging.

- [ArXiv 2019] Toward Interpretable Music Tagging with Self-attention.

- Conclusion

Page 3: Understanding audio signals
Raw audio · Spectrogram · Mel filter bank

Page 4: Understanding audio signals in the time domain.

[0.001, -0.002, -0.005, -0.004, -0.003, -0.003, -0.003, -0.002, -0.001, …]

The “waveform” shows the “magnitude” of the input signal across time. How can we capture “frequency” information?
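The bracketed numbers above are exactly what a waveform is in code: a 1-D array of amplitude samples. A minimal numpy sketch (the 440 Hz sine and the 11025 Hz rate are illustrative stand-ins for audio loaded from a file):

```python
import numpy as np

# A raw audio waveform is a 1-D array of amplitude samples over time.
# Synthesize one second of a 440 Hz sine at an 11025 Hz sampling rate.
sr = 11025                        # samples per second
t = np.arange(sr) / sr            # timestamp of each sample
wave = 0.5 * np.sin(2 * np.pi * 440 * t)

print(wave.shape)                 # (11025,): one amplitude value per sample
print(np.round(wave[:4], 3))      # first few samples, as on the slide
```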

Page 5: Understanding audio signals in the frequency domain.

Time-amplitude => Frequency-amplitude
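The time-to-frequency step can be sketched with a plain FFT (numpy only; the 440 Hz tone is an illustrative input):

```python
import numpy as np

# Time-amplitude -> frequency-amplitude: an FFT of a 440 Hz sine puts
# almost all magnitude in the bin at ~440 Hz.
sr = 11025
wave = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)   # 1 s test tone

magnitude = np.abs(np.fft.rfft(wave))                 # amplitude per frequency bin
freqs = np.fft.rfftfreq(len(wave), d=1 / sr)          # bin index -> Hz
peak_hz = freqs[np.argmax(magnitude)]
print(peak_hz)  # 440.0
```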

Page 6: Understanding audio signals in the time-frequency domain.

Types of audio inputs:
- Raw audio waveform
- Linear spectrogram
- Log-scale spectrogram
- Mel spectrogram
- Constant-Q transform (CQT)
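A time-frequency representation chops the signal into short windows and transforms each one. A minimal STFT sketch producing a linear spectrogram (the window and hop sizes are illustrative hyperparameters, not the talk's settings):

```python
import numpy as np

def stft_magnitude(wave, n_fft=2048, hop=512):
    """Slide a window across the waveform and FFT each frame."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack([wave[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)).T   # (freq bins, time frames)

sr = 11025
wave = np.sin(2 * np.pi * 440 * np.arange(3 * sr) / sr)  # 3 s test tone
spec = stft_magnitude(wave)
print(spec.shape)   # (1025, 61): n_fft // 2 + 1 frequency bins
```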

Page 7: Human perception of audio: log-scale.

220 Hz, 440 Hz, 880 Hz

146.83 Hz, 293.66 Hz, 587.33 Hz
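The frequencies above can be checked with one line of arithmetic: each octave step is a doubling, which is why equal perceptual steps are equal ratios (a log scale), not equal differences:

```python
# Octaves are frequency doublings: the same perceived interval covers
# 220 Hz at the bottom of one series and 440 Hz one octave up.
a_series = [220.0 * 2 ** n for n in range(3)]    # A3, A4, A5
d_series = [146.83 * 2 ** n for n in range(3)]   # D3, D4, D5

print(a_series)    # [220.0, 440.0, 880.0]
print(d_series)    # [146.83, 293.66, 587.32]
```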

Page 8: Mel filter banks: a log-scale filter bank.
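One common formula behind mel filter banks (the HTK convention; the slide does not commit to a specific variant) maps Hz to mel roughly linearly below ~1 kHz and logarithmically above, matching perception:

```python
import numpy as np

# HTK-style mel scale: one of several conventions in common use.
def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

print(hz_to_mel(0.0))                # 0.0
print(round(float(hz_to_mel(700.0)), 2))  # 781.17 (= 2595 * log10(2))
```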

Page 9: Mel-spectrogram.

Raw audio:
- Input shape (1D): sampling rate × audio length = 11025 × 30 ≈ 330k
- Information density: very sparse along the time axis (needs a very large receptive field); 1 sec = sampling-rate samples
- Loss: no loss (if SR > Nyquist rate)

Linear spectrogram:
- Input shape (2D): (# FFT / 2 + 1) × # frames = (2048 / 2 + 1) × 1255 = [1025, 1255]; related to many hyperparameters (hop size, window size, …)
- Information density: sparse along the frequency axis (needs a large receptive field); each time bin has 1025 dims
- Loss: “resolution”

Mel-spectrogram:
- Input shape (2D): (# mel bins) × # frames = [128, 1255]
- Information density: less sparse along the frequency axis; each time bin has 128 dims
- Loss: “resolution” + mel filter
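The shape arithmetic above can be reproduced directly (hop and window sizes are free hyperparameters; the 1255-frame count corresponds to one particular hop choice, so it is not derived here):

```python
# Input-size arithmetic from the comparison above.
sr, seconds = 11025, 30
n_samples = sr * seconds
print(n_samples)            # 330750 (~330k) raw-audio samples

n_fft = 2048
freq_bins = n_fft // 2 + 1
print(freq_bins)            # 1025 linear-spectrogram frequency bins

n_mels = 128                # mel filters compress 1025 bins down to 128
print([n_mels, 1255])       # mel-spectrogram shape [128, 1255]
```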

Page 10: Bonus: MFCC (Mel-Frequency Cepstral Coefficients).

Mel-spectrogram, 2D: (# mel bins) × # frames = [128, 1255]

→ DCT (Discrete Cosine Transform) →

MFCC, 2D: (# MFCCs) × # frames = [20, 1255]

Frequently used in the speech domain. A very lossy representation for high-level music representation learning (cf. SIFT in computer vision).
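A sketch of the DCT step (a bare DCT-II with no windowing or normalization conventions, so a simplification of real MFCC pipelines):

```python
import numpy as np

# Keep only the first 20 DCT-II coefficients of a log-mel frame; the
# discarded coefficients carry the fine spectral detail, which is why
# MFCCs are so lossy for music.
def dct2(x):
    n = len(x)
    k = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (2.0 * np.cos(np.pi * k * (2 * j + 1) / (2 * n))) @ x

log_mel_frame = np.log1p(np.arange(128, dtype=float))  # stand-in 128-bin frame
mfcc = dct2(log_mel_frame)[:20]
print(mfcc.shape)   # (20,): one frame column of the [20, 1255] MFCC matrix
```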

Page 11: Front-end and back-end framework
- Fully convolutional neural network baseline
- Rethinking convolutional neural networks
- Front-end and back-end framework

Page 12: Fully convolutional neural network baseline for automatic music tagging.
[ISMIR 2016] “Automatic tagging using deep convolutional neural networks.”, Keunwoo Choi, George Fazekas, Mark Sandler

Page 13: Rethinking the CNN as a feature extractor and a non-linear classifier.

Page 14: Rethinking the CNN as a feature extractor and a non-linear classifier.
[ISMIR 2016] “Automatic tagging using deep convolutional neural networks.”, Keunwoo Choi, George Fazekas, Mark Sandler

Low-level features (timbre, pitch) → high-level features (rhythm, tempo) → non-linear classifier

Page 15: Rethinking the CNN as a feature extractor and a non-linear classifier.

Page 16: Front-end and back-end framework
[Diagram] Input → Filter banks (time-frequency feature extraction) → Front-end (local feature extraction) → Back-end (temporal summarization & classification) → Output

Page 17: Front-end and back-end framework
[Diagram] Input → Filter banks → Front-end → Back-end → Output
- Mel filter banks + CNN front-end + RNN back-end: [ICASSP 2017] “Convolutional recurrent neural networks for music classification.”, Choi, et al.

Page 18: Front-end and back-end framework
[Diagram] Input → Filter banks → Front-end → Back-end → Output
- Mel filter banks + CNN front-end + RNN back-end
- Mel filter banks + CNN front-end + CNN back-end: [ISMIR 2018] “End-to-end learning for music audio tagging at scale.”, Pons, et al.

Page 19: Front-end and back-end framework
[Diagram] Input → Filter banks → Front-end → Back-end → Output
- Mel filter banks + CNN front-end + RNN back-end
- Mel filter banks + CNN front-end + CNN back-end: [ISMIR 2018] “End-to-end learning for music audio tagging at scale.”, Pons, et al.

Page 20: Front-end and back-end framework
[Diagram] Input → Filter banks → Front-end → Back-end → Output
- Mel filter banks + CNN front-end + RNN back-end
- Mel filter banks + CNN front-end + CNN back-end
- Fully-learnable filter banks + CNN front-end + CNN back-end: [SMC 2017] “Sample-level Deep Convolutional Neural Networks for Music Auto-tagging Using Raw Waveforms.”, Lee, et al.

Page 21: Front-end and back-end framework
[Diagram] Input → Filter banks → Front-end → Back-end → Output
- Mel filter banks + CNN front-end + RNN back-end
- Mel filter banks + CNN front-end + CNN back-end
- Fully-learnable filter banks + CNN front-end + CNN back-end
- Partially-learnable filter banks + CNN front-end + CNN back-end: [ICASSP 2020] “Data-driven Harmonic Filters for Audio Representation Learning.”, Won, et al.

Page 22: Front-end and back-end framework
[Diagram] Input → Filter banks → Front-end → Back-end → Output
- Mel filter banks + CNN front-end + RNN back-end
- Mel filter banks + CNN front-end + CNN back-end
- Fully-learnable filter banks + CNN front-end + CNN back-end
- Partially-learnable filter banks + CNN front-end + CNN back-end
- Mel filter banks + CNN front-end + self-attention back-end: [ArXiv 2019] “Toward Interpretable Music Tagging with Self-attention.”, Won, et al.

Page 23: Front-end and back-end framework
[Diagram] Input → Filter banks → Front-end → Back-end → Output
- Mel filter banks + CNN front-end + RNN back-end
- Mel filter banks + CNN front-end + CNN back-end
- Fully-learnable filter banks + CNN front-end + CNN back-end
- Partially-learnable filter banks + CNN front-end + CNN back-end
- Mel filter banks + CNN front-end + self-attention back-end

Page 24: Powerful front-end with Harmonic filter banks
- [ISMIR 2019 Late Break Demo] Automatic Music Tagging with Harmonic CNN.
- [ICASSP 2020] Data-driven Harmonic Filters for Audio Representation Learning.
- [SMC 2020] Evaluation of CNN-based Automatic Music Tagging Models.

Page 25: Motivation: data-driven, but human-guided
[Diagram] Input → Filter banks → Front-end → Back-end → Output
- Traditional methods: Mel filter banks + MFCC + SVM classification (hand-crafted features with a strong human prior)
- Recent methods: fully-learnable filter banks + CNN + CNN (data-driven approach without any human prior)

Page 26: Data-Driven Harmonic Filters

Page 27: Data-Driven Harmonic Filters

Page 28: Data-driven filter banks

f(m): pre-defined frequency values depending on sampling rate, FFT size, number of mel bins, …

The proposed data-driven filter is parameterized by:
- fc: center frequency
- BW: bandwidth

Page 29: Data-driven filter banks

Trainable Q! The bandwidth is derived from the equivalent rectangular bandwidth (ERB).
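A sketch of one such filter, assuming the triangular shape of mel banks but with a learnable center fc and a bandwidth tied to fc through Q (the names, the Q value, and the BW = fc / Q form are illustrative, not the paper's exact parameterization):

```python
import numpy as np

# One triangular band-pass filter with center fc and bandwidth bw.
# In training, fc and Q would be parameters updated by gradient descent.
def triangular_filter(freqs, fc, bw):
    # 1.0 at fc, falling linearly to 0 at fc +/- bw/2
    return np.clip(1.0 - np.abs(freqs - fc) / (bw / 2.0), 0.0, None)

sr, n_fft = 11025, 2048
freqs = np.linspace(0.0, sr / 2, n_fft // 2 + 1)   # FFT bin frequencies
Q = 4.0                                            # would be learned
fc = 440.0                                         # would be learned
resp = triangular_filter(freqs, fc, bw=fc / Q)
print(freqs[np.argmax(resp)])   # the FFT bin closest to fc
```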

Page 30: Data-Driven Harmonic Filters

Page 31: Harmonic filters

Filters for harmonics n = 1, 2, 3, 4.

Page 32: Output of harmonic filters

n = 1: fundamental frequency; n = 2: 2nd harmonic; n = 3: 3rd harmonic; n = 4: 4th harmonic of the fundamental.
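The harmonic centers are simple multiples of the fundamental; using D3 from the earlier octave slide as an example fundamental:

```python
# The n-th harmonic filter is centered at n * fc, so stacking n = 1..4
# captures the fundamental and its first harmonics.
fc = 146.83                                   # example fundamental (D3)
centers = [round(n * fc, 2) for n in range(1, 5)]
print(centers)   # [146.83, 293.66, 440.49, 587.32]
```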

Page 33: Harmonic tensors

Page 34: Harmonic CNN

Page 35: Back-end

Page 36: Experiments

          | Music tagging   | Keyword spotting | Sound event detection
# data    | 21k audio clips | 106k audio clips | 53k audio clips
# classes | 50              | 35               | 17
Task      | multi-label     | single-label     | multi-label

Page 37: Experiments: music tagging (21k clips, 50 classes, multi-label)

Example tags: “techno”, “beat”, “no voice”, “fast”, “dance”, …

Many tags are highly related to harmonic structure, e.g., timbre, genre, instruments, mood, …

Page 38: Experiments: keyword spotting (106k clips, 35 classes, single-label)

Example keyword: “wow” (other labels: “yes”, “no”, “one”, “four”, …)

Harmonic characteristics are a well-known important feature for speech recognition (cf. MFCC).

Page 39: Experiments: sound event detection (53k clips, 17 classes, multi-label)

Example labels: “Ambulance (siren)”, “Civil defense siren” (other labels: “train horn”, “car”, “fire truck”, …)

Non-music, non-verbal audio signals are expected to have “inharmonic” features.

Page 40: Experiments

Compared models (filters + front-end + back-end):
- Mel-spectrogram + CNN + CNN
- Fully-learnable + CNN + CNN
- Partially-learnable + CNN + CNN
- Mel-spectrogram + CNN + Attention / RNN
- Linear / mel spectrogram, MFCC + Gated-CRNN

Page 41: Effect of harmonics

Page 42: Harmonic CNN is an efficient and effective architecture for music representation.

All models can be reproduced with the following repository: https://github.com/minzwon/sota-music-tagging-models

Page 43: Harmonic CNN is more generalizable to realistic noise than other methods.

Page 44: Interpretable back-end with self-attention mechanism
- [ICML 2019 Workshop] Visualizing and Understanding Self-attention based Music Tagging.
- [ArXiv 2019] Toward Interpretable Music Tagging with Self-attention.

Page 45: Recall: front-end and back-end framework
[Diagram] Input → Filter banks → Front-end → Back-end → Output

Page 46: Recall: front-end and back-end framework
[Diagram] Input → Filter banks → Front-end → Back-end → Output

What’s happening in the back-end?

Page 47: A CNN-based back-end cannot capture “temporal property”.
[ISMIR 2016] “Automatic tagging using deep convolutional neural networks.”, Keunwoo Choi, George Fazekas, Mark Sandler

It loses locality.

Page 48: Self-attention back-end to capture long-term relationships
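A minimal single-head self-attention over front-end feature frames (a sketch, not the paper's exact Transformer block): every output frame is a softmax-weighted sum over all frames, so a long-term relation is one step away instead of many stacked convolutions.

```python
import numpy as np

# Single-head scaled dot-product self-attention over time frames.
def self_attention(x, w_q, w_k, w_v):
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[1])
    scores -= scores.max(axis=1, keepdims=True)     # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)   # rows sum to 1 over time
    return weights @ v, weights

rng = np.random.default_rng(0)
frames = rng.normal(size=(64, 32))                  # 64 time steps, 32 dims
w_q, w_k, w_v = (rng.normal(size=(32, 32)) * 0.1 for _ in range(3))
out, att = self_attention(frames, w_q, w_k, w_v)
print(out.shape, att.shape)   # (64, 32) (64, 64)
```

The (64, 64) attention matrix is exactly what the later visualization slides plot: which frames each output position listened to.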

Page 49: Experiments (music tagging)

Page 50: Self-attention back-end for better interpretability

Page 51: Self-attention back-end for better interpretability

Page 52: Self-attention back-end for better interpretability

Page 53: Attention score visualization.

My network says this clip has a “quiet” tag. But why?

Page 54: Attention score visualization.

Page 55: Back-end as a “sound” detector: positive-tag case
“Piano”, “Drum”

Page 56: Back-end as a “sound” detector: negative-tag case
“No vocal”, “Quiet”

Page 57: Tag-wise heatmap contribution

Set the selected attention weight to 1, while all others are set to 0.
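The probe can be sketched directly: force the back-end to "listen" to a single frame and read off the tag confidence. The classifier head below is a hypothetical stand-in for the trained one, so the numbers are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
frame_feats = rng.normal(size=(64, 16))   # front-end features per frame
tag_head = rng.normal(size=(16,))         # stand-in "quiet" classifier head

def confidence(attention):
    summary = attention @ frame_feats     # attention-weighted temporal summary
    return float(summary @ tag_head)      # pre-sigmoid tag score

for i in (0, 10):                         # probe two different frames
    probe = np.zeros(64)
    probe[i] = 1.0                        # only frame i contributes
    print(i, round(confidence(probe), 3))
```

Sweeping the probed frame across the clip yields the per-tag heatmaps shown on the following slides.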

Page 58: Tag-wise heatmap contribution

Confidence of “Quiet”

Confidence of “Loud”

Page 59: Tag-wise heatmap contribution

Confidence of “Piano”

Confidence of “Flute”

Page 60: Conclusion

Page 61: Powerful front-end with the Harmonic CNN; interpretable back-end with self-attention.

Page 62: References

- [ISMIR 2019 Late Break Demo] Automatic Music Tagging with Harmonic CNN, Minz Won, Sanghyuk Chun, Oriol Nieto, Xavier Serra

- [ICASSP 2020] Data-driven Harmonic Filters for Audio Representation Learning, Minz Won, Sanghyuk Chun, Oriol Nieto, Xavier Serra

- [SMC 2020] Evaluation of CNN-based Automatic Music Tagging Models, Minz Won, Andres Ferraro, Dmitry Bogdanov, Xavier Serra

- [ICML 2019 Workshop] Visualizing and Understanding Self-attention based Music Tagging, Minz Won, Sanghyuk Chun, Xavier Serra

- [ArXiv 2019] Toward Interpretable Music Tagging with Self-attention, Minz Won, Sanghyuk Chun, Xavier Serra