
Neural Architectures for Music Representation Learning

Sanghyuk Chun, Clova AI Research

Contents


- Understanding audio signals
- Front-end and back-end framework for audio architectures
- Powerful front-end with Harmonic filter banks
  - [ISMIR 2019 Late Break Demo] Automatic Music Tagging with Harmonic CNN.
  - [ICASSP 2020] Data-driven Harmonic Filters for Audio Representation Learning.
  - [SMC 2020] Evaluation of CNN-based Automatic Music Tagging Models.
- Interpretable back-end with self-attention mechanism
  - [ICML 2019 Workshop] Visualizing and Understanding Self-attention based Music Tagging.
  - [ArXiv 2019] Toward Interpretable Music Tagging with Self-attention.
- Conclusion

Understanding audio signals: raw audio, spectrogram, Mel filter banks


Understanding audio signals in the time domain.


[0.001, -0.002, -0.005, -0.004, -0.003, -0.003, -0.003, -0.002, -0.001, …]

A “waveform” shows the amplitude of the input signal across time. How can we capture “frequency” information?

Understanding audio signals in the frequency domain.


Time-amplitude => frequency-amplitude, via the Fourier transform.
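As a concrete illustration, here is a minimal NumPy sketch of that move from the time-amplitude view to the frequency-amplitude view; the 440 Hz test tone and the 11025 Hz sampling rate are simply the values that appear elsewhere in these slides.

```python
import numpy as np

sr = 11025                        # sampling rate, as used later in the slides
t = np.arange(sr) / sr            # one second of sample times
y = np.sin(2 * np.pi * 440 * t)   # a pure 440 Hz tone in the time-amplitude view

spectrum = np.abs(np.fft.rfft(y))            # frequency-amplitude view
freqs = np.fft.rfftfreq(len(y), d=1.0 / sr)  # frequency axis in Hz

print(freqs[np.argmax(spectrum)])  # 440.0: the spectral peak sits at the tone's frequency
```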

Understanding audio signals in the time-frequency domain.


Types of audio inputs:
- Raw audio waveform
- Linear spectrogram
- Log-scale spectrogram
- Mel spectrogram
- Constant-Q transform (CQT)

Human perception of audio: log-scale.


Octaves are perceived as equal steps, e.g., 220 Hz → 440 Hz → 880 Hz (A3 → A4 → A5) and 146.83 Hz → 293.66 Hz → 587.33 Hz (D3 → D4 → D5).

Mel filter banks: Log-scale filter bank.
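As a rough sketch of what a “log-scale filter bank” means in practice: the mel scale maps frequency onto a perceptually motivated axis, and a mel filter bank is a set of triangular filters spaced evenly on that axis. The HTK-style formula below is one common variant (librosa's default filter bank uses the slightly different Slaney formulation); the sampling rate and FFT size are the ones used in the shape comparison that follows.

```python
import numpy as np
import librosa

# One common mel mapping (the HTK formula): mel(f) = 2595 * log10(1 + f / 700)
def hz_to_mel(f_hz):
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

# Doubling the frequency adds fewer and fewer mels relative to the added Hz:
for f in (500, 1000, 2000, 4000):
    print(f, round(hz_to_mel(f)))   # roughly 607, 1000, 1521, 2146 mel

# Triangular mel filter bank matrix: (n_mels, n_fft // 2 + 1) = (128, 1025)
mel_fb = librosa.filters.mel(sr=11025, n_fft=2048, n_mels=128)
print(mel_fb.shape)
```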


Mel-spectrogram.


Raw audio
- Input shape: 1D, sampling rate × audio length = 11025 × 30 ≈ 330K samples
- Information density: very sparse along the time axis (needs a very large receptive field); 1 second = "sampling rate" samples
- Loss: no loss (if SR > Nyquist rate)

Linear spectrogram
- Input shape: 2D, (# FFT bins / 2 + 1) × # frames = (2048 / 2 + 1) × 1255 = [1025, 1255]; related to many hyperparameters (hop size, window size, …)
- Information density: sparse along the frequency axis (needs a large receptive field); each time bin has 1025 dimensions
- Loss: “resolution”

Mel-spectrogram
- Input shape: 2D, (# mel bins) × # frames = [128, 1255]
- Information density: less sparse along the frequency axis; each time bin has 128 dimensions
- Loss: “resolution” + Mel filter
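A minimal librosa sketch that reproduces the shapes in this comparison. The hop length of 256 samples is an assumption made for illustration (the slides only note that the frame count depends on hop and window size), so the exact number of frames need not be 1255.

```python
import numpy as np
import librosa

sr = 11025
y = np.random.randn(sr * 30).astype(np.float32)   # stand-in for a 30 s clip: 330,750 samples ("330K")

# Linear (magnitude) spectrogram: (n_fft // 2 + 1) x n_frames
linear = np.abs(librosa.stft(y, n_fft=2048, hop_length=256))
print(linear.shape)    # (1025, ...) -- frame count depends on the assumed hop length

# Mel-spectrogram: n_mels x n_frames
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048, hop_length=256, n_mels=128)
print(mel.shape)       # (128, ...)
```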

Bonus: MFCC (Mel-Frequency Cepstral Coefficient).


Mel-spectrogram, 2D: (# mel bins) × # frames = [128, 1255]
→ DCT (Discrete Cosine Transform) →
MFCC, 2D: (# MFCCs) × # frames = [20, 1255]

Frequently used in the speech domain. It is a very lossy representation for high-level music representation learning (cf. SIFT in computer vision).
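For completeness, a small sketch of the MFCC pipeline above with librosa (which internally takes a DCT of the log mel-spectrogram); as before, the hop length is an assumed value.

```python
import numpy as np
import librosa

sr = 11025
y = np.random.randn(sr * 30).astype(np.float32)   # stand-in clip, as in the earlier sketch

# 20 MFCCs per frame, i.e. a DCT-compressed summary of the 128 mel bins
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_fft=2048, hop_length=256, n_mels=128, n_mfcc=20)
print(mfcc.shape)      # (20, n_frames): far coarser than the mel-spectrogram it came from
```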

Front-end and back-end framework: fully convolutional neural network baseline; rethinking the convolutional neural network; front-end and back-end framework


Fully convolutional neural network baseline for automatic music tagging.

[ISMIR 2016] "Automatic tagging using deep convolutional neural networks.", Keunwoo Choi, George Fazekas, Mark Sandler

Rethinking the CNN as a feature extractor plus a non-linear classifier. [ISMIR 2016] "Automatic tagging using deep convolutional neural networks.", Keunwoo Choi, George Fazekas, Mark Sandler

Early convolutional layers capture low-level features (timbre, pitch), deeper layers capture high-level features (rhythm, tempo), and the final layers act as a non-linear classifier.
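To make the "feature extractor + non-linear classifier" reading concrete, here is a hedged PyTorch sketch of a fully convolutional tagger over a mel-spectrogram. The layer widths and pooling sizes are placeholders, not the exact FCN configuration of Choi et al.

```python
import torch
import torch.nn as nn

class SimpleFCNTagger(nn.Module):
    """Illustrative fully convolutional tagger over a (1, n_mels, n_frames) mel-spectrogram."""
    def __init__(self, n_tags=50):
        super().__init__()
        def block(c_in, c_out, pool):
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                nn.BatchNorm2d(c_out), nn.ReLU(), nn.MaxPool2d(pool))
        # Early blocks: local, low-level patterns (timbre, pitch);
        # later blocks see wider receptive fields (rhythm, tempo).
        self.features = nn.Sequential(
            block(1, 32, (2, 4)), block(32, 64, (2, 4)),
            block(64, 128, (2, 4)), block(128, 128, (2, 4)))
        # Non-linear classifier on top of the extracted features.
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, n_tags), nn.Sigmoid())

    def forward(self, x):
        return self.classifier(self.features(x))

probs = SimpleFCNTagger()(torch.randn(2, 1, 128, 1255))  # two mel-spectrograms in a batch
print(probs.shape)                                        # torch.Size([2, 50]) tag probabilities
```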

Front-end and back-end framework

Input → Filter banks (time-frequency feature extraction) → Front-end (local feature extraction) → Back-end (temporal summarization & classification) → Output

Representative choices for each stage (filter banks / front-end / back-end):

- Mel-filter banks / CNN / RNN: [ICASSP 2017] "Convolutional recurrent neural networks for music classification.", Choi, et al. (sketched in code below)
- Mel-filter banks / CNN / CNN: [ISMIR 2018] "End-to-end learning for music audio tagging at scale.", Pons, et al.
- Fully-learnable / CNN / CNN: [SMC 2017] "Sample-level Deep Convolutional Neural Networks for Music Auto-tagging Using Raw Waveforms.", Lee, et al.
- Partially-learnable / CNN / CNN: [ICASSP 2020] "Data-driven Harmonic Filters for Audio Representation Learning.", Won, et al.
- Mel-filter banks / CNN / Self-attention: [ArXiv 2019] "Toward Interpretable Music Tagging with Self-attention.", Won, et al.
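As a concrete (and heavily simplified) example of the first row above, the mel-filter banks / CNN / RNN combination can be sketched in PyTorch as follows; all sizes are illustrative placeholders rather than any published configuration.

```python
import torch
import torch.nn as nn

# Front-end: local feature extraction over a (1, n_mels, n_frames) mel-spectrogram
frontend = nn.Sequential(
    nn.Conv2d(1, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d((128, 4)),           # collapse frequency, downsample time
)

backend = nn.GRU(input_size=64, hidden_size=64, batch_first=True)   # temporal summarization
classifier = nn.Sequential(nn.Linear(64, 50), nn.Sigmoid())         # multi-label tag output

mel = torch.randn(2, 1, 128, 1255)        # output of the (mel) filter-bank stage
feat = frontend(mel)                      # (2, 64, 1, 313): local features over time
seq = feat.squeeze(2).transpose(1, 2)     # (2, 313, 64): a sequence for the back-end
_, h = backend(seq)                       # summarize the whole sequence
print(classifier(h[-1]).shape)            # torch.Size([2, 50])
```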

Powerful front-end with Harmonic filter banks:
- [ISMIR 2019 Late Break Demo] Automatic Music Tagging with Harmonic CNN.
- [ICASSP 2020] Data-driven Harmonic Filters for Audio Representation Learning.
- [SMC 2020] Evaluation of CNN-based Automatic Music Tagging Models.


Motivation: Data-driven, but human-guided

- Traditional methods: Mel-filter banks / MFCC / SVM (hand-crafted features with strong human priors for the classification pipeline).
- Recent methods: fully-learnable filters / CNN / CNN (a data-driven approach without any human prior).

DATA-DRIVEN HARMONIC FILTERS

Data-driven filter banks


In a mel filter bank, the center frequencies f(m) are pre-defined values that depend on the sampling rate, FFT size, number of mel bins, and so on.

The proposed data-driven filter is instead parameterized by:
- fc: center frequency
- BW: bandwidth

The bandwidth is derived from the equivalent rectangular bandwidth (ERB) scale, with a trainable Q factor.
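For reference, the ERB scale of Glasberg and Moore and the kind of Q-scaled bandwidth the slide is pointing at can be written as below; this is a reconstruction from the ERB definition, so the paper's exact parameterization may differ in its details.

```latex
% Equivalent rectangular bandwidth (Glasberg & Moore), f in Hz:
%   ERB(f) = 24.7 \left( 4.37 \tfrac{f}{1000} + 1 \right) = 0.1079\, f + 24.7
% A bandwidth with a single trainable quality factor Q (a sketch of the idea):
\[
  \mathrm{BW}(f_c) = \frac{\alpha f_c + \beta}{Q},
  \qquad \alpha = 0.1079, \quad \beta = 24.7,
\]
% so training Q scales the ERB-shaped bandwidth curve up or down.
```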

Harmonic filters (n = 1, 2, 3, 4)

Output of harmonic filters: the n-th filter (n = 1, 2, 3, 4) is centered at the n-th harmonic of the fundamental frequency (n = 1 is the fundamental itself).

Harmonic tensors
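A minimal PyTorch sketch of how such a harmonic tensor could be assembled: triangular band-pass filters centered at n·fc for each harmonic n, with an ERB-shaped bandwidth scaled by a single trainable Q, applied to a linear spectrogram. This is an assumption-laden illustration (fixed center frequencies, one global Q, simple triangular responses), not the authors' implementation.

```python
import torch
import torch.nn as nn

class HarmonicFilterbank(nn.Module):
    """Sketch: triangular filters at harmonics n * fc with ERB-style, Q-scaled bandwidths."""
    def __init__(self, center_freqs_hz, n_harmonics=4, sr=11025, n_fft=2048):
        super().__init__()
        fc = torch.tensor(center_freqs_hz, dtype=torch.float32)        # (n_filters,)
        n = torch.arange(1, n_harmonics + 1, dtype=torch.float32)      # harmonic indices
        self.register_buffer("centers", n[:, None] * fc[None, :])      # (n_harmonics, n_filters)
        self.register_buffer("fft_freqs", torch.linspace(0.0, sr / 2, n_fft // 2 + 1))
        self.Q = nn.Parameter(torch.tensor(1.0))                       # the trainable Q

    def forward(self, spec):  # spec: (batch, n_fft // 2 + 1, n_frames)
        bw = (0.1079 * self.centers + 24.7) / self.Q.clamp(min=1e-3)   # ERB-shaped bandwidth
        dist = (self.fft_freqs[None, None, :] - self.centers[..., None]).abs()
        fb = (1.0 - dist / bw[..., None]).clamp(min=0.0)               # (n_harm, n_filt, n_bins)
        # Harmonic tensor: one channel per harmonic, filters along the frequency axis
        return torch.einsum("hfb,nbt->nhft", fb, spec)

# 128 center frequencies (an assumed spacing) applied to a (1025 x 1255) linear spectrogram
fbank = HarmonicFilterbank(torch.linspace(40.0, 1300.0, 128).tolist())
harmonic_tensor = fbank(torch.rand(2, 1025, 1255))
print(harmonic_tensor.shape)   # torch.Size([2, 4, 128, 1255])
```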


Harmonic CNN


Back-end


Experiments

             Music tagging     Keyword spotting    Sound event detection
# data       21k audio clips   106k audio clips    53k audio clips
# classes    50                35                  17
Task         Multi-labeled     Single-labeled      Multi-labeled

- Music tagging: tags such as “techno”, “beat”, “no voice”, “fast”, “dance”, … Many tags are highly related to harmonic structure, e.g., timbre, genre, instruments, mood, …
- Keyword spotting: labels such as “wow” (others: “yes”, “no”, “one”, “four”, …). Harmonic characteristics are a well-known important feature for speech recognition (cf. MFCC).
- Sound event detection: labels such as “Ambulance (siren)” and “Civil defense siren” (others: “train horn”, “Car”, “Fire truck”, …). Non-music, non-verbal audio signals are expected to have “inharmonic” features.

Experiments

Compared models (Filters / Front-end / Back-end):
- Mel-spectrogram / CNN / CNN
- Fully-learnable / CNN / CNN
- Partially-learnable / CNN / CNN
- Mel-spectrogram / CNN / Attention RNN
- Linear spectrogram, Mel spectrogram, MFCC / Gated-CRNN

Effect of harmonics


Harmonic CNN is an efficient and effective architecture for music representation learning.


All models can be reproduced with the following repository: https://github.com/minzwon/sota-music-tagging-models

Harmonic CNN also generalizes better to realistic noise than the other methods.


Interpretable back-end with self-attention mechanism:
- [ICML 2019 Workshop] Visualizing and Understanding Self-attention based Music Tagging.
- [ArXiv 2019] Toward Interpretable Music Tagging with Self-attention.


Recall: Front-end and back-end framework

Input → Filter banks (time-frequency feature extraction) → Front-end (local feature extraction) → Back-end (temporal summarization & classification) → Output

What’s happening in the back-end?

A CNN-based back-end cannot capture the “temporal property” of music: repeated convolution and pooling loosens locality. [ISMIR 2016] "Automatic tagging using deep convolutional neural networks.", Keunwoo Choi, George Fazekas, Mark Sandler

A self-attention back-end can instead capture long-term relationships.
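A hedged sketch of what a self-attention back-end looks like in PyTorch: a generic Transformer encoder over the front-end's feature sequence, mean-pooled for the tag output. The actual model's configuration differs, so treat the sizes and the pooling here as placeholders.

```python
import torch
import torch.nn as nn

d_model, n_tags = 64, 50
frontend_features = torch.randn(2, 313, d_model)   # (batch, time steps, channels) from a CNN front-end

# Back-end: every time step can attend to every other one (long-term relationships)
encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
backend = nn.TransformerEncoder(encoder_layer, num_layers=2)

attended = backend(frontend_features)               # (2, 313, 64)
pooled = attended.mean(dim=1)                       # simple temporal summarization
tags = torch.sigmoid(nn.Linear(d_model, n_tags)(pooled))
print(tags.shape)                                   # torch.Size([2, 50])
```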


Experiments (music tagging)


Self-attention back-end for better interpretability

Attention score visualization.

My network says this clip has the “quiet” tag. But why?

Back-end as a “sound” detector: positive-tag case, e.g., “Piano”, “Drum”.

Back-end as a “sound” detector: negative-tag case, e.g., “No vocal”, “Quiet”.

Tag-wise heatmap contribution


Set the selected attention weight to 1 and all the others to 0, then observe how each tag's confidence changes.
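A rough sketch of this probe, assuming direct access to an attention weight matrix and a linear tag classifier (both hypothetical stand-ins here): force all attention onto one chosen time step and read off how the tag confidences respond.

```python
import torch

def probe_time_step(attn_weights, values, classifier, step):
    """attn_weights: (T, T) attention matrix; values: (T, D) front-end features."""
    forced = torch.zeros_like(attn_weights)
    forced[:, step] = 1.0                       # selected attention weight -> 1, all others -> 0
    summary = (forced @ values).mean(dim=0)     # re-pool the sequence under the forced attention
    return torch.sigmoid(classifier(summary))   # tag confidences under this intervention

# Hypothetical usage with placeholder tensors and a linear head over 50 tags
T, D, n_tags = 313, 64, 50
confidences = probe_time_step(torch.rand(T, T), torch.randn(T, D), torch.nn.Linear(D, n_tags), step=42)
print(confidences.shape)    # torch.Size([50]); e.g. compare the “Quiet” vs. “Loud” confidences
```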

Examples: confidence of “Quiet” vs. “Loud”, and confidence of “Piano” vs. “Flute”.

Conclusion

- Powerful front-end with Harmonic CNN
- Interpretable back-end with self-attention

References

- [ISMIR 2019 Late Break Demo] Automatic Music Tagging with Harmonic CNN. Minz Won, Sanghyuk Chun, Oriol Nieto, Xavier Serra

- [ICASSP 2020] Data-driven Harmonic Filters for Audio Representation Learning. Minz Won, Sanghyuk Chun, Oriol Nieto, Xavier Serra

- [SMC 2020] Evaluation of CNN-based Automatic Music Tagging Models. Minz Won, Andres Ferraro, Dmitry Bogdanov, Xavier Serra

- [ICML 2019 Workshop] Visualizing and Understanding Self-attention based Music Tagging. Minz Won, Sanghyuk Chun, Xavier Serra

- [ArXiv 2019] Toward Interpretable Music Tagging with Self-attention. Minz Won, Sanghyuk Chun, Xavier Serra