Deep learning for music classification, 2016-05-24
Deep Learning for Music Classification
GCT634 Spring 2016, KAIST
Centre for Digital Music, Queen Mary University of London, UK
24 May 2016
1. Music classification
2. Data-driven approaches (Conventional ML; Deep Learning)
3. Reference
Music Classification
Definition: classify music items into certain categories (using audio content)
  Genre classification [3]: Rock/Jazz/Hiphop/Classical/...
  Instrument identification
  Music/speech segmentation
  Emotion recognition
  Automatic tagging
Music Classification
Feasibility: is this information really extractable from the audio signal?
  Genre: sound, playing style, chords, instruments, melody, ...
  Instrument: spectral and/or temporal patterns
  Music/speech: spectral and/or temporal patterns
  Emotion: sound/melody/lyrics/...
  Tags (instrument/era/emotion/activity/...): ...
Data-driven approaches: Conventional ML
Data-driven + domain knowledge: acoustic/musical features [3]
"We provide candidates and let the machine choose"
Data-driven approaches: Conventional ML
1/2: Feature selection
Any features that might be relevant to the classification:
  Spectral features: spectral rolloff, centroid, MFCC, ZCR, ...
  Rhythmic features: tempo, beat histogram
  Tonal features: key, pitch-class distribution, tonality
Data-driven approaches: Conventional ML
2/2: Classifiers
Classifiers select relevant features
  to map (aggregated N-dim feature) to (decision)
Classifiers are trained with data
After training, it is usually possible to score how relevant each feature is (for example, see the sketch below)
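A minimal illustration of such relevance scores, using a random forest from scikit-learn on toy stand-in data (the shapes follow the 62-dim track feature built on the next slide; the data itself is made up here):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy stand-ins: 200 tracks x 62 aggregated features, 4 genre labels
X_train = np.random.rand(200, 62)
y_train = np.random.randint(0, 4, size=200)

clf = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)
print(clf.feature_importances_)  # one relevance score per input feature; higher = more relevant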
Data-driven approaches: Conventional ML, a genre classification example
audio signal → STFT → MFCCs, spectral centroids → per-frame features → track feature
length=N → 256-by-100 → 30-by-100, 1-by-100 → 31-by-100 → 62-by-1
for x, y in training data:  # x: audio signal, y: genre label
    1. X = stft(x)                                 # (256, 100)
    2. x_mfccs = mfcc(X)                           # (30, 100)
       x_centroids = spectral_centroid(X)          # (1, 100)
    3. x_feats = concatenate(x_mfccs, x_centroids) # (31, 100): feature vectors for every frame in the track
    4. x_feat = concatenate(mean(x_feats), var(x_feats))  # (62, 1): feature vector of the whole track x
    train the classifier with (x_feat, y)
* Now, we have a system that maps audio signal → genre
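As a runnable counterpart to the pseudocode, here is a sketch using librosa (librosa computes the STFT of step 1 internally, and its default hop size gives a frame count other than the 100 shown above, which is fine):

import numpy as np
import librosa

def track_feature(x, sr):
    # Steps 2-4 above: per-frame features, then mean/var aggregation over time
    mfccs = librosa.feature.mfcc(y=x, sr=sr, n_mfcc=30)    # (30, n_frames)
    cents = librosa.feature.spectral_centroid(y=x, sr=sr)  # (1, n_frames)
    feats = np.concatenate([mfccs, cents], axis=0)         # (31, n_frames)
    # mean and variance of each dimension -> one (62,) vector per track
    return np.concatenate([feats.mean(axis=1), feats.var(axis=1)])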
Data-driven approaches: Conventional ML, a genre classification example
audio signal → [feature extraction] → x_feat → [trained classifier] → y (prediction)
For a new audio signal s:
  1. get s_feat
  2. predict the genre of the signal!
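Continuing the sketch above (track_feature from the previous snippet; training_data, sr, and s are assumed inputs, and the SVM is an arbitrary choice, the slides do not prescribe a classifier):

import numpy as np
from sklearn.svm import SVC

# training_data: list of (audio_signal, genre_label) pairs; sr: sample rate
X = np.stack([track_feature(x, sr) for x, _ in training_data])
y = [label for _, label in training_data]
clf = SVC().fit(X, y)

s_feat = track_feature(s, sr)    # 1. get s_feat for the new signal s
print(clf.predict([s_feat])[0])  # 2. predict the genre of the signal!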
Even more data-driven approaches: Deep Learning
"Why don't we optimise/automate more?"
Because hand-designing features is NOT optimised (and it is boring)
"We provide candidates and let the machine choose"
→ "Let the machine design and choose the features"
Even more data-driven approaches: Deep Learning
Deep Learning
Deep == more layers (of neural networks) == some layers serve as feature extractors, the others as classifiers (see the schematic sketch below)
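A schematic illustration (hypothetical Keras code, not from the slides; the input shape and layer sizes are arbitrary) of how the lower layers play the feature-extractor role and the top layers the classifier role:

from tensorflow.keras import layers, models

model = models.Sequential([
    # lower layers: learned feature extractors
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(96, 96, 1)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    # upper layers: the classifier built on those learned features
    layers.Flatten(),
    layers.Dense(10, activation='softmax'),  # e.g. 10 genres
])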
Even more data-driven approaches: Deep Learning
Machines might do better than humans
  they don't get bored, compute faster, are not biased, ...
Machines are more flexible than before
  both the classifier AND the feature extractor are learned
Machines need more examples to learn from than before
  because the number of parameters to learn increases
Humans still decide the structure and the input types
Even more data-driven approaches: Deep Learning
End-to-end learning for music audio, Sander Dieleman et al., ICASSP, 2014 [2]
Auto-tagging using deep convolutional neural networks, Keunwoo Choi et al., ISMIR, 2016 [1]
[1] Choi, K., Fazekas, G., Sandler, M.: Automatic tagging using deep convolutional neural networks. In: Proceedings of the 17th International Society for Music Information Retrieval Conference (ISMIR 2016), New York, USA (2016)
[2] Dieleman, S., Schrauwen, B.: End-to-end learning for music audio. In: Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pp. 6964-6968. IEEE (2014)
[3] Tzanetakis, G., Cook, P.: Musical genre classification of audio signals. IEEE Transactions on Speech and Audio Processing 10(5), 293-302 (2002)
Bonus 1: 6 selected pages from the slide deck on deep CNNs
Convolutional Neural Networks: A brief explanation
Centre for Digital Music, Queen Mary University of London, UK
CNNs: Convolutional Neural Networks
(Deep) Convolutional Neural Networks
  deep = cascaded
  convolutional = filters
  neural networks = things are learned
[Figures omitted; image credits: cns.org, AlexNet]
Hierarchical features
Hierarchical feature learning
  Each layer learns features at a different level of the hierarchy
  High-level features are built on low-level features
E.g.
  Layer 1: edges (low-level, concrete)
  Layer 2: simple shapes
  Layer 3: complex shapes
  Layer 4: more complex shapes
  Layer 5: shapes of target objects (high-level, abstract)
What is learned in CNNs? (in an image recognition task)
[Three slides of layer-wise feature visualisations from [11]]
CNN use-cases: Music information retrieval
Anything people can do by looking at spectrograms
  E.g. auto-tagging [1], chord recognition [5], instrument recognition [7], music-noise segmentation [8], onset detection [9], boundary detection [10]
  + style change? source separation? effects/de-effects?
Bonus 2: 11 selected pages from the slide deck on auto-tagging with CNNs
Automatic Tagging using Deep Convolutional Neural Networks [1]
Centre for Digital Music, Queen Mary University of London, UK
Introduction: Tagging
Tags
  Descriptive keywords that people put on music
  Multi-label nature, e.g. {rock, guitar, drive, 90's}
  Music tags include genres (rock, pop, alternative, indie), instruments (vocalists, guitar, violin), emotions (mellow, chill), activities (party, drive), and eras (00's, 90's, 80's)
  Collaboratively created (Last.fm) → noisy:
    false negatives
    synonyms (vocal/vocals/vocalist/vocalists/voice/voices, guitar/guitars)
    popularity bias
    typos (harpsicord)
    irrelevant tags (abcd, ilikeit, fav)
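The multi-label nature means each track maps to a binary vector over the tag vocabulary; a small sketch with scikit-learn (the tag sets are invented for illustration):

from sklearn.preprocessing import MultiLabelBinarizer

tag_sets = [{'rock', 'guitar', 'drive', "90's"},  # track 1
            {'indie', 'mellow', 'guitar'}]        # track 2
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(tag_sets)  # multi-hot matrix, shape (2, n_distinct_tags)
print(mlb.classes_)              # the learned tag vocabulary, sorted
print(Y)                         # one 0/1 row per track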
CNNs and Music: TF-representations
Options: STFT / mel-spectrogram / CQT / raw audio
  STFT: okay, but why not melgram?
  Melgram: efficient
  CQT: only if you're interested in fundamentals/pitches
  Raw audio: end-to-end setup (learn the transformation)
    has not outperformed melgram (yet) in speech/music
    perhaps the way to go in the future?
    we lose the frequency axis, though
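For reference, all four options are one call away in a library like librosa (a sketch; 'song.mp3' and the parameter values are placeholders):

import numpy as np
import librosa

x, sr = librosa.load('song.mp3')                           # raw audio: x itself
S = np.abs(librosa.stft(x, n_fft=1024))                    # STFT magnitude
M = librosa.feature.melspectrogram(y=x, sr=sr, n_mels=96)  # mel-spectrogram
C = np.abs(librosa.cqt(x, sr=sr))                          # constant-Q transform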
Problem definition
Automatic tagging
  Automatic tagging is a multi-label classification task
  K-dim binary label vector: up to 2^K possible combinations
  The majority of tag entries are False (whether or not the annotation is correct)
  Measured by AUC-ROC: the Area Under the Curve of the Receiver Operating Characteristic
[Figure: ROC curve, image from Kaggle]
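Because most entries in the tag matrix are False, plain accuracy is uninformative; AUC-ROC instead scores per-tag ranking quality. A toy sketch with scikit-learn (random data, purely illustrative):

import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(100, 50))  # 100 tracks x 50 binary tags
y_score = rng.random((100, 50))              # a model's per-tag scores
print(roc_auc_score(y_true, y_score, average='macro'))  # mean AUC over tags; ~0.5 for random scores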
The proposed architecture
4-layer fully convolutional network, FCN-4
The proposed architecture
                   FCN-5            FCN-6            FCN-7
  Mel-spectrogram (input: 96×1366×1)
  Conv 3×3×128
  MP (2, 4) (output: 48×341×128)
  Conv 3×3×256
  MP (2, 4) (output: 24×85×256)
  Conv 3×3×512
  MP (2, 4) (output: 12×21×512)
  Conv 3×3×1024
  MP (3, 5) (output: 4×4×1024)
  Conv 3×3×2048
  MP (4, 4) (output: 1×1×2048)
  ·                ·                Conv 1×1×1024    Conv 1×1×1024
  ·                ·                ·                Conv 1×1×1024
  Output 50×1 (sigmoid)
Table: The configurations of the 5-, 6-, and 7-layer architectures. The only difference is the number of additional 1×1 convolution layers.
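To make the table concrete, here is a minimal Keras sketch of the FCN-5 column (an illustration under assumptions: the table does not specify activations, padding, or normalisation, so the ReLUs and 'same' padding below are guesses, not the paper's exact recipe). The pooling comments show the resulting time-frequency sizes, which match the table:

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(128, (3, 3), padding='same', activation='relu',
                  input_shape=(96, 1366, 1)),         # mel-spectrogram input
    layers.MaxPooling2D((2, 4)),                      # -> 48 x 341
    layers.Conv2D(256, (3, 3), padding='same', activation='relu'),
    layers.MaxPooling2D((2, 4)),                      # -> 24 x 85
    layers.Conv2D(512, (3, 3), padding='same', activation='relu'),
    layers.MaxPooling2D((2, 4)),                      # -> 12 x 21
    layers.Conv2D(1024, (3, 3), padding='same', activation='relu'),
    layers.MaxPooling2D((3, 5)),                      # -> 4 x 4
    layers.Conv2D(2048, (3, 3), padding='same', activation='relu'),
    layers.MaxPooling2D((4, 4)),                      # -> 1 x 1
    layers.Conv2D(50, (1, 1), activation='sigmoid'),  # 50 tag scores, multi-label
    layers.Flatten(),                                 # -> (50,) per clip
])
model.compile(optimizer='adam', loss='binary_crossentropy')

FCN-6 and FCN-7 would insert one or two extra Conv2D(1024, (1, 1)) layers before the output, as in the table.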
Experiments and discussions: Overview
              MTT             MSD
  # tracks    25k             1M
  # songs     5-6k            1M
  Length      29.1s           30-60s
  Benchmarks  10+             0
  Labels      Tags, genres    Tags, genres, EchoNest features, bag-of-word lyrics, ...
Experiments and discussions: MagnaTagATune
At the same depth (l=4): melgram > MFCC > STFT
  melgram: 96 mel-frequency bins
  STFT: 128 frequency bins
  MFCC: 90 dims (30 MFCCs, 30 MFCC-deltas, 30 MFCC-delta-deltas)

  Methods                    AUC
  FCN-3, mel-spectrogram     .852
  FCN-4, mel-spectrogram     .894
  FCN-5, mel-spectrogram     .890
  FCN-4, STFT                .846
  FCN-4, MFCC                .862

The ConvNet outperformed MFCC. Still, with more data a ConvNet might learn a frequency aggregation that beats the fixed mel-frequency one; not here, though.
Experiments and discussions: MagnaTagATune
  Methods                    AUC
  FCN-3, mel-spectrogram     .852
  FCN-4, mel-spectrogram     .894
  FCN-5, mel-spectrogram     .890
  FCN-4, STFT                .846
  FCN-4, MFCC                .862

FCN-4 > FCN-3: depth worked!
FCN-4 > FCN-5, by .004
  a deeper model might catch up after ages of training
  deeper models require more data
  deeper models take more time (cf. deep residual networks [6])
Are 4 layers enough, or is it a matter of (data) size?
Experiments and discussions: Million Song Dataset
  Methods                    AUC
  FCN-3, mel-spectrogram     .786
  FCN-4, —                   .808
  FCN-5, —                   .848
  FCN-6, —                   .851
  FCN-7, —                   .845

FCN-3 < 4 < 5 < 6!
Deeper layers pay off, up to 6 layers in this case.
Conclusion
2D fully convolutional networks work well
Mel-spectrogram can be preferred to STFT
  until we have a HUGE dataset, so that the fixed mel-frequency aggregation can be replaced by a learned one
Bye bye, MFCC? In the near future, I guess
MIR can go deeper than now
  if we have bigger, better, stronger datasets
Q. How do ConvNets actually deal with spectrograms?
A. Stay tuned for this year's MLSP paper!