1 Machine learning for note onset detection. Alexandre Lacoste & Douglas Eck.
1
Machine learning for note onset detection.
Alexandre Lacoste & Douglas Eck
2
Outline
What is note onset detection and why is it useful?
Small review of the field
The details of the incredible algorithm
Results of the contest
Results of the custom dataset
3
What are note onsets?
Percussive instruments are modeled as shown (right).
Basic definition: the note onset is the time where the slope is the highest, during the attack time.
[Figure: envelope of a percussive note; axes: amplitude vs. time]
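The basic definition can be illustrated with a minimal sketch. The toy envelope (10 ms linear attack, exponential decay) and the function name `onset_at_max_slope` are assumptions made for illustration only; they are not taken from the slides.

```python
import numpy as np

def onset_at_max_slope(envelope, sample_rate):
    """Return the time (in seconds) where the envelope rises fastest."""
    slope = np.diff(envelope)              # slope[i] = envelope[i + 1] - envelope[i]
    return int(np.argmax(slope)) / sample_rate

# Toy percussive envelope (assumption): silence, attack starting at 0.1 s, then decay.
sr = 1000
t = np.arange(sr) / sr
attack = np.clip((t - 0.1) / 0.01, 0.0, 1.0)
envelope = np.where(t < 0.1, 0.0, attack * np.exp(-5.0 * (t - 0.1)))

onset_time = onset_at_max_slope(envelope, sr)   # close to 0.1 s
```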
4
More general definition
What happens if we have sounds that are not percussive (pitch changes, singing, vibrato, ...)?
Then we define onsets as unpredictable events.
If we cannot predict the future from information in the near past, then a new event has just arrived.
This is the definition used to label the onsets.
5
Onset detection is not trivial
Detecting percussive note onsets in monophonic songs is trivial.
But making it work for complex polyphonic music with singing is another story.
6
What can we do with a good note onset detector?
An onset detector is not directly useful by itself, but it is a component of many music algorithms.
Music transcription (from wave to MIDI)
Music editing (song segmentation)
Tempo tracking (with onsets, finding the tempo is much easier)
Musical fingerprinting (the onset trace can serve as a robust ID for fingerprinting)
7
Scheirer’s psychoacoustic experiment
Scheirer showed that only the envelopes of a few frequency bands are important for the rhythmic information.
By modulating a noise source with these envelopes, the song can be rebuilt and almost no rhythmic aspect is lost.
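The experiment can be approximated with a small noise-vocoder sketch. This is an illustration under assumptions, not Scheirer’s actual setup: the band edges, the brick-wall FFT filters, the 20 ms moving-average envelope and the name `noise_vocoder` are all hypothetical choices.

```python
import numpy as np

def noise_vocoder(signal, sample_rate, band_edges_hz, smooth_ms=20.0, seed=0):
    """Rebuild a signal from its band envelopes applied to band-limited noise."""
    n = len(signal)
    rng = np.random.default_rng(seed)
    freqs = np.fft.rfftfreq(n, d=1.0 / sample_rate)
    spectrum = np.fft.rfft(signal)
    noise_spectrum = np.fft.rfft(rng.standard_normal(n))
    width = max(1, int(sample_rate * smooth_ms / 1000.0))
    kernel = np.ones(width) / width                 # moving-average smoother
    out = np.zeros(n)
    for lo, hi in zip(band_edges_hz[:-1], band_edges_hz[1:]):
        mask = (freqs >= lo) & (freqs < hi)         # brick-wall band (assumption)
        band = np.fft.irfft(spectrum * mask, n)
        envelope = np.convolve(np.abs(band), kernel, mode="same")
        noise_band = np.fft.irfft(noise_spectrum * mask, n)
        out += envelope * noise_band                # noise modulated by the envelope
    return out

# One 50 ms tone burst in one second of silence.
sr = 8000
t = np.arange(sr) / sr
signal = np.zeros(sr)
signal[2000:2400] = np.sin(2 * np.pi * 440 * t[2000:2400])
rebuilt = noise_vocoder(signal, sr, band_edges_hz=[0, 200, 400, 800, 1600, 3200])
```

The rebuilt signal is noise, but its energy is still concentrated where the burst was, which is the rhythmic information the slide refers to.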
8
The Pre-Lacoste Model
Most onset detection algorithms use Scheirer’s model and use a filter to find positive slopes. For example: $I_i - I_{i-1}$
Then, they use a peak-picking algorithm to find the onset position.
This method is fast and simple, and it works fine for monophonic percussive songs.
But it gets very poor results on complex polyphonic music with singing.
And it is very sensitive to parameter adjustment.
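A minimal sketch of this kind of baseline, assuming band envelopes are already available as an array. The half-wave rectification, the sum over bands, the hand-set threshold and the name `baseline_onsets` are illustrative assumptions; published detectors differ in these details.

```python
import numpy as np

def baseline_onsets(envelopes, threshold):
    """envelopes: array of shape (n_bands, n_frames). Returns onset frame indices."""
    diff = np.diff(envelopes, axis=1)               # first difference along time, per band
    rising = np.maximum(diff, 0.0).sum(axis=0)      # keep positive slopes, sum over bands
    # Peak picking: local maxima above a hand-tuned threshold.
    peaks = [i for i in range(1, len(rising) - 1)
             if rising[i] > threshold
             and rising[i] >= rising[i - 1]
             and rising[i] > rising[i + 1]]
    return [p + 1 for p in peaks]                   # diff[i] is the step into frame i + 1

env = np.zeros((2, 50))
env[:, 10:] += 1.0      # both bands jump at frame 10
env[0, 30:] += 0.5      # band 0 jumps again at frame 30
onsets = baseline_onsets(env, threshold=0.25)
```

Raising the threshold above 0.5 in this toy case drops the second onset, which hints at the parameter sensitivity mentioned above.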
9
The information is mainly local in time
Why not apply a simple feed-forward neural network directly to all the inputs of the window,
and just ask whether there is an onset at this position?
Finally, we repeat this for every time step.
10
The algorithm can be split into 3 main steps
1. Get the spectrogram of the song
2. Convolve a feed-forward neural network across the spectrogram
3. Find the onset locations
11
SPECTROGRAMS
Many different time-frequency representations might be useful for this task. Let’s explore some of them.
1. Short-time Fourier transform (STFT)
2. Constant-Q transform
3. Phase plane of STFT
12
Short-time Fourier Transform
[Figure: STFT spectrogram] The yellow curve represents the onset times.
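For reference, a minimal STFT sketch in NumPy. The Hann window, frame length and hop size are assumptions chosen for the example (a hop of 40 samples at 8 kHz gives 200 frames per second); the slides do not state the settings used.

```python
import numpy as np

def stft(signal, frame_len, hop):
    """Return a complex spectrogram of shape (frame_len // 2 + 1, n_frames)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1).T

sr = 8000
t = np.arange(sr) / sr
signal = np.sin(2 * np.pi * 1000 * t)          # 1 kHz test tone
spec = stft(signal, frame_len=256, hop=40)
magnitude = np.abs(spec)                       # magnitude plane
phase = np.angle(spec)                         # phase plane
```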
13
Constant-Q Transform
The constant-Q transform has a logarithmic frequency scale, which provides:
  a much better frequency resolution for lower frequencies
  a better time resolution for high frequencies
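The trade-off can be seen from the bin layout alone. The sketch below uses the usual constant-Q definitions (geometrically spaced center frequencies, analysis window length proportional to Q / f); the values of `f_min`, `bins_per_octave` and `sample_rate` are assumptions for illustration.

```python
import numpy as np

def constant_q_bins(f_min, bins_per_octave, n_bins, sample_rate):
    """Center frequencies (Hz) and analysis window lengths (samples) per bin."""
    k = np.arange(n_bins)
    centers = f_min * 2.0 ** (k / bins_per_octave)          # logarithmic spacing
    q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)        # constant f / bandwidth
    window_lengths = np.ceil(q * sample_rate / centers).astype(int)
    return centers, window_lengths

centers, window_lengths = constant_q_bins(f_min=55.0, bins_per_octave=12,
                                          n_bins=48, sample_rate=22050)
```

Low bins get long windows (fine frequency resolution) and high bins get short windows (fine time resolution).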
14
Can we do something with the phase plane?
The phase plane, without any manipulation, doesn’t seem to contain any information.
15
Phase Acceleration
Bello and Sandler [1] have found a way to use phase information for onset detection.
They take the principal argument of the phase acceleration:
$$\ddot{\varphi}_{n,k} = \mathrm{princarg}\left(\varphi_{n,k} - 2\varphi_{n-1,k} + \varphi_{n-2,k}\right)$$
Patterns not evident enough!
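A sketch of the phase acceleration as a wrapped second difference along time. The array layout `phase[k, n]` (frequency bin k, time frame n) and the helper names are assumptions of this example.

```python
import numpy as np

def princarg(phase):
    """Wrap phase values into the interval [-pi, pi)."""
    return (phase + np.pi) % (2 * np.pi) - np.pi

def phase_acceleration(phase):
    """phase[k, n]: bin k, frame n. Wrapped second difference along time."""
    return princarg(phase[:, 2:] - 2 * phase[:, 1:-1] + phase[:, :-2])

# Stationary sinusoids: the phase advances by a fixed amount per frame in each bin,
# so the acceleration is (numerically) zero everywhere.
k = np.arange(6)[:, None]
n = np.arange(20)[None, :]
steady_phase = princarg(0.7 * k * n + 0.3)
acc = phase_acceleration(steady_phase)
```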
16
Phase frequency difference
Instead, if we simply take the difference along the frequency axis, we get interesting patterns:
$$\Delta\varphi_{n,k} = \mathrm{princarg}\left(\varphi_{n,k} - \varphi_{n,k-1}\right)$$
Results show performance equivalent to the magnitude plane, using only the phase.
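The same idea as a short sketch: a wrapped first difference between neighbouring frequency bins. The array layout `phase[k, n]` (bin k, frame n) and the random test data are assumptions of this example.

```python
import numpy as np

def princarg(phase):
    """Wrap phase values into the interval [-pi, pi)."""
    return (phase + np.pi) % (2 * np.pi) - np.pi

def phase_frequency_difference(phase):
    """phase[k, n]: bin k, frame n. Wrapped difference along the frequency axis."""
    return princarg(phase[1:, :] - phase[:-1, :])

rng = np.random.default_rng(0)
phase = np.angle(rng.standard_normal((8, 5)) + 1j * rng.standard_normal((8, 5)))
dphi = phase_frequency_difference(phase)
```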
17
Feed-Forward Neural Network
Remember, the algorithm is simply the FNN convolved across time and frequency.
The target is a mixture of thin Gaussians that represents the expectation of having an onset at time t.
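A sketch of how such a target could be built from labeled onset times. The Gaussian width (`sigma_s` = 10 ms), the unit peak height and the plain sum of components are assumptions; the slides only say that the Gaussians are thin.

```python
import numpy as np

def onset_target(onset_times, n_frames, frame_rate, sigma_s=0.01):
    """Sum of narrow Gaussians, one centered on each labeled onset time (seconds)."""
    t = np.arange(n_frames) / frame_rate
    target = np.zeros(n_frames)
    for onset in onset_times:
        target += np.exp(-0.5 * ((t - onset) / sigma_s) ** 2)
    return target

# Two labeled onsets, 2 s of audio at 200 frames per second.
target = onset_target([0.5, 1.2], n_frames=400, frame_rate=200)
```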
18
Net Inputs
For a decent spectrogram resolution
  Time: 200 bins/s
  Frequency: 200 bins
and a window width of 50 ms, we have 2000 input variables (10 time bins × 200 frequency bins).
This is too many!
We randomly sample 200 variables inside the window.
  Uniform distribution across frequency
  Gaussian distribution across time (more variables near the center)
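One way to realize this sampling is sketched below. The time spread (`time_sigma`, in frames), the clipping to the window, and the choice to draw the positions once and reuse them at every time step are assumptions; the function names are hypothetical.

```python
import numpy as np

def sample_input_positions(n_freq_bins=200, window_frames=10, n_inputs=200,
                           time_sigma=2.0, seed=0):
    """Draw (frequency bin, time offset) pairs inside the input window."""
    rng = np.random.default_rng(seed)
    freq_idx = rng.integers(0, n_freq_bins, size=n_inputs)          # uniform in frequency
    offsets = np.rint(rng.normal(0.0, time_sigma, size=n_inputs))   # Gaussian in time
    half = window_frames // 2
    time_offsets = np.clip(offsets, -half, half - 1).astype(int)    # stay inside the window
    return freq_idx, time_offsets

def extract_inputs(spectrogram, center_frame, freq_idx, time_offsets):
    """spectrogram[k, n]: bin k, frame n. Returns the sampled input vector."""
    frames = np.clip(center_frame + time_offsets, 0, spectrogram.shape[1] - 1)
    return spectrogram[freq_idx, frames]

freq_idx, time_offsets = sample_input_positions()
spectrogram = np.arange(200 * 400, dtype=float).reshape(200, 400)   # placeholder data
x = extract_inputs(spectrogram, center_frame=100,
                   freq_idx=freq_idx, time_offsets=time_offsets)
```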
19
Net Structure and Training
Two hidden layers
  20 units in the first layer
  15 units in the second layer
1 output neuron
Learning algorithm: Polak-Ribière version of conjugate gradient
K-fold cross-validation for performance estimation
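The forward pass of a network with this shape can be sketched as follows. The 200 inputs, tanh hidden units, sigmoid output and random initialization are assumptions, since the slides give only the layer sizes; conjugate-gradient training is not shown.

```python
import numpy as np

def init_network(n_inputs=200, hidden=(20, 15), seed=0):
    """Return a list of (weights, bias) pairs for layers 200 -> 20 -> 15 -> 1."""
    rng = np.random.default_rng(seed)
    sizes = (n_inputs, *hidden, 1)
    return [(rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    """x: array of shape (batch, n_inputs). Returns outputs in (0, 1)."""
    h = x
    for w, b in params[:-1]:
        h = np.tanh(h @ w + b)                      # hidden layers (assumed tanh)
    w, b = params[-1]
    return 1.0 / (1.0 + np.exp(-(h @ w + b)))       # output neuron (assumed sigmoid)

params = init_network()
n_params = sum(w.size + b.size for w, b in params)
out = forward(params, np.zeros((3, 200)))
```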
20
Net Output
Most peaks are really sharp and there is very low background noise.
Some peaks are smaller but can still be detected.
The precision is also very good.
21
Peak-Picking
The neural network only emphasizes the onsets.
We now have to find the location of each onset.
We simply apply a threshold.
  Positive crossing is the beginning
  Negative crossing is the end
  Location is the center of mass
The value of the threshold is learned by exhaustive search.
[Figure: network output peak with the threshold crossings labeled "beginning" and "end"]
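The thresholding rule above can be sketched as follows. Weighting the center of mass by the network output values and the name `pick_onsets` are assumptions of this example; the threshold is assumed to be positive.

```python
import numpy as np

def pick_onsets(output, threshold, frame_rate):
    """Return onset times (seconds) from a 1-D network output trace."""
    above = output > threshold
    padded = np.concatenate(([False], above, [False]))
    edges = np.diff(padded.astype(int))
    starts = np.flatnonzero(edges == 1)      # positive crossing: first frame above
    ends = np.flatnonzero(edges == -1)       # negative crossing: one past last frame above
    onsets = []
    for s, e in zip(starts, ends):
        idx = np.arange(s, e)
        center = np.sum(idx * output[s:e]) / np.sum(output[s:e])   # center of mass
        onsets.append(center / frame_rate)
    return onsets

output = np.zeros(100)
output[19:22] = [0.5, 1.0, 0.5]     # a symmetric peak centered on frame 20
output[60] = 0.9                    # a one-frame peak
onsets = pick_onsets(output, threshold=0.3, frame_rate=200)
```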
22
F-measure
To maximize performance, we want to find the maximum number of onsets (recall).
But we also want to minimize the number of spurious onsets (precision).
The F-measure offers a balance between the two.
$$P = \frac{n_{cd}}{N_{found}}, \qquad R = \frac{n_{cd}}{N_{target}}, \qquad F = \frac{2PR}{P + R}$$
Here $n_{cd}$ is the number of correctly detected onsets, $N_{found}$ the number of onsets found and $N_{target}$ the number of labeled onsets.
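A small sketch of the computation. The greedy one-to-one matching and the 50 ms tolerance are assumptions of this example and are not taken from the slides.

```python
def f_measure(detected, targets, tolerance=0.05):
    """Return (precision, recall, F) for onset times given in seconds."""
    remaining = sorted(targets)
    n_correct = 0
    for d in sorted(detected):
        # Match each detection to at most one unused target within the tolerance.
        match = next((t for t in remaining if abs(t - d) <= tolerance), None)
        if match is not None:
            remaining.remove(match)
            n_correct += 1
    precision = n_correct / len(detected) if detected else 0.0
    recall = n_correct / len(targets) if targets else 0.0
    f = 2 * precision * recall / (precision + recall) if n_correct else 0.0
    return precision, recall, f

# 2 of 3 detections are correct, 2 of 4 targets are found.
precision, recall, f = f_measure([0.10, 0.52, 0.90], [0.11, 0.50, 0.70, 1.30])
```

In this example P = 2/3, R = 1/2 and F = 4/7.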
23
MIREX 2005 Results
No other participants used machine learning.
With a simple FNN, we have a huge performance boost.
We also have the best balance between precision and recall.
24
Custom Dataset
For better tests, we built a custom dataset.
It is composed only of complex polyphonic songs with singing.
In total, there are 60 segments of 10 seconds each.
The onsets were all hand-labeled, using a graphical user interface.
27
Deceptively simple
Complex network structure does not help
Very simple structure still gets good performance
Only one neuron can get most of the performance

Units in 1st layer   Units in 2nd layer   F-measure (validation)
50                   30                   87±5
20                   15                   87±4
10                   5                    87±5
10                   0                    86±4
5                    0                    86±3
2                    0                    85±5
1                    0                    83±4
28
Conclusion
Applying machine learning to the onset detection problem is simple and very effective.
This provides an algorithm that is accurate and robust across a wide variety of songs.
It is not sensitive to hyper-parameter adjustment.
30
Results for Different Spectrograms
Phase acceleration (Bello and Sandler) is only slightly better than noise.
Phase frequency difference is almost as good as the magnitude plane but depends strongly on the spectral window width.
Constant-Q and STFT give the best results, provided the spectral window width is small enough.