Machine learning for note onset detection. Alexandre Lacoste & Douglas Eck.


1

Machine learning for note onset detection

Alexandre Lacoste & Douglas Eck

2

Outline

What is note onset detection and why is it useful?
A short review of the field
The details of the incredible algorithm
Results of the contest
Results on the custom dataset

3

What are note onsets?

Percussive instruments are modeled as shown (right).

Basic definition: the note onset is the time at which the slope of the amplitude envelope is highest during the attack.

[Figure: amplitude versus time for a percussive note]
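As a rough illustration of the basic definition, the sketch below takes a sampled amplitude envelope and returns the time of its steepest rise. The function name, the toy envelope and the frame rate are assumptions made for this example, not part of the original slides.

```python
def onset_from_envelope(envelope, frame_rate):
    """Return the time (in seconds) of the steepest rise in an amplitude envelope."""
    if len(envelope) < 2:
        raise ValueError("need at least two envelope samples")
    # The first-order difference approximates the slope between frames.
    slopes = [b - a for a, b in zip(envelope, envelope[1:])]
    steepest = max(range(len(slopes)), key=slopes.__getitem__)
    # The slope between frames i and i+1 is attributed to their midpoint.
    return (steepest + 0.5) / frame_rate


# Toy percussive envelope: silence, fast attack, slow decay (assumed values).
envelope = [0.0, 0.0, 0.1, 0.9, 1.0, 0.8, 0.6, 0.4, 0.2, 0.1]
onset_time = onset_from_envelope(envelope, frame_rate=100.0)
```

This only works for a single, clean percussive note; it says nothing about the non-percussive cases discussed next.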

4

More general definition

What happens if the sounds are not percussive (pitch changes, singing, vibrato, …)?

We then define onsets as unpredictable events.

If the recent past does not let us predict what comes next, then a new event has just occurred.

This is the definition used to label the onsets.

5

Onset detection is not trivial

Detecting percussive note onsets in monophonic songs is trivial.

But making it work for complex polyphonic music with singing is another story.

6

What can we do with a good note onset detector?

An onset detector is not directly useful on its own, but it is present in many music algorithms:

Music transcription (from wave to MIDI)
Music editing (song segmentation)
Tempo tracking (with onsets, finding the tempo is much easier)
Musical fingerprinting (the onset trace can serve as a robust ID for fingerprinting)

7

Scheirer's psychoacoustic experiment

Scheirer showed that only the envelopes of a few frequency bands matter for rhythmic information.

When these envelopes are used to modulate a noise source, the song can be rebuilt with almost no loss of its rhythmic aspect.

8

The Pre-Lacoste Model

Most onset detection algorithms use Scheirer's model and apply a filter to find positive slopes. For example:

$I_i - I_{i-1}$

Then they use a peak-picking algorithm to find the onset positions.

This method is fast and simple, and it works fine for monophonic percussive songs.

But it gives very poor results on complex polyphonic music with singing.

And it is very sensitive to parameter adjustment.
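The classical pipeline described on this slide can be sketched roughly as follows, assuming the slope filter is a half-wave rectified first difference of a band envelope and the peak picker is a fixed-threshold local-maximum search. Both choices, and all names and values below, are illustrative assumptions rather than the exact methods reviewed.

```python
def positive_slope(envelope):
    """Half-wave rectified first difference: max(I[i] - I[i-1], 0)."""
    return [0.0] + [max(b - a, 0.0) for a, b in zip(envelope, envelope[1:])]


def pick_peaks(signal, threshold):
    """Indices of local maxima that exceed a fixed threshold."""
    peaks = []
    for i in range(1, len(signal) - 1):
        if signal[i] > threshold and signal[i - 1] < signal[i] >= signal[i + 1]:
            peaks.append(i)
    return peaks


# Toy envelope containing two percussive notes (assumed values).
envelope = [0.0, 0.0, 0.9, 1.0, 0.7, 0.4, 0.2, 1.1, 1.0, 0.6, 0.3]
detection = positive_slope(envelope)
onsets = pick_peaks(detection, threshold=0.5)
strict = pick_peaks(detection, threshold=0.95)
```

With the threshold at 0.5 both notes are found; raising it to 0.95 finds none, which illustrates the sensitivity to parameter adjustment mentioned above.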

9

The information is mainly local in time

Why not apply a simple feed-forward neural network directly to all the inputs in the window, and just ask whether there is an onset at this position?

Finally, we repeat this for every time step.

10

The algorithm can be split into 3 main steps

Get the spectrogram of the song

Convolve a feed-forward neural network across the spectrogram

Find the onset locations

11

SPECTROGRAMS

Many different time-frequency representations might be useful for this task. Let's explore some of them.

1. Short-time Fourier transform (STFT)

2. Constant-Q transform

3. Phase plane of the STFT

12

Short-time Fourier Transform

In the figure, the yellow curve represents the onset times.
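For readers who want to see what the magnitude plane of an STFT looks like in code, here is a minimal standard-library sketch. The Hann window, frame size, hop and test tone are assumptions for illustration; a real implementation would use an FFT rather than the direct DFT shown here.

```python
import cmath
import math


def stft_magnitude(samples, frame_size, hop):
    """Magnitude spectrogram: one list of frame_size // 2 + 1 bins per frame."""
    # A Hann window reduces leakage between frequency bins.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_size)
              for n in range(frame_size)]
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_size)]
        bins = []
        for k in range(frame_size // 2 + 1):
            # Direct DFT of bin k (slow, but dependency-free).
            acc = sum(x * cmath.exp(-2j * math.pi * k * n / frame_size)
                      for n, x in enumerate(frame))
            bins.append(abs(acc))
        frames.append(bins)
    return frames


sample_rate = 8000
tone = [math.sin(2 * math.pi * 1000 * n / sample_rate) for n in range(256)]
spec = stft_magnitude(tone, frame_size=64, hop=32)
# A 1000 Hz tone at 8000 Hz with 64-sample frames should peak in bin 8.
loudest_bin = max(range(len(spec[0])), key=spec[0].__getitem__)
```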

13

Constant-Q Transform

The constant-Q transform has a logarithmic frequency scale, which provides:

a much better frequency resolution at low frequencies
a better time resolution at high frequencies
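The resolution trade-off can be made concrete with the usual constant-Q layout, where bin k has centre frequency f_min * 2^(k/b) and an analysis window of about Q * sample_rate / f_k samples. The sketch below only computes this layout, not the transform itself; the parameter values are assumptions chosen for illustration.

```python
def constant_q_bins(f_min, bins_per_octave, n_bins, sample_rate):
    """Centre frequency and analysis-window length (samples) for each bin."""
    # Q is the fixed ratio of centre frequency to bandwidth.
    q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)
    layout = []
    for k in range(n_bins):
        freq = f_min * 2.0 ** (k / bins_per_octave)
        window = round(q * sample_rate / freq)
        layout.append((freq, window))
    return layout


layout = constant_q_bins(f_min=55.0, bins_per_octave=12, n_bins=49,
                         sample_rate=22050)
lowest, highest = layout[0], layout[-1]
```

The lowest bin uses a window roughly 16 times longer than the highest one (four octaves apart), which is where the finer frequency resolution at low frequencies and the finer time resolution at high frequencies come from.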

14

Can we do something with the phase plane?

The phase plane, without any manipulation, doesn't seem to contain any information.

15

Phase Acceleration

Bello and Sandler [1] found a way to use phase information for onset detection.

They take the principal argument of the phase acceleration:

$\alpha_{k,n} = \mathrm{princarg}\left[\varphi_{k,n} - 2\varphi_{k,n-1} + \varphi_{k,n-2}\right]$

where $\varphi_{k,n}$ is the STFT phase at frequency bin $k$ and time frame $n$.

The patterns are not evident enough!

16

Phase frequency difference

Instead, if we simply take the difference along the frequency axis, we get interesting patterns:

$\delta_{k,n} = \mathrm{princarg}\left[\varphi_{k,n} - \varphi_{k-1,n}\right]$

where $\varphi_{k,n}$ is the STFT phase at frequency bin $k$ and time frame $n$.

Results show performance equivalent to the magnitude plane, using only the phase.
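Both phase features can be sketched in a few lines, assuming the phase plane is stored as `phase[k][n]` with `k` the frequency bin and `n` the time frame. The indexing convention, the function names and the toy values are assumptions for this example.

```python
import math


def princarg(phase):
    """Wrap a phase value into the interval (-pi, pi]."""
    return phase - 2.0 * math.pi * math.ceil((phase - math.pi) / (2.0 * math.pi))


def phase_acceleration(phase):
    """Second difference along time: phase[k][n] - 2*phase[k][n-1] + phase[k][n-2]."""
    return [[princarg(row[n] - 2.0 * row[n - 1] + row[n - 2])
             for n in range(2, len(row))] for row in phase]


def phase_frequency_difference(phase):
    """First difference along frequency: phase[k][n] - phase[k-1][n]."""
    return [[princarg(cur - prev) for prev, cur in zip(phase[k - 1], phase[k])]
            for k in range(1, len(phase))]


# Two bins, four frames. Bin 0 advances linearly (a steady tone);
# bin 1 jumps in its last frame (toy values).
phase = [[0.0, 1.0, 2.0, 3.0],
         [0.5, 1.5, 2.5, 6.0]]
accel = phase_acceleration(phase)
freq_diff = phase_frequency_difference(phase)
```

A steady tone has a linear phase advance, so its phase acceleration is zero; only the frame with the jump produces a non-zero value.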

17

Feed-Forward Neural Network

Remember, the algorithm is simply the FNN convolved across time and frequency.

The target is a mixture of thin Gaussians that represents the expectation of having an onset at time t.
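One way to build such a target curve is to place a narrow Gaussian on each labeled onset and sample the result at the spectrogram frame rate. The slides do not give the Gaussian width or say how overlaps are handled, so the 10 ms standard deviation and the clipping to 1 below are assumptions.

```python
import math


def onset_target(onset_times, n_frames, frame_rate, width=0.01):
    """Target curve: one narrow Gaussian (std `width` seconds) per labeled onset."""
    target = []
    for i in range(n_frames):
        t = i / frame_rate
        value = sum(math.exp(-0.5 * ((t - onset) / width) ** 2)
                    for onset in onset_times)
        # Clip so that overlapping Gaussians still stay within [0, 1].
        target.append(min(value, 1.0))
    return target


# Two labeled onsets at 0.10 s and 0.30 s, 200 frames per second.
target = onset_target([0.10, 0.30], n_frames=100, frame_rate=200.0)
```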

18

Net Inputs

For a decent spectrogram resolution:

Time: 200 bins/s
Frequency: 200 bins

and a window width of 50 ms, we have 2000 input variables (10 time bins × 200 frequency bins).

This is too many!

We randomly sample 200 variables inside the window:

Uniform distribution across frequency
Gaussian distribution across time (more variables near the center)
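The input arithmetic and the sampling scheme can be sketched as follows. The standard deviation of the time distribution, the rejection of out-of-window draws and the choice to avoid duplicate positions are assumptions; the slides only state the two distributions and the counts.

```python
import random


def sample_input_offsets(n_inputs, n_freq_bins, window_frames, time_std, seed=0):
    """Draw distinct (time_offset, freq_bin) positions inside the input window."""
    rng = random.Random(seed)
    half = window_frames // 2
    chosen = set()
    while len(chosen) < n_inputs:
        # Gaussian across time: more variables near the window centre.
        dt = round(rng.gauss(0.0, time_std))
        if abs(dt) > half:
            continue  # reject draws that fall outside the window
        # Uniform across frequency.
        chosen.add((dt, rng.randrange(n_freq_bins)))
    return sorted(chosen)


frame_rate = 200                                  # spectrogram frames per second
n_freq_bins = 200
window_frames = round(0.050 * frame_rate)         # 50 ms -> 10 frames
full_window_size = window_frames * n_freq_bins    # 2000 candidate inputs
offsets = sample_input_offsets(200, n_freq_bins, window_frames, time_std=2.0)
```

The same set of offsets would be reused at every time step, so the network always sees the same positions relative to the window centre.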

19

Net Structure and Training

Two hidden layers:
20 units in the first layer
15 units in the second layer

1 output neuron

Learning algorithm: Polak-Ribiere version of conjugate gradient

K-fold cross-validation for performance estimation
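To give a sense of how small this network is, here is a forward pass through a 200-20-15-1 architecture. The tanh hidden units, the sigmoid output and the weight initialisation are assumptions, since the slides do not state them, and the conjugate-gradient training is not shown.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Small random weights plus a zero bias for one fully connected layer."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(layers, inputs):
    """Propagate one input vector; tanh hidden units, sigmoid output unit."""
    activation = inputs
    for index, (weights, biases) in enumerate(layers):
        pre = [sum(w * a for w, a in zip(row, activation)) + b
               for row, b in zip(weights, biases)]
        if index < len(layers) - 1:
            activation = [math.tanh(z) for z in pre]
        else:
            activation = [1.0 / (1.0 + math.exp(-z)) for z in pre]
    return activation[0]


rng = random.Random(0)
sizes = [200, 20, 15, 1]   # sampled inputs, first hidden, second hidden, output
layers = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
n_params = sum(len(w) * len(w[0]) + len(b) for w, b in layers)
output = forward(layers, [rng.random() for _ in range(200)])
```

Under these assumptions the network has 4351 trainable parameters, and its single output can be read as the onset expectation for the current window position.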

20

Net Output

Most peaks are very sharp, and the background noise is very low.

Some peaks are smaller but can still be detected.

The precision is also very good.

21

Peak-Picking

The neural network only emphasizes the onsets.

We now have to find the location of each onset.

We simply apply a threshold:

A positive crossing is the beginning
A negative crossing is the end
The location is the center of mass

The value of the threshold is learned by exhaustive search.

[Figure: net output crossing the threshold, with the beginning and end of an onset region marked]
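The thresholding rule can be sketched as below: each run of frames above the threshold is one onset, located at the centre of mass of the net output over that run. Weighting by the raw output value, the positive threshold and the toy data are assumptions for this example.

```python
def pick_onsets(net_output, threshold, frame_rate):
    """Locate onsets as centres of mass of the regions above a positive threshold."""
    onsets = []
    start = None
    # A trailing sentinel below threshold closes a region that runs to the end.
    for i, value in enumerate(list(net_output) + [float("-inf")]):
        if value > threshold and start is None:
            start = i                      # positive crossing: region begins
        elif value <= threshold and start is not None:
            region = net_output[start:i]   # negative crossing: region ends
            mass = sum(region)
            centre = sum((start + j) * v for j, v in enumerate(region)) / mass
            onsets.append(centre / frame_rate)
            start = None
    return onsets


net_output = [0.0, 0.1, 0.6, 0.9, 0.6, 0.1, 0.0, 0.0, 0.4, 0.8, 0.2]
onsets = pick_onsets(net_output, threshold=0.3, frame_rate=200.0)
```

An exhaustive search for the threshold would simply call `pick_onsets` over a grid of candidate values and keep the one with the best score on the training data.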

22

F-measure

To maximize performance, we want to find as many of the true onsets as possible (recall).

But we also want to minimize the number of spurious onsets (precision).

The F-measure offers a balance between the two.

$P = \frac{n_{cd}}{N_{\mathrm{found}}}$

$R = \frac{n_{cd}}{N_{\mathrm{target}}}$

$F = \frac{2PR}{P + R}$

Here $n_{cd}$ is the number of correctly detected onsets, $N_{\mathrm{found}}$ the number of detected onsets, and $N_{\mathrm{target}}$ the number of labeled onsets.
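A minimal sketch of the score computation follows. The slides do not say how a detection is matched to a labeled onset, so the 50 ms tolerance and the greedy one-to-one matching are assumptions made for this example.

```python
def f_measure(detected, targets, tolerance=0.05):
    """Precision, recall and F-measure for onset times given in seconds."""
    remaining = sorted(targets)
    n_correct = 0
    for time in sorted(detected):
        # Greedy one-to-one matching: each target explains at most one detection.
        match = next((t for t in remaining if abs(t - time) <= tolerance), None)
        if match is not None:
            remaining.remove(match)
            n_correct += 1
    precision = n_correct / len(detected) if detected else 0.0
    recall = n_correct / len(targets) if targets else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2.0 * precision * recall / (precision + recall)


precision, recall, f = f_measure(detected=[0.10, 0.52, 0.90, 1.30],
                                 targets=[0.11, 0.50, 1.00])
```

In this toy case two of the four detections are correct and two of the three targets are found, so precision is 1/2, recall is 2/3 and the F-measure is 4/7.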

23

MIREX 2005 Results

No other participants used machine learning.

With a simple FNN, we get a huge performance boost.

We also have the best balance between precision and recall.

24

Custom Dataset

For better tests, we built a custom dataset.

It is composed only of complex polyphonic songs with singing.

In total, there are 60 segments of 10 seconds each.

The onsets were all hand-labeled using a graphical user interface.

25

Results for Different Spectrograms

26

Combining Phase and Magnitude Does Not Help

27

Deceptively simple

A complex network structure does not help.

A very simple structure still gets good performance.

A single neuron achieves most of the performance.

Units in 1st layer   Units in 2nd layer   F-measure (validation)
50                   30                   87±5
20                   15                   87±4
10                   5                    87±5
10                   0                    86±4
5                    0                    86±3
2                    0                    85±5
1                    0                    83±4

28

Conclusion

Applying machine learning to the onset detection problem is simple and very effective.

This provides an algorithm that is accurate and robust across a wide variety of songs.

It is not sensitive to hyper-parameter adjustment.

29

Onset labeling GUI

30

Results for Different Spectrograms

Phase acceleration (Bello and Sandler) is only slightly better than noise.

Phase frequency difference is almost as good as the magnitude plane, but depends strongly on the spectral window width.

Constant-Q and STFT give the best results, provided the spectral window width is small enough.