A PREPRINT - arXiv

12
G UITAR E FFECTS R ECOGNITION AND PARAMETER E STIMATION WITH C ONVOLUTIONAL N EURAL N ETWORKS APREPRINT Marco Comunità, Dan Stowell, Joshua D. Reiss {m.comunita, dan.stowell, joshua.reiss}@qmul.ac.uk Centre for Digital Music Queen Mary University of London UK December 8, 2020 ABSTRACT Despite the popularity of guitar effects, there is very little existing research on classification and parameter estimation of specific plugins or effect units from guitar recordings. In this paper, con- volutional neural networks were used for classification and parameter estimation for 13 overdrive, distortion and fuzz guitar effects. A novel dataset of processed electric guitar samples was assembled, with four sub-datasets consisting of monophonic or polyphonic samples and discrete or continuous settings values, for a total of about 250 hours of processed samples. Results were compared for networks trained and tested on the same or on a different sub-dataset. We found that discrete datasets could lead to equally high performance as continuous ones, whilst being easier to design, analyse and modify. Classification accuracy was above 80%, with confusion matrices reflecting similarities in the effects timbre and circuits design. With parameter values between 0.0 and 1.0, the mean absolute error is in most cases below 0.05, while the root mean square error is below 0.1 in all cases but one. Keywords Guitar effects classification · Guitar effects parameters · Audio effects · Deep learning 1 Introduction In music composition, production and engineering, audio effects play an essential role in altering the sound towards the desired final result. For instruments like the electric guitar, the processing signal chain can often be viewed as part of the artist’s creative expression [1]. Entire musical genres and styles are frequently defined and identified by the type of audio effects adopted [2, 3]; and renowned musicians commonly rely on specific combinations of guitars, amplifiers and effects to achieve a unique sound [4]. Through the decades, artists, engineers and producers have defined a palette of sounds that have become a reference for guitar players. In the effort to recreate a specific sound or atmosphere, professionals and amateurs go to great lengths to identify the exact gear that was used in a certain recording. When describing a desired result, people often rely on naming a reference style, artist or song, rather than talking in terms of sound features or effect parameters. Although the design, reconstruction and emulation of audio effects is well-studied [5, 6]; it is less so for their recognition and their parameter estimation. Therefore, in our work, we set out to develop a deep learning model capable of recognising which specific guitar pedal effect was used in a recording as well as estimating its parameters. Being the single most important effect for electric guitar, and the one that usually triggers most discussions, this work focused on nonlinear effects, i.e. overdrive, distortion and fuzz. The rest of this paper is organised as follows: in section 2 we look at the state of the art and work related to guitar effects recognition, section 3 introduces the dataset we assembled for this work, while the networks architecture is described in section 4; sections 5, 6 and 7 outline experiments, results and conclusions. arXiv:2012.03216v1 [cs.SD] 6 Dec 2020

Transcript of A PREPRINT - arXiv

Page 1: A PREPRINT - arXiv

GUITAR EFFECTS RECOGNITION AND PARAMETER ESTIMATIONWITH CONVOLUTIONAL NEURAL NETWORKS

A PREPRINT

Marco Comunità, Dan Stowell, Joshua D. Reiss{m.comunita, dan.stowell, joshua.reiss}@qmul.ac.uk

Centre for Digital MusicQueen Mary University of London

UK

December 8, 2020

ABSTRACT

Despite the popularity of guitar effects, there is very little existing research on classification andparameter estimation of specific plugins or effect units from guitar recordings. In this paper, con-volutional neural networks were used for classification and parameter estimation for 13 overdrive,distortion and fuzz guitar effects. A novel dataset of processed electric guitar samples was assembled,with four sub-datasets consisting of monophonic or polyphonic samples and discrete or continuoussettings values, for a total of about 250 hours of processed samples. Results were compared fornetworks trained and tested on the same or on a different sub-dataset. We found that discrete datasetscould lead to equally high performance as continuous ones, whilst being easier to design, analyse andmodify. Classification accuracy was above 80%, with confusion matrices reflecting similarities in theeffects timbre and circuits design. With parameter values between 0.0 and 1.0, the mean absoluteerror is in most cases below 0.05, while the root mean square error is below 0.1 in all cases but one.

Keywords Guitar effects classification · Guitar effects parameters · Audio effects · Deep learning

1 Introduction

In music composition, production and engineering, audio effects play an essential role in altering the sound towards thedesired final result. For instruments like the electric guitar, the processing signal chain can often be viewed as part ofthe artist’s creative expression [1]. Entire musical genres and styles are frequently defined and identified by the type ofaudio effects adopted [2, 3]; and renowned musicians commonly rely on specific combinations of guitars, amplifiersand effects to achieve a unique sound [4]. Through the decades, artists, engineers and producers have defined a paletteof sounds that have become a reference for guitar players. In the effort to recreate a specific sound or atmosphere,professionals and amateurs go to great lengths to identify the exact gear that was used in a certain recording. Whendescribing a desired result, people often rely on naming a reference style, artist or song, rather than talking in terms ofsound features or effect parameters.

Although the design, reconstruction and emulation of audio effects is well-studied [5, 6]; it is less so for theirrecognition and their parameter estimation. Therefore, in our work, we set out to develop a deep learning model capableof recognising which specific guitar pedal effect was used in a recording as well as estimating its parameters. Being thesingle most important effect for electric guitar, and the one that usually triggers most discussions, this work focused onnonlinear effects, i.e. overdrive, distortion and fuzz.

The rest of this paper is organised as follows: in section 2 we look at the state of the art and work related to guitareffects recognition, section 3 introduces the dataset we assembled for this work, while the networks architecture isdescribed in section 4; sections 5, 6 and 7 outline experiments, results and conclusions.

arX

iv:2

012.

0321

6v1

[cs

.SD

] 6

Dec

202

0

Page 2: A PREPRINT - arXiv

A PREPRINT - DECEMBER 8, 2020

2 Background

The recognition of musical instrument sounds has been of interest to the information retrieval community for longtime [7]; with applications in musical sounds databases, intelligent music search and recommendation systems. Theestimation of instruments’ parameters, as well as the classification of playing gestures or styles has also been extensivelystudied - and applied in contexts like automatic music transcription [8, 9], music education [10] or musicology [11].

Many papers focused on guitar parameters. String, fret and plucking position estimation have been extensively studied,including works on classical and acoustic [12] and electric [13] guitar. Most works are based on features extractionfrom recorded sounds [12, 13, 14, 9, 15], but there are also examples of estimation based on guitar-string physicalmodels [16, 17] for high-tempo and real-time applications. Plucking and pickup position estimation has also beenstudied, with Mohamad et al exploring solutions based on spectral features (comparing recordings with string models)and autocorrelation [18]. The study was also extended to the case of nonlinear audio effects in the signal chain [19].

Classification of playing styles and techniques has also received substantial attention. In [20], the authors compare theperformance of several classifiers (support vector machine (SVM), Gaussian mixture models, nearest neighbours) on 5plucking styles (e.g. finger, pick, slap) from bass guitar recordings. In [21] Schuller et al extend the work to includeexpression styles (e.g. bend, slide, vibrato). Su et al, in [22], apply sparse dictionary learning to classify guitar playingtechniques (e.g. vibrato, hammer-on, pull-off). The same classification problem is solved in [23] using a deep beliefnetwork.

Other examples of work on guitar-related classification problems focused on playing mode (e.g. bass, solo melodicimprovisation, chordal playing) [24], chords fingering [25] and guitar model [26, 27, 28, 29].

However, there is only a small corpus of research on guitar effects recognition [30, 31, 32, 33].

In [30], Stein worked with guitar and bass recordings on recognition of 11 different effects: feedback and slapbackdelay, reverb, chorus, flanger, phaser, tremolo, vibrato, distortion and overdrive. In a subsequent work [31], the authorextended his method - based on spectral, cepstral and harmonic features and SVMs - to cascaded effects. In [33],using the same dataset, the authors aimed to understand which are the most relevant features for the classification taskadopting a ”bag-of-audio-words” approach; while in [32], the authors - making use of specific input test signals, andextracting features from the output - worked on classification of 10 analogue effect units into 5 categories. In [34], theguitar amplifier modelling process includes emulation of nonlinear blocks and estimation of parameters. In these lastexamples, the approach is limited to the case in which the unit to classify/model is accessible, which defeats the purposeof estimation from recordings.

The closest study to our work is [35], and it is the only one that estimates effects parameters from guitar recordings.Similarly to the previous studies, the authors used SVMs to classify the same effects listed above with a reduced featuresset; and extended the work by training shallow neural networks on parameter estimations for 3 effects (distortion,tremolo, delay). The main limitation of this study is the necessity of separate features selection and network training foreach effect.

In all these cases, the authors worked on generic audio effects or categories and, to the best of our knowledge, there isno previous research focusing on classification and parameter estimation of specific plugins or effect units from guitarrecordings.

3 Dataset

We assembled a novel dataset of processed electric guitar samples using unprocessed recordings from the IDMT-SMT-Audio-Effects dataset [30]. The dataset 1 includes monophonic (624 single notes) and polyphonic (420 intervals andchords) recordings (wav - 44.1kHz, 16bit, mono) from 2 different electric guitars, each with two pick-up settings andup to 3 plucking styles. The monophonic recordings cover the common pitch range of a 6-string electric guitar, and thepolyphonic samples were obtained mixing single notes recordings to generate two-notes intervals and 3- or 4-noteschords. All samples are 2 seconds long. The monophonic recordings required removal of background noise before thenote onset, which we obtained using a python script together with Librosa’s [36] onset detection function.

To assemble our dataset we selected 13 overdrive, distortion and fuzz plug-ins (see Table 1) designed to emulate someof the most iconic and widely used analogue guitar effect pedals. All the plugins have 2 or 3 controls and, regardlessof the specific name adopted by the designer, the controls can be identified by their processing function: Level, Gain,Tone/Equalisation.

1https://www.idmt.fraunhofer.de/en/business_units/m2d/smt/audio_effects.html

2

Page 3: A PREPRINT - arXiv

A PREPRINT - DECEMBER 8, 2020

Designer Plugin Emulation of Id

Audified

Ibanez TS808 808Ibanez TS9 TS9Boss BD2 BD2Boss OD1 OD1

MultiDrive Boss SD1 SD1Pedal Pro Boss DS1 DS1

ProCo Rat RATMXR Distortion+ DPLArbiter Fuzz Face FFC

EH Big Muff BMFMercuriall Greed Smasher Mesa/Boogie Grid Slammer MGS

AnalogObsession

Pig Pie EH Russian Big Muff RBMZupaa Vox Tone Bender VTB

Table 1: plugins

Id Level Gain Tone/Eq808 [1.0] [0.2, 0.5, 0.8, 1.0] [0.0, 0.2, 0.5, 0.8, 1.0]TS9 [1.0] [0.2, 0.5, 0.8, 1.0] [0.0, 0.2, 0.5, 0.8, 1.0]BD2 [1.0] [0.2, 0.5, 0.8, 1.0] [0.0, 0.2, 0.5, 0.8, 1.0]OD1 [1.0] [0.2, 0.5, 0.8, 1.0] —SD1 [1.0] [0.2, 0.5, 0.8, 1.0] [0.0, 0.2, 0.5, 0.8, 1.0]DS1 [1.0] [0.2, 0.5, 0.8, 1.0] [0.0, 0.2, 0.5, 0.8, 1.0]RAT [1.0] [0.2, 0.5, 0.8, 1.0] [0.0, 0.2, 0.5, 0.8, 1.0]DPL [1.0] [0.2, 0.5, 0.8, 1.0] —FFC [1.0] [0.0, 0.2, 0.5, 0.8, 1.0] —BMF [1.0] [0.2, 0.5, 0.8, 1.0] [0.0, 0.2, 0.5, 0.8, 1.0]MGS [1.0] [0.2, 0.5, 0.8, 1.0] [0.0, 0.2, 0.5, 0.8, 1.0]RBM [1.0] [0.2, 0.5, 0.8, 1.0] [0.0, 0.2, 0.5, 0.8, 1.0]VTB [1.0] [0.1, 0.2, 0.5, 0.8, 1.0] —

Table 2: settings

For training and testing purposes, 4 sub-datasets were generated, which will be referred to as Mono Discrete, PolyDiscrete, Mono Continuous and Poly Continuous.

The first two subsets (Mono Discrete, Poly Discrete) use a discrete set of combinations selected as the most commonand representative settings a person might use: Gain = [0.0, 0.1, 0.2, 0.5, 0.8, 1.0], Tone/Eq = [0.0, 0.2, 0.5, 0.8, 1.0].Also, since the Level control has no effect on the output timbre it was set to 1.0 for every combination. A summaryof the controls and settings is shown in Table 2. Most plugins do not include Gain values below 0.2 - this is becausefor such values there is no audible change between input and output or even a level attenuation (with no output forGain = 0.0). Every monophonic and polyphonic sample was processed with all the combinations, generating a total of∼200000 processed samples (∼120000 monophonic, ∼80000 polyphonic), for a total of about 110 hours.

For the second two subsets (Mono Continuous, Poly Continuous), both unprocessed samples as well as settings’ valueswere drawn from a uniform distribution. We generated 10000 random samples for each plugin, obtaining a total of260000 samples (130000 monophonic, 130000 polyphonic), equivalent to about 140 hours. Settings values were limitedto fall between the extremes shown in Table 2.

Generating these four subsets we aimed at gaining a deeper understanding about the generalisation capabilities of ourmodels. The samples’ were processed in MATLAB - making use of its VST plugin host features - and both unprocessedinputs and processed outputs were normalised to -6dBFS.

4 Architecture

Our neural network architecture (Table 3) is based on a combination of 2 convolutional and 3 fully connected layers,with batch normalisation layers at each hidden level. Except for the output layers’ size and activation functions, thesame configuration was used to train networks on both effects classification and settings estimation. Four differentnetworks - and training paradigms - were implemented:

• effects classification network (FxNet)• settings estimation network (SetNet)• multitask classification and estimation network (MultiNet) - where the 2 convolutional layers are shared.

3

Page 4: A PREPRINT - arXiv

A PREPRINT - DECEMBER 8, 2020

Layer Size #Fmaps ActivationConvolution 2D 5x5 6 LinearBatch Normalisation - - -Activation - - ReLUMax Pooling 2x2 - -Convolution 2D 5x5 12 LinearBatch Normalisation - - -Activation - - ReLUMax Pooling 2x2 - -Fully Connected 120 - LinearBatch Normalisation - - -Activation - - ReLUFully Connected 60 - LinearBatch Normalisation - - -Activation - - ReLUFully Connected (1) - (2)(1) FxNet = #Plug-ins - SetNet = #Settings(2) FxNet = Linear - SetNet = Tanh

Table 3: architecture

• settings estimation conditional network (SetNetCond) - where an extra embedding layer is added to conditionthe estimation on the effect class.

The loss functions adopted for the classification and estimation problems were, respectively, the cross-entropy loss andthe mean square error (MSE). To evaluate the settings estimation networks, we also defined an accuracy metric forwhich a prediction is considered correct when the absolute error for every parameter is less then 0.1. For effects that donot have a Tone/Eq control, we represented the absence using a value of -1.0 for the prediction. The threshold of 0.1was chosen to simplify the comparison of networks performance in different training settings and on different datasets.The value is based on the authors’ experience and informal listening tests during the dataset creation phase. However,parameters’ sensitivity varies between effects and controls, and differences in absolute value often do not relate linearlywith perceptual differences. To overcome this limitations we do rely also on mean absolute error (MAE) and root meansquare error (RMSE) to evaluate our models.

As input features to all our networks we used mel power-spectrograms, extracted from audio in 23 ms Hann windowswith 50% overlap. In total, 128 mel-bands were used in the 0–22,050 Hz range. For a given 2 s audio input, the featureextraction produced a T x 128 output (T = 87). These features are motivated by human auditory perception, and are acommon choice in acoustic scene classification [37]. Due to the good performance of our models, we did not deemimportant for this study to test other features, but it could be worth exploring in the future.

5 Experiments

Some preliminary experiments were conducted to obtain baseline performances for the settings estimation problem andto compare the results when using a multitask approach or a conditional network. The literature shows how multitasklearning can be effective at solving related tasks [38] and more efficient than training several networks. In the multitaskparadigm, the network was trained to classify a sample and estimate its settings at the same time. In the conditionalparadigm the networks were trained in "series", with the classification network (FxNet) used to condition the settingsestimation network (SetNetCond). The experiments were conducted on the Mono Discrete dataset.

All networks were trained for 50 epochs, which resulted in the following test accuracy:

• SetNet = 40.3%

• MultiNet = 40.88% (87.0% classification accuracy, 44.6% estimation accuracy)

• FxNet + SetNetCond = 57.3% (89.7% classification accuracy, 60.7% estimation accuracy)

The results show no appreciable difference in effect classification accuracy between the multitask and the conditionalparadigms but do show an impact on the settings estimation accuracy, with the conditional network performing better.Further experiments were therefore centred on the analysis of the classification network (FxNet) and the conditionalestimation network (SetNetCond) when trained/tested on the 4 different sub-datasets. In the following section we

4

Page 5: A PREPRINT - arXiv

A PREPRINT - DECEMBER 8, 2020

illustrate the results of training the networks for 100 epochs with an early stop when the validation accuracy sees noimprovement for 15 epochs. For weights update we opted for the Adam optimiser [39] with a fixed learning rate of0.001.

6 Results

6.1 Effects Classification

Table 4 shows the accuracy results for the effect classification problem. Note that the test accuracy is higher for networkstrained on continuous datasets, respectively 90.9% for the Mono Continuous dataset and 91.4% for the Poly Continuousdataset. At the same time, networks trained on discrete datasets performed better when tested on continuous onesthan the opposite condition (networks trained on continuous and tested on discrete datasets): 83.1% vs 81.3% formonophonic samples and 89.4% vs 84.1% for polyphonic ones.

TrainTest Mono Disc. Mono Cont. Poly Disc. Poly Cont.

Mono Disc. 86.3 83.1 — —Mono Cont. 81.3 90.9 — —Poly Disc. — — 88.4 89.4Poly Cont. — — 84.1 91.4

Table 4: FxNet accuracy (%)

By analysing the confusion matrices, we can gain a better understanding about the networks’ performance as well as thechallenges behind the classification. Figure 1 shows the details for networks trained on discrete datasets. About 10% ofthe errors in both datasets are due to the misclassification between 808 and TS9. The plugins are emulations of twooverdrive effects from the same manufacturer (Ibanez TS808 and TS9). The two effects are supposed to have similargain and frequency response, but, upon studying the circuits schematics, we noticed how there is actually no differencebetween the two. Therefore - assuming the plugins are faithful models of the analogue circuits - it is plausible thatour network would confuse samples from either effects. An explanation for the misclassification imbalance betweenmonophonic and polyphonic samples might be related to the training procedure. We used batches of 100 samples,randomly selected across the whole dataset and without control for batch to dataset ratio over the 13 classes. Assumingidentical plugins, the network will tend to classify samples from either as belonging to the class that was seen first ormore during training.

Similar observations are valid for errors in classifying OD1 and SD1. Again, the plugins are emulations of effect pedalsfrom the same manufacturer, and in their original analogue version use very similar designs. The two share the sameclipping circuit; although, while the SD1 includes a tone control that combines a treble boosting first order shelvingfilter with a first order lowpass, the OD1 has no tone control and a fixed first order lowpass. Analysing the classificationerrors for the Mono Discrete dataset, we noticed how, of the 345 times the OD1 is classified as SD1: 31% of the timesis when the Gain control is set to 0.2, and another 31% when Gain = 0.5. Ignoring the specific note being played, thisresult might be consequence of low gain settings, where the spectral differences might be too small. On the other hand,the SD1 is confused for the OD1 88 times, and all cases are from samples where the Tone is set to 0, 0.2 or 0.5. For lowTone values, the spectral differences might be unnoticeable, most of the high frequency harmonics might be filtered. Inthis case we did not observe correlation with the Gain control values.

Another interesting example, more noticeable for the Mono Discrete dataset, is the misclassification of the DPL asRAT. In this case, we are referring to effects from different manufacturers which do share some circuit design choices.Although the DPL does not have a tone control, the two circuits use similar clipping stages and the same maximumgain. But, they differ in the filtering after the clipping section and in the type of clipping diodes: germanium in the DPL,silicon in the RAT; with the first type determining a softer clipping. Looking at the results we noticed how 72% of thetimes the DPL is classified as RAT, the Gain control was set to 1.0. On the other hand, 60% of the times the RAT wasclassified as DPL, the Gain was set to 0.5 and the Tone (a first order lowpass) to 0 (no high frequency attenuation). Thismight be related to the clipping diodes, a low gain with silicon diodes might be comparable to the softer clipping ofgermanium diodes.

It is also relevant to observe that 808, TS9, MGS and SD1, despite being based on the same circuit, with similar clippingsections and tone controls, are almost never confused.

To analyse the performance of our classifiers trained on continuous datasets and tested on discrete ones, we refer toFigure 2. It can be noticed how the errors are similar to the previous cases, although, a major impact on the performance

5

Page 6: A PREPRINT - arXiv

A PREPRINT - DECEMBER 8, 2020

808

BD2

BMF

DPL

DS1

FFC

MGS

OD1

RAT

RBM

SD1

TS9

VTB

Predicted label

808

BD2

BMF

DPL

DS1

FFC

MGS

OD1

RAT

RBM

SD1

TS9

VTB

Truelabel

0.40 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.59 0.00

0.00 0.99 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

0.00 0.00 0.01 0.79 0.00 0.00 0.01 0.00 0.19 0.01 0.00 0.00 0.00

0.00 0.00 0.01 0.00 0.99 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

0.00 0.00 0.00 0.00 0.00 0.98 0.00 0.00 0.00 0.01 0.00 0.00 0.00

0.00 0.00 0.00 0.00 0.00 0.00 0.97 0.00 0.00 0.00 0.02 0.00 0.00

0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.28 0.00 0.00 0.71 0.00 0.00

0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.99 0.00 0.00 0.00 0.00

0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00

0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.04 0.00 0.00 0.95 0.00 0.00

0.45 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.54 0.00

0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00

FxNet: train = Mono Discrete - test = Mono Discrete

808

BD2

BMF

DPL

DS1 FF

CMGS

OD1

RAT

RBM

SD1

TS9

VTB

Predicted label

808

BD2

BMF

DPL

DS1

FFC

MGS

OD1

RAT

RBM

SD1

TS9

VTB

Truelabel

0.96 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.04 0.00

0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

0.00 0.00 0.00 0.97 0.00 0.00 0.00 0.00 0.03 0.00 0.00 0.00 0.00

0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00

0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.83 0.00 0.00 0.17 0.00 0.00

0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.99 0.00 0.00 0.00 0.00

0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00

0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.06 0.00 0.00 0.93 0.00 0.00

0.98 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.02 0.00

0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00

FxNet: train = Poly Discrete - test = Poly Discrete

Figure 1: confusion matrices for discrete datasets

808

BD2

BMF

DPL

DS1

FFC

MGS

OD1

RAT

RBM

SD1

TS9

VTB

Predicted label

808

BD2

BMF

DPL

DS1

FFC

MGS

OD1

RAT

RBM

SD1

TS9

VTB

Truelabel

0.69 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.31 0.00

0.00 0.37 0.03 0.00 0.42 0.00 0.01 0.00 0.01 0.00 0.02 0.00 0.13

0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

0.00 0.00 0.00 0.99 0.00 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.00

0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

0.00 0.00 0.00 0.00 0.00 0.00 0.98 0.01 0.00 0.00 0.01 0.00 0.00

0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.94 0.00 0.00 0.06 0.00 0.00

0.00 0.02 0.00 0.06 0.00 0.00 0.01 0.00 0.91 0.00 0.00 0.00 0.00

0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.99 0.00 0.00 0.00

0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.06 0.00 0.00 0.93 0.00 0.00

0.69 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.31 0.00

0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.99

FxNet: train = Mono Continuous - test = Mono Discrete

808

BD2

BMF

DPL

DS1 FF

CMGS

OD1

RAT

RBM

SD1

TS9

VTB

Predicted label

808

BD2

BMF

DPL

DS1

FFC

MGS

OD1

RAT

RBM

SD1

TS9

VTB

Truelabel

0.13 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.87 0.00

0.00 0.57 0.18 0.00 0.04 0.00 0.01 0.00 0.00 0.04 0.12 0.03 0.00

0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

0.00 0.00 0.00 0.99 0.00 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.00

0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

0.00 0.00 0.00 0.00 0.00 0.00 0.94 0.01 0.00 0.00 0.05 0.00 0.00

0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.95 0.00 0.00 0.04 0.00 0.00

0.00 0.01 0.00 0.01 0.00 0.00 0.01 0.00 0.98 0.00 0.00 0.00 0.00

0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00

0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.01 0.00 0.00 0.95 0.02 0.00

0.13 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.87 0.00

0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00

FxNet: train = Poly Continuous - test = Poly Discrete

Figure 2: confusion matrices for test on discrete datasets

is due to the misclassification of BD2. In this case, we could not identify correlations between the different circuitdesigns and/or the controls values.

6.2 Settings Estimation

In this section we present the results for the settings estimation problem using the conditional network (SetNetCond)conditioned on the effect class ground truth. Table 5 shows the accuracy results, where all settings are estimated with anerror below 0.1. Networks trained and tested on polyphonic datasets are the best performing, probably due to richerinformation content of the spectra with respect to monophonic samples. For both monophonic and polyphonic samples,

6

Page 7: A PREPRINT - arXiv

A PREPRINT - DECEMBER 8, 2020

TrainTest Mono Disc. Mono Cont. Poly Disc. Poly Cont.

Mono Dis. 80.06 68.56 — —Mono Cont. 68.51 85.14 — —Poly Disc. — — 90.75 75.74Poly Cont. — — 88.93 97.01

Table 5: SetNetCond accuracy (%)

Train Set Gain Tone/Eq Test SetMAE (RMSE) MAE (RMSE)

Mono Disc. 0.030 (0.061) 0.039 (0.070) Mono Disc.0.064 (0.084) 0.044 (0.080) Mono Cont.

Mono Cont. 0.062 (0.096) 0.067 (0.108) Mono Disc.0.033 (0.045) 0.039 (0.072) Mono Cont.

Poly Disc. 0.017 (0.033) 0.024 (0.047) Poly Cont.0.055 (0.070) 0.038 (0.062) Poly Cont.

Poly Cont. 0.036 (0.063) 0.036 (0.062) Poly Disc.0.020 (0.028) 0.019 (0.034) Poly Cont.

Avg 0.040 (0.060) 0.038 (0.066)Table 6: SetNetCond errors

the networks trained and tested on continuous settings reach higher accuracy than their counterparts trained and testedon discrete settings. This is somewhat surprising since the estimation of discrete values was expected to be a simplerproblem to solve.

Further insights are offered by Table 6, where we show mean absolute error (MAE) and root mean square error (RMSE)for the different training and testing configurations. In the majority of cases (12 out of 16) the MAE is below 0.05and for all cases except one, the RMSE is below 0.1. We obtained the lowest errors when training and testing onpolyphonic samples. The highest errors result from training on monophonic continuous and discrete samples and testingrespectively on discrete and continuous ones. The table also includes the average errors for the Gain and Tone/Eqcontrols; the two are comparable, which shows that they present similar difficulty to the estimation. Figure 3 shows thebox-plots for the best and worst cases highlighted in Table 6.

Moreover, we wanted to analyse the generalisation capabilities of the networks trained on discrete settings, as well asthe performance of networks trained on continuous settings on discrete ones. Figure 4 shows the scatter plots for thenetwork trained on the Mono Discrete dataset when tested on the Mono Continuous dataset. For the Gain estimation wenotice a bias towards the discrete values seen during training; also, the network manages to interpolate and estimatecontinuous values, but seems to do it satisfactorily only for Gain values above 0.5. The Tone estimation seems to beless affected, and the output approximates fairly well the input uniform distribution. An explanation for this differencemight reside in the fact that for the Tone control we chose a balanced set of discrete values ([0.0, 0.2, 0.5, 0.8, 1.0]). Forthe Gain control this was not possible. As explained in section 3, most distortion effects do not produce any perceivabletimbral difference for low gain values and some introduce attenuation.

The scatter plots for the network trained on the Mono Continuous and tested on the Mono Discrete datasets (Figure 5)show some interesting behaviour. Tested on a discrete dataset, the network fails to maintain the performance at the samelevels as on the continuous one. In particular, we notice a higher variance, especially for Gain = [0.2, 0.5] and for mostof the Tone values. Also, a skew in the estimations’ distributions is visible. To analyse these in more details, Figure 6and 7 show the mean error and skew as a function of Gain and Tone for those networks trained on discrete datasetsand tested on continuous ones and vice-versa. Except for the case of Gain values below 0.1 in Figure 7a (train on PolyDiscrete and test on Poly Continuous), all mean errors for tests on opposite datasets are below 0.1. With the sameexception, the mean errors for training on discrete datasets and test on continuous ones are lower than 0.05 (Figures 6aand 7a). The skew for training on discrete and test on continuous datasets (Figures 6b and 7b) is in many cases lowerthan the inverse conditions (Figures 6b, 6d and 7b, 7d).

To conclude, even if the networks perform better on continuous datasets, there seems to be an argument for consideringdiscrete values. A dataset which uses discrete values for the independent variables is easier to design, control, analyseand to extend or reduce. In our specific application, an estimation error within 0.1 of the target value or some bias isacceptable. Also, the use of balanced values or a form of regularisation in the cost function could help the interpolationin case of estimation on continuous values or unseen data.

7

Page 8: A PREPRINT - arXiv

A PREPRINT - DECEMBER 8, 2020

Gain Tone/Eq

Control

−0.20

−0.15

−0.10

−0.05

0.00

0.05

0.10

0.15

0.20

Error

train = Poly Disc.test = Poly Disc.

Gain Tone/Eq

Control

−0.20

−0.15

−0.10

−0.05

0.00

0.05

0.10

0.15

0.20

Error

train = Poly Cont.test = Poly Cont.

Gain Tone/Eq

Control

−0.20

−0.15

−0.10

−0.05

0.00

0.05

0.10

0.15

0.20

Error

train = Mono Disc.test = Mono Cont.

Gain Tone/Eq

Control

−0.20

−0.15

−0.10

−0.05

0.00

0.05

0.10

0.15

0.20

Error

train = Mono Cont.test = Mono Disc.

Figure 3: settings estimation errors

7 Conclusions and Future Work

In this paper we introduced a CNN-based model for the classification of guitar effects and estimation of their parameters.Using electric guitar recordings of single notes, 2-notes intervals and 3- or 4- notes chords; we were able to classify withhigh accuracy samples processed with 13 overdrive, distortion and fuzz plugins. The plugins we used are emulationsof some of the most famous and commonly used analogue guitar effect pedals. We were also able to estimate theparameters settings with high precision. For this work, we generated a dataset of processed monophonic and polyphonicsamples; the plugins’ settings were either combinations of discrete values commonly used by musicians or continuousvalues randomly extracted from a uniform distribution. This allowed us to gain further understanding about theperformance on datasets generated using discrete or continuous independent variables. We described the benefits ofdiscrete datasets and postulated about the possibility of obtaining equally high performances.

To the best of our knowledge, our work is the first attempt at both classification and parameter estimation of specificguitar effect units. Being able to extrapolate information about the specific signal chain used in a recording couldbenefit music production; artists, engineers and producers often rely on "reference" sounds from other recordings.This knowledge could also be applied to automatic music transcription [8], music education [10] or musicology [11].Musicians, genres and styles are identified by and associated with specific sounds and effects; this could be applied tointelligent music search and recommendation systems.

Future work might branch out in different directions: our model could be compared with human performance; thearchitecture and datasets could be extended for higher accuracy and better generalisation on unseen data; and the researchcould be extended to other nonlinear units, different categories of effects, or tested on other guitars or instruments.

8 Links

Code:

• https://github.com/mcomunita/gfx_classifier

• https://github.com/mcomunita/gfx_classifier_models_and_results

Dataset:

• Mono Cont. - https://doi.org/10.5281/zenodo.4296040

8

Page 9: A PREPRINT - arXiv

A PREPRINT - DECEMBER 8, 2020

• Mono Disc. - https://doi.org/10.5281/zenodo.4298000• Poly Cont. - https://doi.org/10.5281/zenodo.4298017• Poly Disc. - https://doi.org/10.5281/zenodo.4298025

9 Acknowledgement

This study has been funded by UKRI and EPSRC as part of the “UKRI Centre for Doctoral Training in ArtificialIntelligence and Music”, under grant EP/S022694/1.

References

[1] Alex Case. Recording electric guitar—the science and the myth. Journal of the Audio Engineering Society,58(1/2):80–83, 2010.

[2] John Blair. Southern California surf music, 1960-1966. Arcadia Publishing, 2015.[3] Sean Williams. Tubby’s dub style: The live art of record production. 2012.[4] Pete Prown and Lisa Sharken. Gear secrets of the guitar legends: how to sound like your favorite players. Hal

Leonard Corporation, 2003.

Figure 4: scatter plots for settings estimation on continuous datasets

Figure 5: scatter plots for settings estimation on discrete datasets

9

Page 10: A PREPRINT - arXiv

A PREPRINT - DECEMBER 8, 2020

0.0 0.2 0.4 0.6 0.8 1.0

Target

−0.15

−0.10

−0.05

0.00

0.05

0.10

0.15

MeanError

train = Mono Disc.test = Mono Cont.

Gain

Tone

(a)

0.0 0.2 0.4 0.6 0.8 1.0

Target

−4

−2

0

2

4

Skew

train = Mono Disc.test = Mono Cont.

Gain

Tone

(b)

0.0 0.2 0.4 0.6 0.8 1.0

Target

−0.15

−0.10

−0.05

0.00

0.05

0.10

0.15

MeanError

train = Mono Cont.test = Mono Disc.

Gain

Tone

(c)

0.0 0.2 0.4 0.6 0.8 1.0

Target

−4

−2

0

2

4

Skew

train = Mono Cont.test = Mono Disc.

Gain

Tone

(d)

Figure 6: mean error and skew for networks trained/tested on monophonic samples

0.0 0.2 0.4 0.6 0.8 1.0

Target

−0.15

−0.10

−0.05

0.00

0.05

0.10

0.15

MeanError

train = Poly Disc.test = Poly Cont.

Gain

Tone

0.0 0.2 0.4 0.6 0.8 1.0

Target

−4

−2

0

2

4

Skew

train = Poly Disc.test = Poly Cont.

Gain

Tone

0.0 0.2 0.4 0.6 0.8 1.0

Target

−0.15

−0.10

−0.05

0.00

0.05

0.10

0.15

MeanError

train = Poly Cont.test = Poly Disc.

Gain

Tone

0.0 0.2 0.4 0.6 0.8 1.0

Target

−4

−2

0

2

4

Skew

train = Poly Cont.test = Poly Disc.

Gain

Tone

Figure 7: mean error and skew for networks trained/tested on polyphonic samples

[5] Udo Zölzer. DAFX: digital audio effects. John Wiley & Sons, 2011.

[6] Jyri Pakarinen and David T Yeh. A review of digital techniques for modeling vacuum-tube guitar amplifiers.Computer Music Journal, 33(2):85–100, 2009.

[7] Perfecto Herrera-Boyer, Geoffroy Peeters, and Shlomo Dubnov. Automatic classification of musical instrumentsounds. Journal of New Music Research, 32(1):3–21, 2003.

[8] Emmanouil Benetos, Simon Dixon, Dimitrios Giannoulis, Holger Kirchhoff, and Anssi Klapuri. Automatic musictranscription: challenges and future directions. Journal of Intelligent Information Systems, 41(3):407–434, 2013.

[9] Christian Kehling, Jakob Abeßer, Christian Dittmar, and Gerald Schuller. Automatic tablature transcription ofelectric guitar recordings by estimation of score-and instrument-related parameters. In DAFx, pages 219–226,2014.

[10] Christian Dittmar, Estefanía Cano, Jakob Abeßer, and Sascha Grollmisch. Music information retrieval meetsmusic education. In Dagstuhl Follow-Ups, volume 3. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2012.

[11] Jakob Abeßer, Klaus Frieler, Estefanía Cano, Martin Pfleiderer, and Wolf-Georg Zaddach. Score-informed analysisof tuning, intonation, pitch modulation, and dynamics in jazz solos. IEEE/ACM Transactions on Audio, Speech,and Language Processing, 25(1):168–177, 2016.

[12] Isabel Barbancho, Lorenzo J Tardon, Ana M Barbancho, and Simone Sammartino. Pitch and played stringestimation in classic and acoustic guitars. In Audio Engineering Society Convention 126, 2009.

[13] Jakob Abeßer. Automatic string detection for bass guitar and electric guitar. In International Symposium onComputer Music Modeling and Retrieval, pages 333–352. Springer, 2012.

[14] Christian Dittmar, Andreas Männchen, and Jakob Abeber. Real-time guitar string detection for music educationsoftware. In 2013 14th International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS),pages 1–4. IEEE, 2013.

[15] Tobias Geib, Maximilian Schmitt, and Björn Schuller. Automatic guitar string detection by string-inverse frequencyestimation. INFORMATIK 2017, 2017.

[16] Jacob Møller Hjerrild and Mads Græsbøll Christensen. Estimation of guitar string, fret and plucking positionusing parametric pitch estimation. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speechand Signal Processing (ICASSP), pages 151–155. IEEE, 2019.

10

Page 11: A PREPRINT - arXiv

A PREPRINT - DECEMBER 8, 2020

[17] Jacob Møller Hjerrild, Silvin Willemsen, and Mads Græsbøll Christensen. Physical models for fast estimation ofguitar string, fret and plucking position. In 2019 IEEE Workshop on Applications of Signal Processing to Audioand Acoustics (WASPAA), pages 155–159. IEEE, 2019.

[18] Zulfadhli Mohamad, Simon Dixon, and Christopher Harte. Pickup position and plucking point estimation on anelectric guitar via autocorrelation. The Journal of the Acoustical Society of America, 142(6):3530–3540, 2017.

[19] Zulfadhli Mohamad, Simon Dixon, and Christopher Harte. Pickup position and plucking point estimation on anelectric guitar. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP),pages 651–655. IEEE, 2017.

[20] Jakob Abeßer, Hanna Lukashevich, and Gerald Schuller. Feature-based extraction of plucking and expressionstyles of the electric bass guitar. In 2010 IEEE International Conference on Acoustics, Speech and SignalProcessing, pages 2290–2293. IEEE, 2010.

[21] Gerald Schuller, Jakob Abeßer, and Christian Kehling. Parameter extraction for bass guitar sound models includingplaying styles. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP),pages 404–408. IEEE, 2015.

[22] Li Su, Li-Fan Yu, and Yi-Hsuan Yang. Sparse cepstral, phase codes for guitar playing technique classification. InISMIR, pages 9–14, 2014.

[23] Chien-Yao Wang, Pao-Chi Chang, Jian-Jiun Ding, Tzu-Chiang Tai, Andri Santoso, Yu-Ting Liu, and Jia-ChingWang. Spectral-temporal receptive field-based descriptors and hierarchical cascade deep belief network for guitarplaying technique classification. IEEE Transactions on Cybernetics, 2020.

[24] Raphael Foulon, Pierre Roy, and François Pachet. Automatic classification of guitar playing modes. In Interna-tional Symposium on Computer Music Multidisciplinary Research, pages 58–71. Springer, 2013.

[25] Ana M Barbancho, Anssi Klapuri, Lorenzo J Tardón, and Isabel Barbancho. Automatic transcription of guitarchords and fingering from audio. IEEE Transactions on Audio, Speech, and Language Processing, 20(3):915–921,2011.

[26] Kerstin Dosenbach, Wolfgang Fohl, and Andreas Meisel. Identification of individual guitar sounds by supportvector machines. In Proceedings of the Conference on Digital Audio Effects (DAFx), 2008.

[27] David Johnson and George Tzanetakis. Guitar model recognition from single instrument audio recordings. In2015 IEEE Pacific Rim Conference on Communications, Computers and Signal Processing (PACRIM), pages370–375. IEEE, 2015.

[28] Renato de Castro Rabelo Profeta and Gerald Schuller. Feature-based classification of electric guitar types. In JointEuropean Conference on Machine Learning and Knowledge Discovery in Databases, pages 478–484. Springer,2019.

[29] Renato Profeta and Gerald Schuller. Comparison of human and machine recognition of electric guitar types. InAudio Engineering Society Convention 147. Audio Engineering Society, 2019.

[30] Michael Stein, Jakob Abeßer, Christian Dittmar, and Gerald Schuller. Automatic detection of audio effects inguitar and bass recordings. In Audio Engineering Society Convention 128, 2010.

[31] Michael Stein. Automatic detection of multiple, cascaded audio effects in guitar recordings. In Proceedings of the13th International Conference on Digital Audio Effects (DAFx), pages 4–7, 2010.

[32] Felix Eichas, Marco Fink, and Udo Zölzer. Feature design for the classification of audio effect units by input/outputmeasurements. DAFx-15, 2015.

[33] Maximilian Schmitt and Björn Schuller. Recognising guitar effects-which acoustic features really matter?INFORMATIK 2017, 2017.

[34] Felix Eichas and Udo Zölzer. Gray-box modeling of guitar amplifiers. Journal of the Audio Engineering Society,66(12):1006–1015, 2018.

[35] Henrik Jürgens, Reemt Hinrichs, and Jörn Ostermann. Recognizing guitar effects and their parameter settings.[36] Brian McFee, Colin Raffel, Dawen Liang, Daniel PW Ellis, Matt McVicar, Eric Battenberg, and Oriol Nieto.

librosa: Audio and music signal analysis in python. In Proceedings of the 14th python in science conference,volume 8, pages 18–25, 2015.

[37] Jakob Abeßer. A review of deep learning based methods for acoustic scene classification. Applied Sciences, 10(6),2020.

[38] Sebastian Ruder. An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098,2017.

11

Page 12: A PREPRINT - arXiv

A PREPRINT - DECEMBER 8, 2020

[39] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980,2014.

12