NEURAL NETWORKS FOR AUTOMATIC POLYPHONIC PIANO MUSIC
TRANSCRIPTION
by
JOHNATHON MICHAEL ENDER
B.S., University of Wisconsin Madison, 2013
A thesis submitted to the Graduate Faculty of the
University of Colorado Colorado Springs
in partial fulfillment of the
requirements for the degree of
Master of Science
Department of Computer Science
2018
©2018
JOHNATHON MICHAEL ENDER
ALL RIGHTS RESERVED
This thesis for the Master of Science degree by
Johnathon Michael Ender
has been approved for the
Department of Computer Science
by
Jugal Kalita, Chair
Albert Glock
Sudhanshu Semwal
Date: December 11, 2018
Ender, Johnathon Michael (M.S., Computer Science)
Neural Networks for Automatic Polyphonic Piano Music Transcription
Thesis directed by Professor Jugal Kalita
ABSTRACT
Automatic Music Transcription is the process of producing an accurate musical representation from a polyphonic audio signal and continues to challenge state-of-the-art techniques. To produce reliable transcriptions, human transcribers must estimate the active notes within an audio sequence and apply a musical format, a task which requires extensive time and expertise. While automatic transcription techniques do not produce the same level of accuracy as human transcribers, modern machine learning techniques in the form of neural network models have recently shown improvement in the area. This thesis investigates the task of automatic polyphonic music transcription with the implementation and variation of supervised machine learning models trained on various partitions of the MAPS dataset. The idea is to generate a frame-based pitch activation matrix from an audio sequence which is subsequently classified into active or inactive note events. The results are presented as iterative improvements on network models, in terms of transcription accuracy, and include findings on the implementation of Bidirectional Long Short-Term Memory and Network Ensembling to significantly improve transcription results.
TABLE OF CONTENTS
CHAPTER
I Introduction
    1 Music
    2 Thesis Structure
II State of the Art in Polyphonic Piano Music Transcription
    1 Sigtia et al. 2016
    2 Wang et al. 2018
    3 Liu et al. 2018
    4 Valero-Mas et al. 2018
    5 Chapter Summary
III Developing an AMT Network
    1 AMT Network Model
    2 The MAPS Dataset
        2.1 Preprocessing
    3 Evaluation
    4 Chapter Summary
IV Preliminary AMT Network Investigation
    1 Notes and Chords
    2 Music and Chords
    3 Network Ensembling
    4 Chapter Summary
V Improving AMT Network Models
    1 LSTM Network Performance
    2 BLSTM Network Performance
    3 Ensemble Performance
VI Conclusion
    1 Future Work
REFERENCES
APPENDIX
    A LSTM Network Results
    B BLSTM Network Results
    C Ensemble Network Results
LIST OF FIGURES
1.1 Note: CQT and Label Comparison
1.2 Chord: CQT and Label Comparison
1.3 Music: CQT and Label Comparison
2.1 Network Structure: Sigtia et al. 2016
2.2 Network Structure: Wang et al. 2018
2.3 Network Structure: Liu et al. 2018
2.4 Network Structure: Valero-Mas et al. 2018
3.1 Basic LSTM AMT Network Structure
3.2 Label Generation Algorithm
4.1 Notes and Chords Ground Truth: First 2000 Frames
4.2 Notes and Chords Transcription: First 2000 Frames
4.3 Notes and Chords Transcription Error: First 2000 Frames
4.4 LSET Ground Truth: First 2000 Frames
4.5 LSET Transcription: First 2000 Frames
4.6 LSET Transcription Error: First 2000 Frames
4.7 Network Ensembling
4.8 3La 5It Lset Transcription Error: First 2000 Frames
4.9 Ens Miter Lset: First 2000 Frames
5.1 Frame Context Window
5.2 Two and Three Convolutional Layer Networks
5.3 Per File Network Training
5.4 Bidirectional Recurrent Neural Network
5.5 Transcription Error: Acoustic Dataset 1 - scn16 4
5.6 Comparison of Composition Complexity
5.7 Label and Transcription Comparison: scn16 4
LIST OF TABLES
3.1 MAPS: Instruments and Recording Conditions
3.2 MIDI Event File Format
4.1 Investigation Data Partitions
4.2 Evaluation Metrics: Notes and Chords
4.3 Evaluation Metrics: LSET
4.4 Single Network Model Descriptions
4.5 Single Network Model Results
4.6 Ensemble Network Contents
4.7 Ensemble Network Results
5.1 Context Window Results - Two Convolutional Layers
5.2 Context Window Results - Three Convolutional Layers
5.3 Test Set Partitions
5.4 LSTM Network Result Summary
5.5 BLSTM Network Result Summary
5.6 Ensemble Network Contents
5.7 Ensemble Network Result Summary
5.8 Ensemble Best and Worst File Results
5.9 Network Result Summary
6.1 State of the Art Comparison
A.1 Digital Results: Single LSTM - Dataset 1
A.2 Acoustic Results: Single LSTM - Dataset 1
A.3 Digital Results: Single LSTM - Dataset 2
A.4 Acoustic Results: Single LSTM - Dataset 2
A.5 Digital Results: Single LSTM - Dataset 3
A.6 Acoustic Results: Single LSTM - Dataset 3
A.7 Digital Results: Single LSTM - Dataset 4
A.8 Acoustic Results: Single LSTM - Dataset 4
A.9 Digital Results: Single LSTM - Dataset 5
A.10 Acoustic Results: Single LSTM - Dataset 5
B.1 Digital Results: Bidirectional LSTM - Dataset 1
B.2 Acoustic Results: Bidirectional LSTM - Dataset 1
B.3 Digital Results: Bidirectional LSTM - Dataset 2
B.4 Acoustic Results: Bidirectional LSTM - Dataset 2
B.5 Digital Results: Bidirectional LSTM - Dataset 3
B.6 Acoustic Results: Bidirectional LSTM - Dataset 3
B.7 Digital Results: Bidirectional LSTM - Dataset 4
B.8 Acoustic Results: Bidirectional LSTM - Dataset 4
B.9 Digital Results: Bidirectional LSTM - Dataset 5
B.10 Acoustic Results: Bidirectional LSTM - Dataset 5
C.1 Digital Results: Ensemble - Dataset 1
C.2 Acoustic Results: Ensemble - Dataset 1
C.3 Digital Results: Ensemble - Dataset 2
C.4 Acoustic Results: Ensemble - Dataset 2
C.5 Digital Results: Ensemble - Dataset 3
C.6 Acoustic Results: Ensemble - Dataset 3
C.7 Digital Results: Ensemble - Dataset 4
C.8 Acoustic Results: Ensemble - Dataset 4
C.9 Digital Results: Ensemble - Dataset 5
C.10 Acoustic Results: Ensemble - Dataset 5
CHAPTER I
INTRODUCTION
Automatic Music Transcription (AMT) is a process which converts an audio signal to a
general purpose high-level symbolic representation of the content. The task of
transcription is a subfield in Music Information Retrieval (MIR), which has studied AMT
extensively due to its applications in music preservation and annotation, music similarity
and retrieval, among others [1]. Producing accurate transcriptions requires extensive time
and expertise when performed by human transcribers who estimate the active notes within
an audio sequence and apply a musical format. To accomplish this, such a transcriber must
be able to accurately identify several aspects within the audio including pitch, tempo, and
rhythm. This often requires an audio segment to be reviewed many times. The complex
and time consuming nature of the transcription task presents opportunity for the utilization
of machine learning techniques as an alternative to human labor. The following sections
provide a background on the composition and complexity of music, then outline the results and content of this thesis.
1 Music
Piano music is composed of varying note and chord sequences, which represent melodic
and harmonic concepts. In terms of transcription, a note is represented as an individual
pitch with a beginning and an ending, referred to as onset and offset respectively; a chord is a group of notes which share onset and offset timings. A sequence of notes and chords,
known as a melody, is structured by tempo, the speed at which the music is played, and
time signature which represents the rhythm. A melody is further structured by restricting
notes and chords to a key, representing a group of related pitches, though composers may
include notes outside of the key for emotional effect. On the piano, each hand may
produce a melody independently, producing either monophonic or polyphonic sounds. Polyphony consists of two or more lines of melody played simultaneously, whereas
monophony consists of an isolated melody played for a length of time. While automatic
monophonic music transcription is considered a solved problem [2], polyphonic music
continues to be difficult for both human experts and proposed AMT approaches.
Polyphony is characterized by concurrent note overlap in the time domain, which
increases complexity in the frequency domain and the overall audio signal [3]. To
illustrate monophonic and polyphonic music complexity, Figures 1.1, 1.2, and 1.3 provide
comparison between an audio spectrogram and the ground truth transcription for a single
note, the C7 chord on C2, and forty-six seconds of classical music, respectively.
Figures 1.1 and 1.2 represent monophonic pieces of audio, where the note and chord
signals are uninterrupted, yet illustrate the string resonance, harmonics, and noise that
occur even when playing a single note. When visually compared to Figure 1.3, which represents
a polyphonic signal, the increased complexity is apparent. Thus, polyphonic AMT
presents a particularly difficult problem due to audio input signal variability, interaction
between concurrently sounding notes, and the harmonics that are produced by an
instrument such as a piano [4].
Figure 1.1: Note: CQT and Label Comparison
Figure 1.2: Chord: CQT and Label Comparison
Figure 1.3: Music: CQT and Label Comparison
2 Thesis Structure
The goal of this thesis is to develop an understanding of, and improve upon, neural network
models used in automatic transcription of piano music. Improvement is measured in terms
of transcription accuracy which is produced by a neural network and compared to the
ground truth transcription. The networks discussed in this thesis take a spectrogram as
input and produce a binary representation of active notes at a given time. Figure 1.3 shows
examples of both the spectrogram input and expected neural network output for a piece of
polyphonic piano music. At a high level, accuracy of the output is determined by
comparing it with the ground truth to determine accurate and inaccurate notes at each time step, otherwise referred to as a frame; this is discussed in further detail in Chapter 3. The
neural networks developed in this thesis surpassed the state-of-the-art accuracy score of 0.7476, with the best of these networks reporting a score of 0.8057, a 7.77% relative improvement.
The following chapters outline the development, investigation, and iteration process of the
AMT neural network which outperformed state-of-the-art results. Chapter 2 discusses
background research into AMT neural network structures and state of the art techniques.
Based on that research, Chapter 3 develops a network model, selects an appropriate
dataset, and outlines methods to evaluate neural network performance. Initial investigation
into the effects of dataset partitions and network structure follows in Chapter 4, serving as a basis for model improvement. In Chapter 5, augmentations to the network input and structure are explored, resulting in transcriptions that surpass state-of-the-art accuracy. Finally, the
thesis is concluded with a comparison of network results with reported state of the art
accuracy scores and a discussion of avenues for future work in AMT.
CHAPTER II
STATE OF THE ART IN POLYPHONIC PIANO MUSIC TRANSCRIPTION
Recent approaches to polyphonic AMT systems are composed of two components: an
initial Multipitch Estimation (MPE) stage which produces a non-binary two-dimensional
representation, referred to as a posteriorgram, that represents the active pitch probabilities
for each frame in a signal; and a Note Tracking (NT) stage that refines the MPE output,
acting as a correction step [1]. This two-stage strategy is analogous to speech recognition
methods with MPE strategies representing acoustic models and the NT stage as a Music
Language Model (MLM) which applies structural regularity to the MPE output [3].
The purpose of the MPE stage is to extract features from input data and produce pitch
probability on a per frame basis. There are a variety of common MPE strategies which
take a time-frequency representation, such as a spectrogram, as input. These include
Non-negative Matrix Factorization (NMF) [5], Probabilistic Latent Component Analysis
(PLCA) [6], and discriminative approaches which aim to classify features directly from
raw data or low level representations of audio signals. Recently neural networks have
shown excellent results in both PLCA and discriminative approaches to MPE [3][4][7].
Note tracking is a post-processing step which binarizes the posteriorgram provided by the
MPE stage. This is typically done by applying a threshold to the pitch activations, where
pitch probabilities less than the threshold are considered to be silence, and all others
assumed to be active. The result of this process can be considered a frame level
transcription, with the drawback of sensitivity to the MPE stage, where false positives and over-segmentation of long note events diminish transcription quality [1]. Other methods involve a rule-based system where notes shorter than a minimum length are pruned and small gaps between consecutive notes are filled; such rules must be crafted by hand, which is undesirable [8]. Similar to MPE stage implementations, neural network
based strategies for note tracking are providing improved transcription accuracy over
previous thresholding and rule-based methods [1][3][4].
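For reference, threshold-based note tracking amounts to a single comparison per frame and pitch; a minimal Python sketch follows (the 0.5 threshold is illustrative and not taken from any of the cited papers):

import numpy as np

def threshold_note_tracking(posteriorgram, threshold=0.5):
    # posteriorgram: array of shape (frames, 88) holding pitch activation
    # probabilities from the MPE stage; values at or above the threshold
    # become active notes, all others are treated as silence
    return (posteriorgram >= threshold).astype(np.uint8)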
Neural network based approaches to AMT are varied, but still retain the two-stage MPE
and NT structure. In 2016 Sigtia et al. proposed an end to end neural network with the
MPE and NT stages connected sequentially and trained in unison. Wang et al. furthered
this process with the addition of a note onset detection network in 2018. That same year
Liu et al. proposed an alternative network model, and Valero-Mas et al. investigated
several classifier and neural network based note tracking methods. In the following
sections, each of these approaches is discussed in turn.
1 Sigtia et al. 2016
Sigtia et al. implemented a two-stage Hidden Markov Model (HMM), a stochastic
statistical model where event probabilities are dependent on unobserved previous states,
for the AMT network which combined both the MPE and NT stages into a single
end-to-end solution. The approach, which continued Sigtia’s previous work in [9] and
[10], utilized a Convolutional Neural Network (CNN) and Recurrent Neural Network
(RNN) as the MPE and NT stages respectively, to improve transcription accuracy. A
visualization of the network structure is provided in Figure 2.1.
A CNN, used in the acoustic model, is a specialized neural network composed of
convolutional layers followed by pooling layers. In a convolutional layer, a set of weights is defined, referred to as a convolutional kernel, and multiplied by a region of input data
in a grid pattern to produce a feature map which preserves the spatial structure of the input
data. A subsequent pooling layer simplifies the feature map, reducing dimensionality by
Figure 2.1: Network Structure: Sigtia et al. 2016
taking values such as regional min, max, and average. The resulting network provides an
effective means of describing changes in local regions of the input data [7].
The RNN, used by Sigtia for note tracking, is a variant of standard feed-forward networks
designed to process sequential data. In addition to the vector input and output, the
recurrent network retains input and output states. These states provide a form of memory
allowing the network to apply previously learned information to each element as the
sequence is processed [11]. The memory provided by recurrent neural networks is
beneficial to improving transcription accuracy due to the highly correlated nature of
pitches in polyphonic music (harmonics, chords) [4].
To produce a binarized transcription, Sigtia et al. utilize a high dimensional hashed beam
search, a heuristic algorithm that traverses a limited number of promising nodes within a
graph, to select note combinations and sequences with the highest probability. This is in
contrast to thresholding methods which consider any note above a specified probability to
be active. The reported F-Measure of 0.7476 for the frame-based analysis of the end-to-end model outperformed existing PLCA and NMF models by 4%-10%, and represented an absolute improvement of 5% over existing neural network models. Sigtia et al.'s work in
2016 continues to be a basis and reference for current neural network AMT models.
2 Wang et al. 2018
Wang et al. augmented the network described in Sigtia et al.'s 2016 work with the addition of note onset detection information feeding into the MPE CNN. Onset detection
is performed by a CNN with the same structure as the MPE stage, which was augmented
to include additional convolutional layers, with the resulting high level network
architecture shown in Figure 2.2.
Figure 2.2: Network Structure: Wang et al. 2018
In addition to the onset detection network, the RNN in the NT stage was substituted with a Long Short-Term Memory (LSTM) layer. An LSTM cell is an augmented RNN unit which solves the vanishing and exploding gradient problems with the inclusion of a forget gate.
Vanishing and exploding gradients inhibit learning within RNNs and are caused by
previously learned dependencies being held too long. The introduction of the forget gate
allows the LSTM to learn what information to retain through memory during each time
step [12]. Finally, post-processing with a local beam search was compared against global
beam search and thresholding methods for transcription accuracy. Both beam search
binarization methods provided negligible improvement over thresholding with a best
reported F-Measure of 0.7451 for acoustic piano transcription [3].
3 Liu et al. 2018
In an alternate AMT model, Liu et al. proposed a two-channel CNN processing pitch estimation and note onset detection in parallel, as shown in Figure 2.3. Each CNN was specialized to its
purpose with the use of differing convolutional kernel and max pooling region sizes.
Figure 2.3: Network Structure: Liu et al. 2018
The outputs of these two networks were combined in the note tracking stage, which
differed from Sigtia and Wang, with the use of a Multilayer Perceptron (MLP) network rather than a type of RNN. An MLP network is a feed-forward network with multiple
internal layers; this type of network differs from RNNs as it does not maintain any
knowledge of previous inputs. The output of the MLP based note tracking stage was
processed with thresholding resulting in a reported F-Measure of 0.6502 for frame based
transcription of acoustic piano recordings [7].
4 Valero-Mas et al. 2018
To investigate note tracking methods, Valero-Mas et al. proposed a network model where
the acoustic model output consisted of three types. Similar to the model introduced by Liu et al., the acoustic model processed note onset and MPE in parallel, but also added a preliminary binarization of the MPE output.
Figure 2.4: Network Structure: Valero-Mas et al. 2018
This binarized MPE was produced using thresholding, with additional filter-based post-processing to reduce note segmentation and spurious notes. This variation of the acoustic
model was kept consistent throughout the testing of various NT stages using classifiers,
Support Vector Machines (SVM), and MLP networks. Direct comparison of transcription
F-Measure is conducted against an implementation of Sigtia et al.’s 2016 network which
achieved an F-Measure of 0.65 in testing. Of the note tracking strategies tested,
transcription results from the MLP based NT stage reported the highest F-Measure of 0.70 for frame-based evaluations, showing an improvement of 5% when perfectly accurate note onset information is provided to the NT stage.
5 Chapter Summary
The state-of-the-art methods discussed in this chapter share a basic two-stage structure
consisting of an MPE and NT stage respectively. Differences in the MPE stage of each
network include convolutional network organization, and the addition of note onset
information. In these state-of-the-art networks, a variety of NT stages are investigated
including RNN, LSTM, MLP, and SVM implementations. Of these networks, Sigtia et
al.’s has the simplest design and will be used as a basis for initial network development in
Chapter 3.
CHAPTER III
DEVELOPING AN AMT NETWORK
To cultivate an understanding of neural network based AMT, a basic model is considered
for initial investigation. The following sections cover the three high level requirements to
investigate and improve AMT networks. First, a network model is developed with
guidance from state-of-the-art methods. Next, an appropriate dataset is selected, with
which to train and test the network model. Finally, evaluation criteria are discussed by
which network transcription accuracy is measured.
1 AMT Network Model
The initial model for AMT investigation was implemented as a two-stage HMM, which
includes the MPE and NT stages in series, based on Sigtia et al.’s work in 2016. This
model was chosen for its simplicity of understanding and implementation, when
compared to deeper models requiring a separate note onset detection model. The MPE
stage, or acoustic model, consists of a single convolutional block taking a sequence of one
dimensional feature vectors as input. A convolutional block refers to a group of
computations, which includes a convolution, followed by a max pooling, and finally
dropout. This convolutional block is followed by flattening and batch normalization layers.
Each vector in the input sequence, discussed further in the following section, is a single
frame of preprocessed audio in the form of a spectrogram that has a frame length of 264.
Figure 3.1: Basic LSTM AMT Network Structure
The convolution portion of the convolutional block, as discussed in Chapter 2 Section 1, is
used to spatially map input features to detect and separate individual pitches from the
spectrogram. Sixty four filters with a convolutional kernel of length three, with a stride of
one were used to detect edges within the data. Max pooling was used by Wang et al. to
reduce the convolutional output dimensionality by selecting the maximum value in a
region. For this network a max pooling size of two was selected, reducing the convolution
output size by half [3]. Finishing the convolutional block, a process known as Dropout is
used as an effective measure to prevent overfitting, which occurs when the network output
corresponds too closely to training data. This renders the neural network unable to
accurately predict unseen data in the testing sets [13]. Dropout prevents overfitting of the
network by removing hidden nodes based on an implementation-specific probability and is
commonly used in state-of-the-art methods [1][3][4]. A dropout probability of 0.25 was
used within the convolutional block. The final two layers in the acoustic model consist of
a flattening stage, which restructures the convolutional output matrix into a one
dimensional vector, and a batch normalization layer added to reduce training time [14].
The resulting output of this MPE stage is a pitch activation probability matrix, or
posteriorgram, representing a sequence of activation probabilities for each of the 88 piano
pitches. This is directly input into the NT stage which acts as a smoothing function [4],
and is comprised of a single fully connected LSTM with a dropout value of 0.1 followed
by a dense layer. A fully connected, or dense, layer is one where each node in a layer has
connections with every node in the preceding layer. A diagram of the basic AMT network
showing all of the component layers is provided in Figure 3.1.
This network model, and all subsequent models discussed in future chapters, were
implemented with the Keras library, which is a high-level neural network API with
support for conducting network computations on a Graphics Processing Unit (GPU) [15].
The binary cross-entropy loss function, provided by Keras, was determined to be the most
appropriate as several output nodes can be active at a time. The Adam optimizer was
chosen to increase training performance [16]. Network models were trained on a Windows
10 machine with an NVIDIA GeForce GTX 980 Ti graphics card containing six gigabytes
of GDDR5 video memory.
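As a concrete illustration, a minimal Keras sketch of the network just described follows. The filter count, kernel length, pooling size, dropout values, loss, and optimizer are taken from the text; the activation functions and the LSTM width of 88 are assumptions.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Conv1D, MaxPooling1D, Dropout, Flatten,
                                     BatchNormalization, LSTM, Dense,
                                     TimeDistributed)

N_BINS, N_KEYS = 264, 88  # CQT frame length and piano pitch count

model = Sequential([
    # MPE stage: one convolutional block applied to each frame in the sequence
    TimeDistributed(Conv1D(64, kernel_size=3, strides=1, activation='relu'),
                    input_shape=(None, N_BINS, 1)),  # (sequence, frame, channel)
    TimeDistributed(MaxPooling1D(pool_size=2)),      # halves the convolution output
    TimeDistributed(Dropout(0.25)),                  # dropout probability from the text
    TimeDistributed(Flatten()),
    BatchNormalization(),
    # NT stage: fully connected LSTM with dropout 0.1, followed by a dense layer
    LSTM(N_KEYS, return_sequences=True, dropout=0.1),
    Dense(N_KEYS, activation='sigmoid'),             # 88 pitch activation probabilities
])
model.compile(optimizer='adam', loss='binary_crossentropy')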
2 The MAPS Dataset
To train and test the proposed network, the MIDI Aligned Piano Sounds (MAPS) dataset
[17] was selected for its completeness and diversity. While other sources of MIDI-aligned
piano audio such as LabROSA [18] or Batik [19] are available, they possess limited scope
containing 29 and 13 recordings respectively. Additionally, the MAPS dataset is the basis
for many reported state-of-the-art transcription results [1][3][4][7].
The MAPS dataset contains approximately 40 gigabytes, i.e. about 65 hours, of piano
recordings. This includes audio recordings in .wav format, Musical Instrument Digital
Interface (MIDI) ground truth files, and text file representations of the MIDI events. The
dataset is divided into nine coded subsets according to the instrument and location used to
record the audio as shown in Table 3.1. Each subset is further divided into groups with
each containing a comprehensive set of isolated notes (ISOL), random chords (RAND),
common chords (UCHO), and a limited selection of complete classical music
compositions (MUS).
Table 3.1: MAPS: Instruments and Recording Conditions
Subset Recording Conditions Real Instrument or Software
StbgTGd2 Software default The Grand 2 (Steinberg)
AkPnBsdf Church Acoustic Piano (Native Instruments)
AkPnBcht Concert Hall Acoustic Piano (Native Instruments)
AkPnCGdD Studio Acoustic Piano (Native Instruments)
AkPnStgb Jazz Club Acoustic Piano (Native Instruments)
SptkBGAm “Ambient” The Black Grand (Sampletekk)
SptkBGCl “Close” The Black Grand (Sampletekk)
ENSTDkAm “Ambient” Real Piano (Disklavier)
ENSTDkCl “Close” Real Piano (Disklavier)
Of these sets, ISOL is the most diverse, containing single, long, staccato, and repeated notes. The ISOL recordings also contain sets for chromatic scales and trills. Further, all samples in ISOL contain variations for loudness, P (piano), M (mezzo-forte), and F (forte), and
usage of the damper pedal, which sustains notes for longer than the key is held and also
increases harmonics within the instrument. RAND provides procedurally generated sets of
random chords containing 2-7 notes without musical knowledge applied, and provides
varying loudness and damper pedal application. The UCHO set provides usual chords
from Western music such as classical or jazz in the same manner as RAND. Finally, the
MUS set provides 270 pieces of classical and traditional music generated from standard
MIDI files available from Goto et al. under the Creative Commons license [20]. Each subset listed in Table 3.1 contains 30 randomly chosen pieces of music [17].
2.1 Preprocessing
To prepare the audio file for input into the neural network it must first be converted into a
spectrogram representation. The Constant Q Transform (CQT) has been shown to be
fundamentally better suited as a time-frequency representation because the frequencies of musical notes are geometrically spaced, making pitch linear on a logarithmic frequency axis [21]. Additionally, the CQT is frequently chosen in state
of the art methods due to its lower dimensional representation when compared to
Short-Time Fourier Transform (STFT) thus reducing the number of input parameters into
the network [3][4][7].
The librosa library is a Python API which provides audio processing functionality and was
used to compute the CQT for each audio file in the dataset [22]. Each file was sampled at
22.05 kHz and the CQT was computed for each key frequency from A0 to C8, 27.5 Hz to
4186.01 Hz, at 3 bins per key and a hop size of 512 samples. This resulted in a
264-dimensional input vector sequence with a period of 23.2 ms. The CQT can thus be represented as a matrix with shape (264, t), where t is the total number of samples in the audio input divided by the hop size; the transpose of the CQT matrix is then taken and saved for input into the network.
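A sketch of this preprocessing using librosa follows; the parameters mirror the text, while the decibel scaling is an assumption since the text does not state how magnitudes were scaled.

import numpy as np
import librosa

def compute_cqt(path, hop_length=512):
    y, sr = librosa.load(path, sr=22050)
    # 88 keys x 3 bins per key = 264 bins from A0 (27.5 Hz), i.e. 36 bins per octave
    cqt = librosa.cqt(y, sr=sr, hop_length=hop_length,
                      fmin=librosa.note_to_hz('A0'),
                      n_bins=264, bins_per_octave=36)
    db = librosa.amplitude_to_db(np.abs(cqt), ref=np.max)  # log-magnitude in dB
    return db.T  # transpose from (264, t) to (t, 264) for network input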
The ground truth labels also required preprocessing for compatibility with the neural
network. A custom function was written to map the MIDI events from a text file, with
format exemplified in Table 3.2, to a label matrix with shape (t, 88) representing the state of the 88 keys. A process is outlined by the algorithm in Figure 3.2 to convert the event file into
Table 3.2: MIDI Event File Format
Onset Offset MIDI Pitch
0.500004 2.500006 72
0.500004 2.500006 73
0.500004 3.500006 86
Onset and Offset values are represented in seconds. MIDI pitches range from 21 to 108 for piano.
a label representation.
Figure 3.2: Label Generation Algorithm
function GENERATELABELS(audioFile, eventFile, hopSize)
    numSamples ← GETSAMPLENUMBER(audioFile)
    timeStep ← 1 / GETSAMPLERATE(audioFile)
    labels ← zeros(numSamples, 88)
    for all line in eventFile do
        start ← line[0] / timeStep
        end ← line[1] / timeStep
        note ← line[2] − 21
        labels[start : end, note] ← 1
    end for
    labels ← DOWNSAMPLE(labels, hopSize)
    return labels
end function
First, the corresponding audio file is loaded and sampled at 22.05 kHz to retrieve the total
number of samples before downsampling. An empty matrix of shape (t, 88) is created to
represent the labels where t is the number of frames. Each line of the MIDI event file is
then read and all samples between the onset and offset times, inclusive, are set to active.
Finally, the label matrix is downsampled with a hop size of 512, rendering the number of
frames equivalent to that in the corresponding CQT. Examples of the CQT and label
output are shown in Figures 1.1, 1.2, and 1.3.
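A hedged Python rendering of Figure 3.2 follows; the structure matches the algorithm, while the text-file parsing details (whitespace-separated columns, one header row) are assumptions about the MAPS event files.

import numpy as np
import librosa

def generate_labels(audio_file, event_file, hop_size=512):
    y, sr = librosa.load(audio_file, sr=22050)   # sample to get the frame count
    labels = np.zeros((len(y), 88), dtype=np.uint8)
    with open(event_file) as f:
        next(f)  # assumed header row: OnsetTime OffsetTime MidiPitch
        for line in f:
            onset, offset, pitch = line.split()[:3]
            start = int(float(onset) * sr)       # seconds to sample index
            end = int(float(offset) * sr)
            note = int(pitch) - 21               # MIDI 21..108 to key index 0..87
            labels[start:end + 1, note] = 1      # active from onset to offset inclusive
    return labels[::hop_size]                    # downsample to one row per CQT frame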
3 Evaluation
Results produced by the network are evaluated for Accuracy, Precision, Recall, and
F-Measure with the latter being the preferred measure of transcription correctness [1].
F-Measure provides a balanced metric between Precision, a measure of how many
instances are classified correctly, and Recall, which is a measure of how many instances are missed. Precision and Recall provide better insight into model suitability than Accuracy for datasets with imbalanced classes, such as audio signals.
Accuracy = N_TP / (N_TP + N_FP + N_FN)

Precision = N_TP / (N_TP + N_FP)

Recall = N_TP / (N_TP + N_FN)

F-Measure = (2 · Precision · Recall) / (Precision + Recall)
where N_TP, N_FP, and N_FN are the number of true positives, false positives, and false negatives,
respectively. The Scikit-Learn library was used in all cases to calculate the evaluation
metrics [23].
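A minimal sketch of this frame-level evaluation follows, assuming binarized (0/1) label and prediction matrices of shape (frames, 88); the Accuracy term follows the definition above rather than Scikit-Learn's accuracy_score.

import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

def evaluate_frames(y_true, y_pred):
    yt, yp = y_true.ravel(), y_pred.ravel()
    n_tp = np.sum((yt == 1) & (yp == 1))
    n_fp = np.sum((yt == 0) & (yp == 1))
    n_fn = np.sum((yt == 1) & (yp == 0))
    return {
        'accuracy': n_tp / (n_tp + n_fp + n_fn),  # definition given above
        'precision': precision_score(yt, yp),
        'recall': recall_score(yt, yp),
        'f_measure': f1_score(yt, yp),
    }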
4 Chapter Summary
In this chapter, a basic AMT network was developed and discussed based on the two-stage
models used by state-of-the-art methods. The MAPS dataset was selected for training and
testing due to its robust selection of piano recordings. Finally, evaluation metrics were
discussed and selected for comparison against state-of-the-art results. With a basic AMT
network implementation, dataset, and evaluation metrics chosen, initial AMT investigation is conducted in Chapter 4.
CHAPTER IV
PRELIMINARY AMT NETWORK INVESTIGATION
With the network structure, dataset, and evaluation metrics determined in Chapter 3, the
effects of data partitions, training iterations (epochs), and network structure can be
explored. A data partition determines which portions of the dataset constitute the training,
validation, and testing data for the neural network experiment. Three data partitions,
detailed in following sections, were developed for investigation and are outlined in
Table 4.1.
Table 4.1: Investigation Data Partitions
Name                 MAPS Subsets          Training Audio Type      Testing Audio Type
Notes & Chords       SptkBGCl, StbgTGd2    ISOL, RAND, UCHO         MUS
Limited Set (LSET)   All                   RAND, UCHO, MUS          MUS
Complete Set (CSET)  All                   ISOL, RAND, UCHO, MUS    MUS
The basic AMT network described in Section 3.1 requires a fixed sequence of frames as
input for training and testing. A sequence length of 100 was selected, representing 2.36
seconds of audio, to capture music language features. For each data partition, a single
feature and label matrix pair was constructed by concatenating the respective input and
output data together. These matrices were then divided into sequences of length 100 for
input into the neural network. The following sections explore the effects on transcription
accuracy when training on only notes and chords, the addition of music and increased
training iterations, and the effects of network ensembling.
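A sketch of the sequence slicing just described follows; truncating the final partial sequence is an assumption, since the text does not say how leftover frames were handled.

import numpy as np

def make_sequences(features, labels, seq_len=100):
    # features: (t, 264) CQT frames; labels: (t, 88) piano-roll ground truth
    n = (features.shape[0] // seq_len) * seq_len  # drop any trailing partial sequence
    x = features[:n].reshape(-1, seq_len, features.shape[1])
    y = labels[:n].reshape(-1, seq_len, labels.shape[1])
    return x, y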
1 Notes and Chords
To test whether a network could successfully be trained on only notes and chords, a data
partition by the same name was compiled. Data selection was limited to two of the nine
MAPS subsets, SptkBGCl and StbgTGd2, to facilitate short training and testing times.
From these two groups, the training set included all the files from the ISOL, UCHO, and
RAND audio files. Construction of the validation and testing sets was completed using a
permutation of the list of MUS files in the chosen MAPS subsets. The first ten music
pieces were selected as the validation set, and the next ten were selected as the testing set.
Figure 4.1: Notes and Chords Ground Truth: First 2000 frames
The AMT network was trained for five iterations with a batch size of thirty-two, which required 368 seconds, or 6.1 minutes, to complete. Figure 4.1 shows the label values for
the first two thousand frames of the output, with Figure 4.2 visualizing the transcription
produced by the network. A brief visual inspection of Figure 4.2 reveals that the network
did not perform adequately, with very little resemblance to Figure 4.1.
Prediction metrics yielded an F-Measure slightly below 3% for this subset. The errors in
Figure 4.2: Notes and Chords Transcription: First 2000 frames
Figure 4.3: Notes and Chords Transcription Error: First 2000 frames
the transcription are visualized in Figure 4.3, with false negatives represented in blue, and
false positives in red. Table 4.2 records the prediction metrics calculated against the entire
output of the testing set. Several modifications to the network were made to improve these
Table 4.2: Evaluation Metrics: Notes and Chords
Iteration Accuracy Precision Recall F-Measure
5 0.0967 0.2022 0.0749 0.1015
metrics: the number and type of layers, iterations, and batch size were altered, with negligible improvement to transcription accuracy, indicating that changes to the dataset were required.
2 Music and Chords
The preliminary training sets were constructed on the premise that a network could be
successfully trained to transcribe music using only isolated notes, and several variations of
individual chords. This assumption was found to be false as melodies contained in music
are composed of sequences of chords and notes. As discussed in Chapter 1, these
sequences cause additional harmonics in the instrument resulting in significantly different
power spectra, as evidenced by comparing Figures 4.1 and 4.2. Additionally, musical
language information such as key, rhythm, and tempo are unavailable when training only
on notes and chords. This issue of polyphonic harmonics and addition of music language
information is easily resolved by training the network on files from the MUS set [4].
The datasets were reconstructed with the training set consisting of the complete UCHO
and RAND sets, and MUS files that were not included in the validation or testing sets. The
ISOL set was excluded from the training set as preliminary experiments, not discussed in
this thesis, demonstrated network improvement with its exclusion. This recompiled set
was referred to as the Limited Training Set (LSET) due to its exclusion of the ISOL set.
The network was trained using LSET twice for five and fifteen iterations. Each round of
training was conducted with a batch size of thirty-two, taking around 280 seconds per
iteration.
Figure 4.4 shows the label values for the first two thousand frames of the testing set
output. The transcription shown in Figure 4.5 bears a much stronger resemblance to the
expected output indicating significant improvement in network performance with the
addition of MUS data to the training set. The F-Measure for this slice of the output was
calculated to be 0.7770, which is twenty-six times greater than what was found on a similar slice of the previous network's output. Figure 4.6 represents the transcription errors and
illustrates that the network continues to have difficulty recognizing some notes. Further
inspection of the false positives shows that these errors commonly occur after an actual
note has played which indicates poor note offset detection.
Figure 4.4: LSET Ground Truth: First 2000 Frames
Table 4.3 compares the prediction metrics for the entire network’s output for both
instances of the AMT network trained on the LSET dataset. Increasing the number of
epochs yielded an increase of 0.01947 in the F-Measure, equaling a 2% increase in
accuracy. Training on the LSET dataset improved the F-Measure of the network to 6.75 times that of the initial network, demonstrating that training the network on actual music is
Figure 4.5: LSET Transcription: First 2000 Frames
Figure 4.6: LSET Transcription Error: First 2000 Frames
required to get reasonable transcription accuracy.
Table 4.3: Evaluation Metrics: LSET
Iteration Accuracy Precision Recall F-Measure
5 0.1976 0.7703 0.6429 0.6662
15 0.2284 0.7511 0.6899 0.6857
3 Network Ensembling
An ensemble of neural networks has been shown to increase general network performance
by reducing the impact of local minima caused by weight initialization in similar
networks. Network improvement is achieved by reconciling the output of two or more
networks via methods such as voting or averaging [24]. To create a network ensemble, two
or more trained models are loaded and supplied input in parallel, with the output of each
model being fed into a merging layer which provides the ensemble output. A high level
representation of this idea is presented in Figure 4.7. This idea was extended to different
AMT models trained for a variable number of epochs on differing data sets.
Figure 4.7: Network Ensembling
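A hedged Keras sketch of this construction follows; the model file names are placeholders, and the 0.4 threshold matches the value used in the experiments below.

import numpy as np
from tensorflow.keras.models import Model, load_model
from tensorflow.keras.layers import Input, Average

def build_ensemble(model_paths):
    members = [load_model(p) for p in model_paths]
    inp = Input(shape=members[0].input_shape[1:])     # shared input tensor
    merged = Average()([m(inp) for m in members])     # average the posteriorgrams
    return Model(inputs=inp, outputs=merged)

# usage sketch: average two trained models and binarize at the 0.4 threshold
# ensemble = build_ensemble(['1La_5It_Lset.h5', '3La_5It_Lset.h5'])
# transcription = (ensemble.predict(x_test) >= 0.4).astype(np.uint8)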
The time distributed polyphonic AMT network described in Figure 3.1 was selected for
this experiment. Based on the networks by Wang et al. and Liu et al., a deeper MPE stage
consisting of multiple convolutional blocks is investigated for improvement in
transcription accuracy [3][7]. The basic AMT network was copied and extended, increasing the number of convolutional blocks within the MPE stage. In addition to
increasing the depth of the network model, the LSET data partition was augmented with
the ISOL files completing the MAPS dataset. This data partition is referred to as the
Complete Set (CSET). Eight network models are described in Table 4.4 for use in the
network ensemble with networks results enumerated in Table 4.5.
Table 4.4: Single Network Model Descriptions
Name            Convolutional Blocks    Iterations    Data Set
1La 5It Lset 1 5 LSET
1La 5It Cset 1 5 CSET
1La 10It Lset 1 10 LSET
1La 10It Cset 1 10 CSET
3La 5It Lset 3 5 LSET
3La 5It Cset 3 5 CSET
3La 10It Lset 3 10 LSET
3La 10It Cset 3 10 CSET
Table 4.5: Single Network Model Results
Name Precision Recall F-Measure
1La 5It Lset 0.7709 0.6404 0.6646
1La 5It Cset 0.7564 0.6331 0.6482
1La 10It Lset 0.7652 0.6591 0.6743
1La 10It Cset 0.7311 0.6855 0.6689
3La 5It Lset 0.7061 0.7336 0.6871
3La 5It Cset 0.7247 0.6104 0.6241
3La 10It Lset 0.7005 0.7055 0.6714
3La 10It Cset 0.7098 0.6568 0.6472
Of the eight network models tested, the 3La 5It Lset network, which consists of three
convolutional layers trained for five epochs on the LSET dataset, performed the best with
an F-Measure of 0.6871. Figure 4.8 illustrates the transcription error in the first 2000
frames of the testing set for this network.
Figure 4.8: 3La 5It Lset Transcription Error: First 2000 Frames
To merge the individual network output in the AMT ensemble, an averaging layer was
used. A post-processing threshold of 0.4, used in previous tests, is maintained to produce a
final transcription. Table 4.6 describes four ensemble networks with their component
models, which were evaluated for general network improvement.
Table 4.6: Ensemble Network Contents
Name Models
Ens 5It Lset 1La 5It Lset 3La 5It Lset – –
Ens 5It Cset 1La 5It Cset 3La 5It Cset – –
Ens 5It Mset 1La 5It Lset 1La 5It Cset 3La 5It Lset 3La 5It Cset
Ens Miter Lset 1La 5It Lset 3La 5It Lset 1La 10It Lset 3La 10It Lset
Table 4.7 compares the output of the four ensemble networks, which, excluding the CSET-only model, performed better than individual networks in terms of F-Measure. The Ens 5It Cset results can be dismissed, as the component models generally performed the worst out of all individual networks. The Ens Miter Lset ensemble, consisting of varied
Table 4.7: Ensemble Network Results
Name Precision Recall F-Measure
Ens 5It Lset 0.7711 0.6996 0.7022
Ens 5It Cset 0.7746 0.6461 0.6675
Ens 5It Mset 0.7902 0.6831 0.6998
Ens Miter Lset 0.7857 0.7178 0.7230
models and training times on the LSET dataset performed the best with an improvement
of 0.0359 in F-Measure when compared to the best single network. Figure 4.9 illustrates
the transcription error for the first 2000 frames of the Ens Miter Lset ensemble. The
ensembled networks consistently scored better in precision by reducing the number of
false positives (red), but suffered a relatively smaller decrease in recall as false negatives (blue) increased, resulting in a greater number of missed notes.
Figure 4.9: Ens Miter Lset: First 2000 Frames
4 Chapter Summary
Preliminary investigation into AMT revealed several key concepts to consider as the
neural networks are improved further. First, training the network on music allows for the
learning of musical structure and harmonic patterns, which is required to achieve adequate
transcription accuracy. Next, the addition of multiple convolutional blocks within the
MPE stage improved transcription accuracy scores for individual networks. Finally, the
application of AMT network ensembling further improved final transcription accuracy and
should be considered in future experiments.
CHAPTER V
IMPROVING AMT NETWORK MODELS
Evaluation of the transcription errors generated by the preliminary networks demonstrates
there is considerable room for improvement in transcription accuracy. Note onset and
offset detection remains an issue, but initial results are promising. A frame context
window is an improvement to the preliminary networks that augments the frame-by-frame input to a two-dimensional matrix [3]. Figure 5.1 illustrates the enhancement, where a window of 2k + 1 frames, centered on time t, is used to determine the label for time t. A k value of four was chosen for the remaining experiments, resulting in an input matrix with a shape of 9 x 264 at each time slice.
Figure 5.1: Frame Context Window
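A sketch of the context window construction follows; padding the edges of the piece with -80 dB "silence" frames is an assumption, made so that every frame, including the first and last, receives a full window.

import numpy as np

def context_windows(features, k=4):
    # features: (t, 264) CQT frames; output: (t, 2k+1, 264) windows centered on t
    padded = np.pad(features, ((k, k), (0, 0)),
                    mode='constant', constant_values=-80.0)
    return np.stack([padded[t:t + 2 * k + 1]
                     for t in range(features.shape[0])])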
To accommodate the new two-dimensional input shape and the increased network
parameters, multiple modifications were made to the existing network models. Altering
the convolutional layers from one-dimensional to two-dimensional input represents the
primary change, with the number of filters on each convolutional layer being reduced to
enable the network to fit within GPU memory. The two-dimensional input’s effect on
Figure 5.2: Two and Three Convolutional Layer Networks
transcription accuracy was evaluated by implementing two distinct networks with two and
three convolutional layers respectively, which are outlined in Figure 5.2. The MPE stage
in each network is represented in the subnetwork of convolutional layers through the batch
normalization layer, with note tracking represented by the LSTM and dense output layers.
Of note in the three convolutional layer network is the lack of a max pooling layer after
the first convolutional layer, and the second max pooling layer’s pool size is reduced from
(2,2) to (1,2). These differences ensure proper tensor shape as data is processed by the
network layers; this mitigates data loss by ensuring a many-to-one or one-to-one
relationship from the hidden units of the final convolutional layer to those in the LSTM
layer.
Figure 5.3: Per File Network Training
function TRAINNETWORK(dataset, epochs)
    files ← GETDATASETFILES(dataset)
    iteration ← 1
    while iteration ≤ epochs do
        SHUFFLE(files)
        for all file in files do
            data ← GETDATA(file)
            labels ← GETLABELS(file)
            KERASTRAIN(data, labels)
        end for
        iteration ← iteration + 1
    end while
end function
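A hedged Python rendering of Figure 5.3 follows; get_data and get_labels are hypothetical stand-ins for the preprocessing routines of Chapter 3, and the batch size of thirty-two matches the earlier experiments.

import random

def train_network(model, files, epochs, get_data, get_labels):
    # get_data/get_labels: hypothetical loaders returning padded sequence
    # batches of shape (n_seq, 100, ...) and (n_seq, 100, 88) respectively
    for _ in range(epochs):
        random.shuffle(files)                  # new file order each epoch
        for f in files:
            model.fit(get_data(f), get_labels(f),
                      batch_size=32, epochs=1, verbose=0)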
In addition to the frame context window, the network training and testing method was
improved for finer control. Whereas the preliminary networks were trained and tested by
concatenating the files of the respective datasets together before slicing into 100 sample
sequences, the algorithm in Figure 5.3 allows for per file control during training and
testing. This method requires that the entire audio sequence be padded with frames of -80dB,
representing silence, until it is divisible by the sequence length. The padding frames are
prevented from altering network training with the addition of a masking layer on the input,
which skips frames whose values all equal the masking value of -80dB.
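A sketch of the padding and masking scheme follows; the padding routine is an assumption consistent with the description, and Masking is the standard Keras layer serving the role described in the text.

import numpy as np
from tensorflow.keras.layers import Masking

PAD_VALUE = -80.0  # dB value representing silence

def pad_to_sequences(features, seq_len=100):
    remainder = features.shape[0] % seq_len
    if remainder:
        pad = np.full((seq_len - remainder, features.shape[1]), PAD_VALUE)
        features = np.vstack([features, pad])
    return features.reshape(-1, seq_len, features.shape[1])

# first network layer: skip any frame whose values all equal the padding value
mask = Masking(mask_value=PAD_VALUE)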
An evaluation dataset for the two and three convolutional layer networks was developed
by selecting the StbgTGd2 and ENSTDkCl subsets for digital and acoustic representations
respectively, following the data partitioning examples in Wang et al. [3]. Fifteen pieces of
music from each of these two subsets were randomly selected to represent the testing set,
with the remaining files added to the training set which consists of all the MUS data from
the other seven MAPS subsets. Comparison of the results in Tables 5.1 and 5.2 shows the
three convolutional layer network outperforming the two convolutional layer network with
F-Measures of 0.7210 and 0.7125 for digital and acoustic music, respectively. These
results are promising for an individual network as the metrics approach those of the
ensemble networks in Table 4.7.
Table 5.1: Context Window Results - Two Convolutional Layers
Iteration Accuracy Precision Recall F-Measure
Digital
10 0.1787 0.7101 0.6223 0.6247
20 0.2109 0.7187 0.7017 0.6758
30 0.2158 0.7148 0.7184 0.6835
40 0.2328 0.7302 0.7156 0.6904
50 0.1780 0.6757 0.7393 0.6710
Acoustic
10 0.1535 0.6971 0.6021 0.6099
20 0.2088 0.7287 0.7081 0.6866
30 0.2005 0.7069 0.7334 0.6889
40 0.2281 0.7310 0.7234 0.6967
50 0.1796 0.6765 0.7434 0.6765
Table 5.2: Context Window Results - Three Convolutional Layers
Iteration Accuracy Precision Recall F-Measure
Digital
10 0.1643 0.6931 0.7053 0.6612
20 0.2352 0.7479 0.6418 0.6560
30 0.2633 0.7388 0.7458 0.7135
40 0.3334 0.8070 0.6787 0.7104
50 0.2999 0.7614 0.7354 0.7210
Acoustic
10 0.1746 0.7205 0.6963 0.6775
20 0.1914 0.7410 0.6101 0.6357
30 0.2372 0.7413 0.7292 0.7064
40 0.2680 0.8011 0.6511 0.6873
50 0.2610 0.7572 0.7262 0.7125
1 LSTM Network Performance
To rigorously test the three convolutional layer model, referred to as the LSTM network, five datasets were developed for final testing. Maintaining a similar standard to the initial
network test, the StbgTGd2 and ENSTDkCl subsets were used for digital and acoustic
testing. These two subsets share the same thirty pieces of music recorded on different
instruments and in different conditions. This condition was leveraged by selecting a
random subset of half the thirty pieces of music and including both the StbgTGd2 and ENSTDkCl recordings of each piece within the testing set, for a total testing set size of thirty files, or 11% of the total MAPS music corpus. The five test set partitions are described in
Table 5.3.
The files within the StbgTGd2 and ENSTDkCl subsets not included in the testing set were
combined with the music files from the remaining seven MAPS subsets to create the
training set. A validation set was not created for any dataset due to the limited number of
total music pieces available for training. Instead, each network was trained for 100 iterations without the use of early stopping. A snapshot was taken of the network model
Table 5.3: Test Set Partitions
Dataset 1
bor ps6 chpn-p19 deb clai deb menu liz et6
liz et trans5 mz 331 2 mz 331 3 mz 333 2 mz 545 3
schuim-1 scn15 11 scn16 3 scn16 4 ty mai
Dataset 2
alb se2 bk xmas5 chpn-p19 grieg butterfly liz rhap09
mz 331 3 mz 332 2 mz 333 2 mz 333 3 mz 545 3
mz 570 1 pathetique 1 schu 143 3 schuim-1 scn15 11
Dataset 3
alb se2 bk xmas4 bor ps6 deb clai liz et trans5
liz rhap09 mz 545 3 mz 570 1 pathetique 1 scn15 11
scn15 12 scn16 3 scn16 4 ty maerz ty mai
Dataset 4
alb se2 bk xmas1 bk xmas4 bor ps6 chpn-p19
deb clai deb menu liz et6 mz 332 2 mz 333 2
mz 333 3 mz 545 3 mz 570 1 schuim-1 ty maerz
Dataset 5
bk xmas1 bk xmas4 bor ps6 chpn-e01 deb menu
grieg butterfly liz et trans5 liz rhap09 mz 333 3 mz 545 3
pathetique 1 schuim-1 scn15 11 scn15 12 ty maerz
before training, and after every five iterations until training was completed. Transcriptions
were determined by a threshold of 0.5. Complete testing results for the LSTM network are
provided in Appendix A, with Table 5.4 summarizing the network iterations with the
highest F-Measure for each dataset.
Table 5.4: LSTM Network Result Summary
Dataset Iteration Accuracy Precision Recall F-Measure
Digital
1 100 0.3568 0.7797 0.7958 0.7628
2 60 0.3428 0.7702 0.7908 0.7556
3 100 0.3715 0.8027 0.6945 0.7195
4 90 0.3176 0.7693 0.8431 0.7793
5 85 0.3306 0.7672 0.7662 0.7427
Acoustic
1 90 0.2917 0.7341 0.7784 0.7275
2 60 0.2955 0.7458 0.7331 0.7113
3 95 0.2538 0.7044 0.7406 0.6935
4 90 0.3020 0.7531 0.8248 0.7614
5 85 0.3090 0.7423 0.7233 0.7060
2 BLSTM Network Performance
The final improvement investigated for single AMT neural networks is the
implementation of Bidirectional LSTM (BLSTM) layers within the Music Language
Model. BLSTMs are a form of Bidirectional Recurrent Neural Network (BRNN) shown to
improve speech recognition and machine translation tasks which are similar in nature to
automatic music transcription [25][26]. BRNNs allow networks to learn representations
from both the past and future of the sequence by implementing two RNN layers in parallel
where the first RNN processes the input sequence in order and the second RNN processes the input in reverse order, as shown in Figure 5.4.
Figure 5.4: Bidirectional Recurrent Neural Network
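As a concrete illustration, swapping the NT-stage LSTM for a BLSTM uses the Bidirectional wrapper provided by Keras, mentioned below; the layer width of 88 is an assumption carried over from the earlier sketches.

from tensorflow.keras.layers import LSTM, Bidirectional, Dense

# replaces the single forward LSTM in the Music Language Model; the wrapper
# runs one LSTM over the sequence in order and a second in reverse, then
# combines their outputs at each time step
blstm = Bidirectional(LSTM(88, return_sequences=True, dropout=0.1))
output_layer = Dense(88, activation='sigmoid')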
Forward propagation in BRNNs is conducted in two stages where the input is processed in
its entirety from the initial frame until the end of the sequence is reached, after which the
process is repeated in reverse starting at the end of the sequence and progressing towards
the beginning. After both stages are complete, the output is computed and
backpropagation is conducted [27]. The bidirectional implementation provided by the
Keras library was utilized to replace the LSTM in the Music Language Model with a
BLSTM. The resultant BLSTM network model was trained on the dataset partitions in
Table 5.3 with complete testing results available in Appendix B. Table 5.5 lists the model
iteration with the highest F-Measure for each dataset.
Table 5.5: BLSTM Network Result Summary
Dataset Iteration Accuracy Precision Recall F-Measure
Digital
1 100 0.4608 0.8248 0.8124 0.7990
2 80 0.4537 0.8110 0.8085 0.7901
3 95 0.4168 0.8022 0.7664 0.7626
4 100 0.3834 0.7946 0.8628 0.8055
5 95 0.3667 0.7813 0.7898 0.7635
Acoustic
1 95 0.3982 0.7891 0.7906 0.7670
2 75 0.3957 0.7820 0.7703 0.7537
3 70 0.3358 0.7621 0.7536 0.7338
4 100 0.3501 0.7678 0.8494 0.7838
5 95 0.3695 0.7598 0.7671 0.7404
Of particular note in Table 5.5 is that nearly all of the F-Measures reported for the digital and acoustic datasets outperformed the F-Measure of 0.7476 reported by Sigtia et al. [4].
Additionally, the acoustic F-Measures outperformed the 0.7451 on the ENSTDkCl subset
reported by Wang et al. [3].
3 Ensemble Performance
To achieve the highest possible frame-by-frame F-Measure, network ensembles were created by selecting the two highest performing model iterations from each of the LSTM and BLSTM networks. Keras was used to load the trained models, which were given input data in parallel, with the network outputs being processed by an averaging layer. Table 5.6
outlines the ensemble network’s structure, listing the per model iteration snapshots used
for each dataset. A complete report of per file and overall evaluation metrics is provided in
Appendix C, with Table 5.7 summarizing the overall results for each dataset.
The network ensemble results show overall improvement in F-Measure over the BLSTM networks, between 0.0099 and 0.0219 for the digital testing sets and between 0.0015 and 0.0219 for the acoustic testing sets,
Table 5.6: Ensemble Network Contents
Dataset LSTM Iterations BLSTM Iterations
1 100 90 100 90
2 95 100 80 75
3 100 85 95 70
4 90 85 100 95
5 85 60 95 75
Table 5.7: Ensemble Network Result Summary
Dataset Accuracy Precision Recall F-Measure
Digital
1 0.4899 0.8372 0.8282 0.8135
2 0.4930 0.8287 0.8118 0.8000
3 0.4612 0.8306 0.7633 0.7745
4 0.4690 0.8457 0.8490 0.8274
5 0.4176 0.8132 0.7844 0.7779
Acoustic
1 0.4291 0.8135 0.7910 0.7804
2 0.4296 0.8067 0.7615 0.7618
3 0.3907 0.8081 0.7154 0.7353
4 0.4296 0.8246 0.8284 0.8057
5 0.4093 0.7960 0.7501 0.7501
Analyzing the individual file performance in Appendix C reveals wide variance across
files, outlined in Table 5.8. The general consistency in how files performed across
datasets and the ensemble networks implies that the compositional complexity shown in
Figure 5.6 could contribute significantly to transcription accuracy, with datasets
containing a higher number of complex pieces seeing reduced overall accuracy.
Table 5.8: Ensemble Best and Worst File Results
Dataset Worst File F-Measure Best File F-Measure
Digital
1 mz 545 3 0.6829 scn16 4 0.9131
2 mz 545 3 0.6826 mz 333 2 0.9178
3 liz rhap09 0.6256 scn16 4 0.9004
4 mz 545 3 0.6792 mz 333 2 0.9199
5 liz rhap09 0.6433 ty maerz 0.8921
Acoustic
1 liz et trans5 0.6006 scn16 4 0.9050
2 liz rhap09 0.5932 mz 333 2 0.8906
3 liz et trans5 0.5406 scn16 4 0.8997
4 mz 545 3 0.6731 mz 333 2 0.8919
5 liz et trans5 0.5866 bk xmas1 0.8756
Figure 5.5: Transcription Error: Acoustic Dataset 1 - scn16 4
Figure 5.7 provides a comparison between the ground truth and the acoustic transcription
of the first minute of ’scn16 4’ produced by the ensemble trained on Dataset 1, which
reported an overall F-Measure of 0.9050. While the transcription closely resembles the
ground truth, visual analysis identifies several entirely false positive notes, spurious notes,
and long notes that have been incorrectly segmented. Figure 5.5 presents a visualization
of the errors in the transcription of Figure 5.7, with false positives in red and false
negatives in blue. Most of the false positives in this example can be classified as spurious
notes or incorrect offset determination, where notes were held too long; still, some
extended false positive notes exist and indicate a failure in the MPE stage. The false
negatives affecting the recall metric fall into two categories: incorrect offset
determination, where notes were cut off too early, and long-note segmentation errors,
where sustained notes are broken into several smaller notes, with the former accounting
for the sustained false negatives.
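An error visualization in the style of Figure 5.5 can be generated directly from the
binarized transcription and ground-truth matrices; the following sketch is illustrative,
with all names hypothetical:

import numpy as np
import matplotlib.pyplot as plt

def plot_errors(pred, truth):
    # pred, truth: binary (frames, notes) matrices for one piece.
    pred, truth = pred.astype(bool), truth.astype(bool)
    img = np.ones(pred.shape + (3,))        # white background
    img[pred & ~truth] = [1.0, 0.0, 0.0]    # false positives in red
    img[~pred & truth] = [0.0, 0.0, 1.0]    # false negatives in blue
    plt.imshow(img.transpose(1, 0, 2), origin='lower', aspect='auto')
    plt.xlabel('Frame')
    plt.ylabel('Note index')
    plt.show()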
The investigation and implementation of AMT neural networks has demonstrated steady
improvement, culminating in the LSTM/BLSTM Ensemble, which outperforms
state-of-the-art techniques. Table 5.9 outlines the best results for the network model and
data partition combinations in order of development. The features that most advanced
transcription accuracy were the inclusion of music within the training data, deeper
convolutional networks in the MPE stage, network ensembling, and the implementation
of a Bidirectional LSTM in the NT stage.
Table 5.9: Network Result Summary
Network Model Data Partition F-Measure
Basic AMT Notes & Chords 0.1015
Basic AMT LSET (Chords and Music) 0.6857
3La 5It Lset LSET (Chords and Music) 0.6871
Ens Miter Lset LSET (Chords and Music) 0.7230
Two Convolutional Layers Frame Context Window 0.6967
LSTM (Three Convolutional Layers) Frame Context Window 0.7125
LSTM (Three Convolutional Layers) Dataset 4 0.7614
BLSTM Dataset 4 0.7838
LSTM/BLSTM Ensemble Dataset 4 0.8057
Figure 5.6: Comparison of Composition Complexity
Figure 5.7: Label and Transcription Comparison: scn16 4
CHAPTER VI
CONCLUSION
This thesis has presented a background on Automatic Music Transcription of polyphonic
piano music, which remains an open problem in the field of MIR. The goal of this task is
to convert an audio signal into an accurate high-level representation of the music being
played. Polyphonic AMT is challenging due to the variable nature of audio input and the
interaction of notes within the time-frequency domain. Several network and dataset
configurations based on Sigtia et al. and Wang et al. were investigated [4][3]. The LSTM,
BLSTM, and Ensemble networks developed within this thesis are proposed as
improvements over existing methods for the polyphonic AMT task.
Table 6.1: State of the Art Comparison
Precision Recall F-Measure
Sigtia et al. 2016 [4] 0.7270 0.7694 0.7476
Wang et al. 2018 [3] 0.7009 0.7952 0.7451
Liu et al. 2018 [7] – – 0.6502
Valero et al. 2018 [1] – – 0.70
LSTM 0.7531 0.8248 0.7614
BLSTM 0.7678 0.8494 0.7838
Ensemble 0.8246 0.8284 0.8057
Table 6.1 compares frame-based state-of-the-art transcription metrics for acoustic piano
pieces. Comparison of acoustic results is more applicable to real-world situations, where
instrument sound and quality are variable. When compared against the reported
measurements for Sigtia and Wang, on which the networks were based, the proposed
LSTM/BLSTM Ensemble yields an improvement in F-Measure of 0.0581, with each of
the component networks also outperforming them to a lesser extent. This improvement
can be attributed to the increase in MPE stage depth with additional convolutional blocks,
the substitution of a BLSTM for the LSTM, which provided knowledge of future
sequence information at each frame, and finally network ensembling.
1 Future Work
After review of the results obtained, significant room for network improvement remains
offering several opportunities for future work. Inclusion of note onset and offset
information into NT stage has been shown to improve transcription accuracy in
state-of-the-art methods presenting an opportunity to further improve upon the LSTM and
BLSTM designs [1][3][7]. Additionally, application of post-processing filters used in the
binarization process, to remove spurious and over segmented notes, by Valero-Mas et al.
could be investigated. Review of the transcription results produced by LSTM and BLSTM
ensemble revealed erroneous notes indicating room for improvement in the MPE stage.
Research into new network models, application of machine translation and speech
recognition techniques may provide possible solutions. When developing new network
models, Attention Networks (ANs) should be considered as such networks have shown
promising results in sequence to sequence models for speech recognition [28]. To the best
of the author’s knowledge, these network models have not been applied to the AMT task
at the time of writing. Finally, while the MAPS dataset is robust it contains a limited
number of music pieces isolated to the classical music genre; the development of a more
robust dataset may contribute to improvements polyphonic AMT accuracy.
REFERENCES
[1] J. J. Valero-Mas, E. Benetos, and J. M. Inesta, “A supervised classification approach
for note tracking in polyphonic piano transcription,” Journal of New Music Research,
pp. 1–15, 2018.
[2] A. P. Klapuri, “Automatic music transcription as we know it today,” Journal of New
Music Research, vol. 33, no. 3, pp. 269–282, 2004.
[3] Q. Wang, R. Zhou, and Y. Yan, “Polyphonic piano transcription with a note-based
music language model,” Applied Sciences, vol. 8, no. 3, p. 470, 2018.
[4] S. Sigtia, E. Benetos, and S. Dixon, “An end-to-end neural network for polyphonic
piano music transcription,” IEEE/ACM Transactions on Audio, Speech, and
Language Processing, vol. 24, no. 5, pp. 927–939, 2016.
[5] C. O’Brien and M. D. Plumbley, “Automatic music transcription using low rank
non-negative matrix decomposition,” in 2017 25th European Signal Processing
Conference (EUSIPCO), pp. 1848–1852, IEEE, 2017.
[6] P. Smaragdis, B. Raj, and M. Shashanka, “A probabilistic latent variable model for
acoustic modeling,” Advances in models for acoustic processing, NIPS, vol. 148,
pp. 8–1, 2006.
[7] S. Liu, L. Guo, G. A. Wiggins, et al., “A parallel fusion approach to piano music
transcription based on convolutional neural network,” in 2018 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 391–395,
IEEE, 2018.
[8] E. Benetos, T. Weyde, et al., “An efficient temporally-constrained probabilistic
model for multiple-instrument music transcription,” International Society for Music
Information Retrieval, 2015.
[9] S. Sigtia, E. Benetos, S. Cherla, T. Weyde, A. Garcez, and S. Dixon, “Rnn-based
music language models for improving automatic music transcription,” International
Society for Music Information Retrieval, 2014.
[10] S. Sigtia, E. Benetos, N. Boulanger-Lewandowski, T. Weyde, A. S. d. Garcez, and
S. Dixon, “A hybrid recurrent neural network for music transcription,” in 2015 IEEE
International Conference on Acoustics, Speech and Signal Processing (ICASSP),
pp. 2061–2065, IEEE, 2015.
[11] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by
back-propagating errors,” Nature, vol. 323, no. 6088, p. 533, 1986.
[12] A. Graves, “Generating sequences with recurrent neural networks,” arXiv preprint
arXiv:1308.0850, 2013.
[13] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov,
“Dropout: a simple way to prevent neural networks from overfitting,” The Journal of
Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[14] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training
by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
[15] F. Chollet et al., “Keras.” https://keras.io, 2015.
[16] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv
preprint arXiv:1412.6980, 2014.
[17] V. Emiya, R. Badeau, and B. David, “Multipitch estimation of piano sounds using a
new probabilistic spectral smoothness principle,” IEEE Transactions on Audio,
Speech, and Language Processing, vol. 18, no. 6, pp. 1643–1654, 2010.
[18] G. E. Poliner and D. P. Ellis, “A discriminative model for polyphonic piano
transcription,” EURASIP Journal on Advances in Signal Processing, vol. 2007, no. 1,
p. 048317, 2006.
[19] S. Böck and M. Schedl, “Polyphonic piano note transcription with recurrent neural
networks,” in ICASSP, pp. 121–124, 2012.
[20] M. Goto, H. Hashiguchi, T. Nishimura, and R. Oka, “RWC music database: Music
genre database and musical instrument sound database,” Johns Hopkins University,
2003.
[21] J. C. Brown, “Calculation of a constant q spectral transform,” The Journal of the
Acoustical Society of America, vol. 89, no. 1, pp. 425–434, 1991.
[22] B. McFee, C. Raffel, D. Liang, D. P. Ellis, M. McVicar, E. Battenberg, and O. Nieto,
“librosa: Audio and music signal analysis in Python,” in Proceedings of the 14th
Python in Science Conference, pp. 18–25, 2015.
[23] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel,
M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al., “Scikit-learn: Machine
learning in Python,” Journal of Machine Learning Research, vol. 12, no. Oct,
pp. 2825–2830, 2011.
[24] L. K. Hansen and P. Salamon, “Neural network ensembles,” IEEE Transactions on
Pattern Analysis and Machine Intelligence, vol. 12, no. 10, pp. 993–1001, 1990.
[25] A. Graves, A.-r. Mohamed, and G. Hinton, “Speech recognition with deep recurrent
neural networks,” in IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP), 2013, pp. 6645–6649, IEEE, 2013.
[26] M. Sundermeyer, T. Alkhouli, J. Wuebker, and H. Ney, “Translation modeling with
bidirectional recurrent neural networks,” in Proceedings of the 2014 Conference on
Empirical Methods in Natural Language Processing (EMNLP), pp. 14–25, 2014.
[27] M. Schuster and K. K. Paliwal, “Bidirectional recurrent neural networks,” IEEE
Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.
[28] C.-C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan,
R. J. Weiss, K. Rao, E. Gonina, et al., “State-of-the-art speech recognition with
sequence-to-sequence models,” in 2018 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP), pp. 4774–4778, IEEE, 2018.
APPENDIX A
LSTM NETWORK RESULTS
Table A.1: Digital Results: Single LSTM - Dataset 1
Iteration Accuracy Precision Recall F-Measure
0 0 0.0442 0.4671 0.0789
5 0.1748 0.7183 0.6734 0.6575
10 0.1426 0.6488 0.7333 0.6536
15 0.1244 0.6277 0.7497 0.649
20 0.2633 0.7533 0.7385 0.7165
25 0.2342 0.7209 0.7674 0.7124
30 0.3598 0.8162 0.753 0.7567
35 0.3339 0.8047 0.7397 0.7437
40 0.2901 0.7645 0.7634 0.7353
45 0.2842 0.7565 0.7571 0.7274
50 0.3389 0.7951 0.7714 0.7575
55 0.274 0.7437 0.7763 0.7307
60 0.3242 0.7774 0.7715 0.7476
65 0.3454 0.8061 0.739 0.7446
70 0.362 0.7966 0.7654 0.7557
75 0.2868 0.7387 0.7744 0.7252
80 0.3388 0.786 0.7666 0.7489
85 0.3385 0.784 0.7755 0.7526
90 0.3173 0.7593 0.8077 0.7576
95 0.2263 0.6913 0.7925 0.7068
100 0.3568 0.7797 0.7958 0.7628
Table A.2: Acoustic Results: Single LSTM - Dataset 1
Iteration Accuracy Precision Recall F-Measure
0 0 0.0437 0.4903 0.0786
5 0.1628 0.6898 0.6353 0.6268
10 0.1531 0.6387 0.6915 0.6311
15 0.144 0.6274 0.7094 0.6332
20 0.2475 0.7245 0.6811 0.6712
25 0.2297 0.7051 0.7159 0.6805
30 0.2863 0.7772 0.6662 0.6859
35 0.3053 0.7857 0.6819 0.7003
40 0.26 0.7371 0.7216 0.7001
45 0.2673 0.7347 0.7223 0.6982
50 0.332 0.7823 0.7437 0.7341
55 0.2651 0.723 0.7479 0.7046
60 0.3017 0.7591 0.7333 0.718
65 0.3268 0.7902 0.7044 0.7159
70 0.3028 0.7597 0.7213 0.7096
75 0.2301 0.6911 0.7345 0.6755
80 0.2921 0.7596 0.715 0.707
85 0.3019 0.7513 0.7437 0.718
90 0.2917 0.7341 0.7784 0.7275
95 0.2104 0.6643 0.7588 0.6771
100 0.306 0.7475 0.7599 0.7257
Table A.3: Digital Results: Single LSTM - Dataset 2
Iteration Accuracy Precision Recall F-Measure
0 0 0.04 0.3976 0.0704
5 0.2056 0.7452 0.6694 0.67
10 0.2277 0.7486 0.6579 0.6668
15 0.2456 0.7353 0.7498 0.7135
20 0.1953 0.6872 0.7575 0.6879
25 0.1585 0.6307 0.7934 0.6715
30 0.2697 0.7812 0.5402 0.6076
35 0.1773 0.6608 0.7811 0.6836
40 0.125 0.6186 0.7776 0.6546
45 0.1874 0.6701 0.7814 0.6891
50 0.2237 0.6834 0.8045 0.7084
55 0.2207 0.6845 0.8028 0.7078
60 0.3428 0.7702 0.7908 0.7556
65 0.1692 0.6403 0.8118 0.684
70 0.2079 0.679 0.7903 0.6999
75 0.3597 0.7924 0.7322 0.7349
80 0.1611 0.6358 0.8039 0.6763
85 0.2286 0.6935 0.7848 0.7063
90 0.3578 0.8016 0.7158 0.7298
95 0.3365 0.7653 0.7751 0.7448
100 0.3465 0.7736 0.7782 0.7505
Table A.4: Acoustic Results: Single LSTM - Dataset 2
Iteration Accuracy Precision Recall F-Measure
0 0 0.0402 0.4058 0.0708
5 0.1968 0.7238 0.6186 0.6347
10 0.247 0.7378 0.6053 0.633
15 0.247 0.7149 0.7034 0.6801
20 0.2112 0.6823 0.7125 0.667
25 0.1813 0.6223 0.7597 0.6541
30 0.2201 0.7134 0.4489 0.5196
35 0.2136 0.6717 0.7391 0.674
40 0.1525 0.6125 0.7275 0.6328
45 0.2232 0.6791 0.7466 0.6829
50 0.2279 0.682 0.7531 0.6862
55 0.2086 0.6588 0.7543 0.6718
60 0.2955 0.7458 0.7331 0.7113
65 0.189 0.629 0.7816 0.6654
70 0.2353 0.6735 0.7508 0.6789
75 0.3046 0.7593 0.6642 0.6783
80 0.1926 0.6436 0.7718 0.6707
85 0.2608 0.6961 0.7446 0.6903
90 0.3132 0.7691 0.6473 0.6739
95 0.3074 0.746 0.7181 0.7033
100 0.3104 0.7468 0.7231 0.7069
Table A.5: Digital Results: Single LSTM - Dataset 3
Iteration Accuracy Precision Recall F-Measure
0 0 0.047 0.4481 0.0827
5 0.1345 0.6451 0.6829 0.6282
10 0.2187 0.738 0.6677 0.6674
15 0.2172 0.7169 0.707 0.6812
20 0.2428 0.7359 0.721 0.6978
25 0.3118 0.8022 0.6597 0.6945
30 0.2977 0.7762 0.7044 0.7109
35 0.2605 0.7455 0.7218 0.7036
40 0.1678 0.6335 0.7658 0.6601
45 0.3379 0.8 0.6882 0.7128
50 0.2507 0.719 0.7309 0.6952
55 0.2135 0.6854 0.7531 0.686
60 0.3346 0.7934 0.7008 0.7174
65 0.3362 0.7993 0.6322 0.6781
70 0.2005 0.6673 0.7542 0.6765
75 0.3291 0.7979 0.6574 0.6928
80 0.2173 0.681 0.7431 0.6783
85 0.3262 0.7816 0.7142 0.7191
90 0.1928 0.6608 0.7384 0.6636
95 0.2603 0.7089 0.7817 0.7146
100 0.3715 0.8027 0.6945 0.7195
Table A.6: Acoustic Results: Single LSTM - Dataset 3
Iteration Accuracy Precision Recall F-Measure
0 0 0.0463 0.4329 0.0812
5 0.1365 0.6237 0.6641 0.6109
10 0.2273 0.7255 0.6449 0.6519
15 0.2173 0.71 0.6564 0.652
20 0.2300 0.7172 0.6681 0.6626
25 0.2760 0.7804 0.6175 0.6591
30 0.2717 0.7564 0.6474 0.6688
35 0.2628 0.7449 0.6787 0.6815
40 0.1650 0.6175 0.7326 0.6389
45 0.2895 0.7769 0.6306 0.6668
50 0.2140 0.6856 0.6803 0.6526
55 0.1923 0.6607 0.7079 0.6523
60 0.2891 0.7648 0.6548 0.6761
65 0.2743 0.7619 0.573 0.6234
70 0.1974 0.659 0.7199 0.6564
75 0.2716 0.7579 0.5951 0.6359
80 0.1913 0.6438 0.7029 0.6392
85 0.2767 0.7552 0.6609 0.6749
90 0.1800 0.6498 0.6594 0.6205
95 0.2538 0.7044 0.7406 0.6935
100 0.2972 0.7682 0.6289 0.6621
Table A.7: Digital Results: Single LSTM - Dataset 4
Iteration Accuracy Precision Recall F-Measure
0 0 0.0448 0.4601 0.0793
5 0.1382 0.6719 0.7418 0.6698
10 0.2538 0.7886 0.7397 0.7308
15 0.3017 0.8271 0.6774 0.7138
20 0.1204 0.6299 0.8364 0.6826
25 0.2403 0.7438 0.7857 0.7324
30 0.3397 0.843 0.6569 0.7097
35 0.2678 0.7556 0.8021 0.7486
40 0.1562 0.6524 0.8032 0.6831
45 0.3248 0.8255 0.7187 0.7386
50 0.2149 0.7044 0.8385 0.7343
55 0.3758 0.8358 0.7601 0.7705
60 0.1825 0.6745 0.833 0.7128
65 0.3218 0.8039 0.78 0.7631
70 0.292 0.759 0.8241 0.7621
75 0.3061 0.8227 0.6116 0.6698
80 0.348 0.8239 0.7627 0.7648
85 0.3766 0.8265 0.7819 0.7788
90 0.3176 0.7693 0.8431 0.7793
95 0.1685 0.6628 0.8361 0.7041
100 0.4223 0.8557 0.7549 0.7771
Table A.8: Acoustic Results: Single LSTM - Dataset 4
Iteration Accuracy Precision Recall F-Measure
0 0 0.0459 0.4790 0.0815
5 0.1480 0.6579 0.7191 0.6552
10 0.2418 0.7769 0.7046 0.7078
15 0.2673 0.7953 0.6362 0.6756
20 0.1467 0.6327 0.8196 0.6833
25 0.2638 0.7482 0.7616 0.7257
30 0.3068 0.8192 0.6368 0.6877
35 0.2643 0.7499 0.7689 0.7316
40 0.1904 0.6668 0.7765 0.6859
45 0.3111 0.807 0.6914 0.7164
50 0.2423 0.7105 0.8102 0.7287
55 0.3436 0.8212 0.7264 0.7439
60 0.2131 0.6802 0.8244 0.7168
65 0.3387 0.8109 0.7571 0.7567
70 0.3004 0.758 0.7908 0.7473
75 0.2392 0.7822 0.5281 0.5975
80 0.3340 0.8077 0.7349 0.7422
85 0.3410 0.8017 0.7517 0.7498
90 0.3020 0.7531 0.8248 0.7614
95 0.2103 0.6855 0.8179 0.7156
100 0.3584 0.8298 0.7064 0.7354
Table A.9: Digital Results: Single LSTM - Dataset 5
Iteration Accuracy Precision Recall F-Measure
0 0 0.0547 0.5077 0.0954
5 0.1646 0.7304 0.4480 0.5175
10 0.1864 0.7040 0.4302 0.5011
15 0.129 0.6200 0.7467 0.6433
20 0.1453 0.6283 0.7545 0.6522
25 0.1063 0.5961 0.7429 0.6246
30 0.2800 0.7726 0.6385 0.6691
35 0.1945 0.6798 0.7550 0.6826
40 0.1455 0.6228 0.7733 0.6568
45 0.2958 0.7678 0.7240 0.7180
50 0.2581 0.7306 0.7407 0.7058
55 0.2172 0.6863 0.7802 0.6992
60 0.3100 0.7742 0.7308 0.7247
65 0.1593 0.6470 0.7536 0.6615
70 0.2609 0.7504 0.6907 0.6875
75 0.2295 0.7043 0.7758 0.7082
80 0.1550 0.6314 0.7443 0.6472
85 0.3306 0.7672 0.7662 0.7427
90 0.3472 0.7957 0.7080 0.7235
95 0.2795 0.7503 0.7309 0.7117
100 0.2717 0.7230 0.7776 0.7222
Table A.10: Acoustic Results: Single LSTM - Dataset 5
Iteration Accuracy Precision Recall F-Measure
0 0 0.0541 0.5089 0.0946
5 0.1723 0.6986 0.4062 0.4803
10 0.1493 0.6266 0.3428 0.4143
15 0.1571 0.6245 0.7236 0.6390
20 0.1739 0.6198 0.7235 0.6349
25 0.1306 0.5961 0.7026 0.6114
30 0.2840 0.7485 0.6066 0.6397
35 0.2163 0.6762 0.7169 0.6655
40 0.1810 0.6365 0.7357 0.6507
45 0.2906 0.7514 0.6783 0.6843
50 0.2436 0.7043 0.6954 0.6698
55 0.2378 0.6889 0.7276 0.6785
60 0.2898 0.7492 0.6789 0.6845
65 0.1861 0.6468 0.7106 0.6455
70 0.2563 0.7362 0.6583 0.6637
75 0.2602 0.7114 0.7314 0.6928
80 0.1974 0.6497 0.7087 0.6456
85 0.3090 0.7423 0.7233 0.7060
90 0.3095 0.7639 0.6642 0.6815
95 0.2889 0.7410 0.6868 0.6829
100 0.2642 0.7120 0.7272 0.6912
APPENDIX B
BLSTM NETWORK RESULTS
Table B.1: Digital Results: Bidirectional LSTM - Dataset 1
Iteration Accuracy Precision Recall F-Measure
0 0 0.0449 0.3929 0.0788
5 0.2046 0.7209 0.4798 0.5435
10 0.2256 0.7022 0.4807 0.5397
15 0.3467 0.7865 0.7506 0.7421
20 0.3816 0.8190 0.7313 0.7471
25 0.3548 0.7844 0.7694 0.7520
30 0.3585 0.7751 0.7901 0.7577
35 0.3653 0.7838 0.7656 0.7490
40 0.3795 0.7942 0.7709 0.7584
45 0.3898 0.7927 0.7826 0.7646
50 0.2345 0.6867 0.8371 0.7251
55 0.4172 0.8059 0.7909 0.7762
60 0.4417 0.8249 0.7974 0.7899
65 0.4376 0.8257 0.7791 0.7805
70 0.3662 0.7724 0.8308 0.7764
75 0.4369 0.8208 0.7804 0.7793
80 0.3937 0.7855 0.8357 0.7878
85 0.4466 0.8302 0.7688 0.7770
90 0.4313 0.8084 0.8261 0.7962
95 0.4370 0.8076 0.8174 0.7913
100 0.4608 0.8248 0.8124 0.7990
Table B.2: Acoustic Results: Bidirectional LSTM - Dataset 1
Iteration Accuracy Precision Recall F-Measure
0 0 0.0444 0.3803 0.0774
5 0.1531 0.639 0.3878 0.4521
10 0.1735 0.6066 0.3952 0.4479
15 0.2918 0.7467 0.7069 0.6968
20 0.3265 0.7818 0.6980 0.7079
25 0.3234 0.7546 0.7332 0.7154
30 0.3128 0.7459 0.7417 0.7155
35 0.3285 0.7567 0.7327 0.7169
40 0.3216 0.7526 0.7453 0.7215
45 0.3339 0.7531 0.7539 0.7273
50 0.2502 0.6905 0.8032 0.7138
55 0.3497 0.7629 0.7581 0.7350
60 0.3769 0.7912 0.7687 0.7556
65 0.3817 0.7934 0.7441 0.7433
70 0.3308 0.7503 0.8035 0.7506
75 0.3721 0.7905 0.7376 0.7390
80 0.3671 0.7732 0.8048 0.7659
85 0.3768 0.7908 0.7308 0.7338
90 0.3717 0.7842 0.7927 0.7653
95 0.3982 0.7891 0.7906 0.7670
100 0.3958 0.7913 0.7765 0.7611
Table B.3: Digital Results: Bidirectional LSTM - Dataset 2
Iteration Accuracy Precision Recall F-Measure
0 0 0.0378 0.3675 0.0663
5 0.2047 0.7117 0.6273 0.6344
10 0.2446 0.7412 0.5613 0.6081
15 0.251 0.7066 0.7612 0.7041
20 0.2973 0.7433 0.7238 0.7052
25 0.2619 0.6982 0.8215 0.7275
30 0.3051 0.7419 0.7758 0.7313
35 0.3491 0.7748 0.744 0.7335
40 0.3023 0.7207 0.8105 0.7378
45 0.3891 0.7894 0.7125 0.725
50 0.4036 0.7947 0.7539 0.7508
55 0.3959 0.7893 0.7652 0.7543
60 0.3547 0.7637 0.77 0.7418
65 0.3583 0.7598 0.8118 0.7611
70 0.4073 0.8025 0.7636 0.7597
75 0.4379 0.7966 0.8099 0.7832
80 0.4537 0.811 0.8085 0.7901
85 0.4271 0.795 0.8045 0.7788
90 0.4415 0.8022 0.8038 0.7825
95 0.4215 0.7954 0.7943 0.7732
100 0.4007 0.7823 0.8137 0.7754
Table B.4: Acoustic Results: Bidirectional LSTM - Dataset 2
Iteration Accuracy Precision Recall F-Measure
0 0 0.035 0.3514 0.0616
5 0.2057 0.6761 0.6114 0.6122
10 0.2564 0.7138 0.5723 0.6045
15 0.2526 0.6837 0.7181 0.6712
20 0.2761 0.7107 0.6882 0.6706
25 0.2433 0.6663 0.7969 0.6982
30 0.3014 0.7259 0.7454 0.7088
35 0.3338 0.7533 0.7214 0.7112
40 0.2967 0.6976 0.7905 0.7142
45 0.3509 0.757 0.7006 0.7021
50 0.3638 0.7673 0.7156 0.7158
55 0.3465 0.7585 0.7206 0.7132
60 0.3224 0.7329 0.7448 0.7126
65 0.3562 0.7563 0.7852 0.7471
70 0.3539 0.7701 0.7241 0.7213
75 0.3957 0.7820 0.7703 0.7537
80 0.3821 0.7759 0.7684 0.7497
85 0.3847 0.7719 0.7647 0.7453
90 0.392 0.7733 0.7613 0.7446
95 0.3798 0.7661 0.7623 0.7408
100 0.3832 0.7675 0.7813 0.7515
Table B.5: Digital Results: Bidirectional LSTM - Dataset 3
Iteration Accuracy Precision Recall F-Measure
0 0 0.0469 0.4432 0.0824
5 0.1905 0.6878 0.6982 0.6614
10 0.1043 0.2817 0.1106 0.1457
15 0.2805 0.7435 0.728 0.7078
20 0.2937 0.7525 0.7289 0.7134
25 0.3102 0.7497 0.7564 0.7273
30 0.2734 0.7251 0.7653 0.7162
35 0.3583 0.7754 0.7646 0.7464
40 0.2476 0.6993 0.7053 0.6732
45 0.3718 0.7832 0.7632 0.7510
50 0.3504 0.7658 0.7759 0.7464
55 0.3725 0.7905 0.7451 0.7432
60 0.3322 0.7481 0.7747 0.7362
65 0.3912 0.799 0.7129 0.7298
70 0.38 0.7799 0.7899 0.7623
75 0.3893 0.7829 0.7718 0.7545
80 0.4058 0.7929 0.7628 0.7553
85 0.3916 0.7756 0.7909 0.7616
90 0.3703 0.7633 0.8023 0.7601
95 0.4168 0.8022 0.7664 0.7626
100 0.4225 0.7978 0.7669 0.7605
Table B.6: Acoustic Results: Bidirectional LSTM - Dataset 3
Iteration Accuracy Precision Recall F-Measure
0 0 0.0468 0.4515 0.0825
5 0.1793 0.6564 0.6606 0.6278
10 0.0895 0.1947 0.0746 0.099
15 0.2364 0.7026 0.6979 0.6727
20 0.2685 0.728 0.7046 0.6891
25 0.2721 0.7171 0.7263 0.6951
30 0.2654 0.7125 0.742 0.7003
35 0.3155 0.7466 0.7437 0.7195
40 0.1862 0.6244 0.659 0.6099
45 0.3321 0.7555 0.7369 0.7219
50 0.3101 0.7474 0.7369 0.7172
55 0.3443 0.7726 0.7218 0.7216
60 0.3056 0.7344 0.7528 0.7184
65 0.3363 0.771 0.6876 0.7007
70 0.3358 0.7621 0.7536 0.7338
75 0.3406 0.7624 0.7429 0.7273
80 0.3413 0.7663 0.7231 0.7193
85 0.3367 0.7582 0.7478 0.7287
90 0.3235 0.7369 0.7771 0.7326
95 0.3581 0.774 0.7402 0.7332
100 0.3496 0.7656 0.7344 0.7256
Table B.7: Digital Results: Bidirectional LSTM - Dataset 4
Iteration Accuracy Precision Recall F-Measure
0 0 0.05 0.4985 0.0884
5 0.2307 0.783 0.6188 0.6594
10 0.269 0.7751 0.7379 0.7257
15 0.2396 0.7361 0.8024 0.7374
20 0.3229 0.8074 0.7527 0.7514
25 0.3419 0.8053 0.7846 0.7688
30 0.3143 0.7863 0.7827 0.7582
35 0.3779 0.8271 0.7504 0.7625
40 0.3543 0.8078 0.8103 0.7834
45 0.4074 0.8352 0.7905 0.7899
50 0.3861 0.8172 0.8033 0.7868
55 0.389 0.814 0.8161 0.7914
60 0.4211 0.8453 0.7921 0.7953
65 0.3773 0.8124 0.8226 0.7934
70 0.4277 0.8392 0.8092 0.8021
75 0.3683 0.8022 0.8356 0.7947
80 0.3261 0.7789 0.831 0.7771
85 0.4553 0.8609 0.7853 0.8008
90 0.3914 0.8057 0.8465 0.8023
95 0.4139 0.8212 0.8341 0.8052
100 0.3834 0.7946 0.8628 0.8055
Table B.8: Acoustic Results: Bidirectional LSTM - Dataset 4
Iteration Accuracy Precision Recall F-Measure
0 0 0.048 0.4932 0.0849
5 0.2321 0.748 0.6196 0.6476
10 0.2677 0.7515 0.73 0.7129
15 0.2411 0.723 0.7759 0.7205
20 0.308 0.7739 0.7461 0.7333
25 0.3122 0.7791 0.7557 0.7413
30 0.2992 0.7592 0.7628 0.7342
35 0.3619 0.8097 0.7315 0.7432
40 0.3611 0.8018 0.7994 0.7768
45 0.3702 0.8085 0.7663 0.762
50 0.3496 0.7953 0.7704 0.7586
55 0.3454 0.7877 0.7927 0.7658
60 0.4012 0.831 0.7656 0.7731
65 0.3432 0.7877 0.8027 0.7709
70 0.3758 0.8082 0.7842 0.7729
75 0.3687 0.7935 0.8171 0.7821
80 0.3328 0.7773 0.8106 0.7686
85 0.405 0.8347 0.7631 0.7742
90 0.3565 0.7852 0.82 0.7788
95 0.3661 0.7949 0.8121 0.7804
100 0.3501 0.7678 0.8494 0.7838
Table B.9: Digital Results: Bidirectional LSTM - Dataset 5
Iteration Accuracy Precision Recall F-Measure
0 0 0.0468 0.4954 0.0829
5 0.1967 0.7341 0.5931 0.6202
10 0.1931 0.6905 0.7486 0.6871
15 0.1932 0.6908 0.7219 0.6721
20 0.2504 0.7178 0.7573 0.7076
25 0.2232 0.6895 0.7759 0.6968
30 0.3004 0.7543 0.7621 0.7319
35 0.3459 0.7763 0.765 0.7467
40 0.3308 0.7738 0.765 0.7443
45 0.3317 0.7618 0.773 0.7429
50 0.295 0.7375 0.7797 0.7308
55 0.3104 0.7535 0.7723 0.7368
60 0.3691 0.7875 0.7455 0.7418
65 0.3537 0.7716 0.7824 0.7536
70 0.3187 0.7588 0.766 0.7370
75 0.3409 0.7703 0.7748 0.7480
80 0.2921 0.7304 0.7961 0.7342
85 0.3482 0.7736 0.7575 0.7413
90 0.3452 0.7786 0.7586 0.7439
95 0.3667 0.7813 0.7898 0.7635
100 0.3648 0.7796 0.7586 0.7455
Table B.10: Acoustic Results: Bidirectional LSTM - Dataset 5
Iteration Accuracy Precision Recall F-Measure
0 0 0.0446 0.4714 0.0790
5 0.2032 0.6977 0.5573 0.5875
10 0.2227 0.6841 0.7267 0.6763
15 0.2113 0.6809 0.6999 0.6607
20 0.2647 0.7076 0.7225 0.6873
25 0.2508 0.6848 0.7582 0.6905
30 0.2981 0.7261 0.7328 0.7034
35 0.3193 0.7521 0.7171 0.7081
40 0.3302 0.7563 0.7383 0.7220
45 0.3148 0.7321 0.7415 0.7111
50 0.3099 0.7243 0.7433 0.7075
55 0.3332 0.7475 0.7437 0.7210
60 0.3382 0.7588 0.7008 0.7029
65 0.3506 0.7523 0.7461 0.7257
70 0.3256 0.7461 0.7300 0.7131
75 0.3343 0.7452 0.7501 0.7229
80 0.3212 0.7280 0.7571 0.7165
85 0.3421 0.7497 0.7270 0.7126
90 0.3532 0.7585 0.7282 0.7187
95 0.3695 0.7598 0.7671 0.7404
100 0.3393 0.7549 0.7193 0.7111
APPENDIX C
ENSEMBLE NETWORK RESULTS
Table C.1: Digital Results: Ensemble - Dataset 1
File Accuracy Precision Recall F-Measure
bor ps6 0.5590 0.8535 0.9240 0.8711
chpn-p19 0.2673 0.8162 0.7746 0.7705
deb clai 0.2243 0.8897 0.7574 0.7922
deb menu 0.3356 0.7675 0.7513 0.7402
liz et6 0.3170 0.7718 0.6905 0.6990
liz et trans5 0.2074 0.7580 0.6803 0.6844
mz 331 2 0.6984 0.8951 0.9079 0.8906
mz 331 3 0.4805 0.7182 0.7887 0.7297
mz 333 2 0.7090 0.9120 0.9525 0.9217
mz 545 3 0.6609 0.6781 0.7127 0.6829
schuim-1 0.4335 0.8543 0.8482 0.8315
scn15 11 0.4565 0.8324 0.8526 0.8222
scn16 3 0.2428 0.7733 0.7716 0.7476
scn16 4 0.7670 0.9251 0.9139 0.9131
ty mai 0.7521 0.8947 0.9072 0.8930
Overall 0.4899 0.8372 0.8282 0.8135
Table C.2: Acoustic Results: Ensemble - Dataset 1
File Accuracy Precision Recall F-Measure
bor ps6 0.4026 0.8202 0.9157 0.8512
chpn-p19 0.2200 0.7794 0.7291 0.7212
deb clai 0.2030 0.8402 0.7810 0.7797
deb menu 0.3281 0.7506 0.7355 0.7232
liz et6 0.2985 0.7528 0.6653 0.6798
liz et trans5 0.1513 0.7173 0.5638 0.6006
mz 331 2 0.5432 0.8446 0.8751 0.8433
mz 331 3 0.4176 0.6675 0.7361 0.6733
mz 333 2 0.6384 0.8987 0.9171 0.8954
mz 545 3 0.6630 0.6730 0.6961 0.6726
schuim-1 0.3884 0.8442 0.7565 0.7704
scn15 11 0.4529 0.8503 0.7967 0.8009
scn16 3 0.2568 0.7669 0.7527 0.7378
scn16 4 0.6281 0.9141 0.9116 0.9050
ty mai 0.6451 0.8688 0.8757 0.8610
Overall 0.4291 0.8135 0.7910 0.7804
Table C.3: Digital Results: Ensemble - Dataset 2
File Accuracy Precision Recall F-Measure
alb se2 0.4082 0.9261 0.8583 0.8760
bk xmas5 0.4568 0.9100 0.8629 0.8737
chpn-p19 0.2653 0.8415 0.7270 0.7556
grieg butterfly 0.5977 0.8724 0.8487 0.8434
liz rhap09 0.3869 0.7045 0.6112 0.6311
mz 331 3 0.4714 0.7107 0.7725 0.7184
mz 332 2 0.6380 0.9181 0.9162 0.9066
mz 333 2 0.6966 0.9102 0.9459 0.9178
mz 333 3 0.5761 0.8083 0.8978 0.8321
mz 545 3 0.6828 0.6785 0.7116 0.6826
mz 570 1 0.5848 0.8438 0.8358 0.8194
pathetique 1 0.4236 0.7938 0.8075 0.7799
schu 143 3 0.3244 0.8222 0.7152 0.7422
schuim-1 0.4172 0.8505 0.8277 0.8187
scn15 11 0.5138 0.8531 0.8573 0.8337
Overall 0.4930 0.8287 0.8118 0.8000
Table C.4: Acoustic Results: Ensemble - Dataset 2
File Accuracy Precision Recall F-Measure
alb se2 0.2901 0.8721 0.8522 0.8460
bk xmas5 0.4187 0.8938 0.8762 0.8746
chpn-p19 0.2283 0.7961 0.6871 0.7049
grieg butterfly 0.5291 0.8566 0.8387 0.8305
liz rhap09 0.3582 0.6705 0.5692 0.5932
mz 331 3 0.4178 0.6420 0.7029 0.6396
mz 332 2 0.4128 0.8717 0.8301 0.8312
mz 333 2 0.6342 0.9030 0.9039 0.8906
mz 333 3 0.5685 0.8038 0.8628 0.8155
mz 545 3 0.6547 0.6668 0.6794 0.6604
mz 570 1 0.4811 0.8248 0.7706 0.7727
pathetique 1 0.3693 0.7857 0.7283 0.7333
schu 143 3 0.2733 0.7947 0.6821 0.7114
schuim-1 0.3722 0.8399 0.7249 0.7487
scn15 11 0.4374 0.8543 0.7791 0.7936
Overall 0.4296 0.8067 0.7615 0.7618
Table C.5: Digital Results: Ensemble - Dataset 3
File Accuracy Precision Recall F-Measure
alb se2 0.4196 0.9291 0.8498 0.8721
bk xmas4 0.4395 0.8833 0.8049 0.8196
bor ps6 0.4790 0.8561 0.8974 0.8590
deb clai 0.2217 0.9128 0.7122 0.7748
liz et trans5 0.1922 0.7755 0.5842 0.6320
liz rhap09 0.3947 0.7120 0.5979 0.6256
mz 545 3 0.7217 0.6932 0.6964 0.6845
mz 570 1 0.6004 0.8529 0.8225 0.8184
pathetique 1 0.4419 0.8066 0.7937 0.7814
scn15 11 0.5118 0.8676 0.8254 0.8283
scn15 12 0.2749 0.8980 0.6989 0.7716
scn16 3 0.2448 0.7948 0.7409 0.7436
scn16 4 0.7486 0.9247 0.8959 0.9004
ty maerz 0.6585 0.9089 0.9088 0.8993
ty mai 0.7864 0.9042 0.9044 0.8970
Overall 0.4612 0.8306 0.7633 0.7745
Table C.6: Acoustic Results: Ensemble - Dataset 3
File Accuracy Precision Recall F-Measure
alb se2 0.3099 0.8902 0.8361 0.8443
bk xmas4 0.4307 0.8668 0.8257 0.8262
bor ps6 0.4726 0.8538 0.8909 0.8578
deb clai 0.1924 0.8695 0.7403 0.7685
liz et trans5 0.1441 0.7336 0.4674 0.5406
liz rhap09 0.3518 0.6782 0.5394 0.5762
mz 545 3 0.6764 0.6764 0.6642 0.6578
mz 570 1 0.4871 0.8341 0.7549 0.7682
pathetique 1 0.3769 0.7930 0.6939 0.7169
scn15 11 0.4247 0.8561 0.7598 0.7830
scn15 12 0.1686 0.8575 0.6151 0.6911
scn16 3 0.245 0.7883 0.7110 0.7249
scn16 4 0.6304 0.9296 0.8903 0.8997
ty maerz 0.4690 0.8915 0.8648 0.8601
ty mai 0.6337 0.8806 0.8585 0.8586
Overall 0.3907 0.8081 0.7154 0.7353
Table C.7: Digital Results: Ensemble - Dataset 4
File Accuracy Precision Recall F-Measure
alb se2 0.3894 0.9050 0.8816 0.8782
bk xmas1 0.3643 0.9011 0.8349 0.8490
bk xmas4 0.4097 0.8589 0.8305 0.8220
bor ps6 0.5238 0.8589 0.9104 0.8685
chpn-p19 0.2547 0.8345 0.7659 0.7736
deb clai 0.1855 0.8920 0.7220 0.7708
deb menu 0.3163 0.7571 0.7674 0.7426
liz et6 0.3282 0.7710 0.6960 0.7005
mz 332 2 0.6123 0.9042 0.9251 0.9032
mz 333 2 0.6905 0.9023 0.9601 0.9199
mz 333 3 0.5134 0.7769 0.9135 0.8195
mz 545 3 0.6517 0.6624 0.7235 0.6792
mz 570 1 0.5719 0.8324 0.8502 0.8221
schuim-1 0.4224 0.8371 0.8585 0.8293
ty maerz 0.6140 0.8808 0.9257 0.8917
Overall 0.4690 0.8457 0.8490 0.8274
Table C.8: Acoustic Results: Ensemble - Dataset 4
File Accuracy Precision Recall F-Measure
alb se2 0.2948 0.8404 0.8796 0.8429
bk xmas1 0.4623 0.8716 0.9058 0.8757
bk xmas4 0.3734 0.8301 0.8654 0.8297
bor ps6 0.5106 0.8424 0.9047 0.8579
chpn-p19 0.2220 0.7870 0.7259 0.7236
deb clai 0.2030 0.8569 0.7722 0.7839
deb menu 0.3293 0.7287 0.7642 0.7271
liz et6 0.2955 0.7424 0.6881 0.6882
mz 332 2 0.4390 0.8642 0.8475 0.8372
mz 333 2 0.6280 0.8927 0.9169 0.8919
mz 333 3 0.5460 0.7889 0.8879 0.8176
mz 545 3 0.6538 0.6628 0.7109 0.6731
mz 570 1 0.4492 0.8072 0.7890 0.7734
schuim-1 0.3981 0.8464 0.7678 0.7780
ty maerz 0.4107 0.8414 0.9108 0.8558
Overall 0.4296 0.8246 0.8284 0.8057
Table C.9: Digital Results: Ensemble - Dataset 5
File Accuracy Precision Recall F-Measure
bk xmas1 0.3617 0.9155 0.8128 0.8441
bk xmas4 0.4477 0.8856 0.8521 0.8559
bor ps6 0.5504 0.8700 0.9056 0.8720
chpn-e01 0.1296 0.8763 0.7338 0.7744
deb menu 0.3262 0.7685 0.7405 0.7365
grieg butterfly 0.5707 0.8735 0.8415 0.8389
liz et trans5 0.1941 0.7705 0.6345 0.6623
liz rhap09 0.3742 0.7008 0.6354 0.6433
mz 333 3 0.5783 0.8138 0.9066 0.8390
mz 545 3 0.6709 0.6801 0.7042 0.6807
pathetique 1 0.4406 0.7940 0.8225 0.7894
schuim-1 0.4137 0.8567 0.8347 0.8252
scn15 11 0.4491 0.8457 0.8371 0.8183
scn15 12 0.2678 0.8805 0.7168 0.7743
ty maerz 0.6070 0.8942 0.9099 0.8921
Overall 0.4176 0.8132 0.7844 0.7779
Table C.10: Acoustic Results: Ensemble - Dataset 5
File Accuracy Precision Recall F-Measure
bk xmas1 0.4778 0.8888 0.8884 0.8756
bk xmas4 0.4048 0.8581 0.8562 0.8421
bor ps6 0.5192 0.8611 0.9007 0.8656
chpn-e01 0.1029 0.7778 0.7696 0.7502
deb menu 0.3496 0.7498 0.7424 0.7270
grieg butterfly 0.5625 0.8642 0.8391 0.8355
liz et trans5 0.1493 0.7422 0.5302 0.5866
liz rhap09 0.3685 0.6724 0.5969 0.6089
mz 333 3 0.6164 0.8269 0.8814 0.8371
mz 545 3 0.6632 0.6726 0.6908 0.6696
pathetique 1 0.4090 0.7875 0.7279 0.7345
schuim-1 0.3814 0.8539 0.7476 0.7704
scn15 11 0.4435 0.8512 0.7917 0.8011
scn15 12 0.1943 0.8389 0.6361 0.7010
ty maerz 0.4246 0.8729 0.8684 0.8504
Overall 0.4093 0.7960 0.7501 0.7501