NEURAL NETWORKS FOR AUTOMATIC POLYPHONIC PIANO MUSIC
TRANSCRIPTION
by
JOHNATHON MICHAEL ENDER
B.S., University of Wisconsin Madison, 2013
A thesis submitted to the Graduate Faculty of the
University of Colorado Colorado Springs
in partial fulfillment of the
requirements for the degree of
Master of Science
Department of Computer Science
2018
©2018
JOHNATHON MICHAEL ENDER
ALL RIGHTS RESERVED
This thesis for the Master of Science degree by
Johnathon Michael Ender
has been approved for the
Department of Computer Science
by
Jugal Kalita, Chair
Albert Glock
Sudhanshu Semwal
Date: December 11, 2018
Ender, Johnathon Michael (M.S., Computer Science)
Neural Networks for Automatic Polyphonic Piano Music Transcription
Thesis directed by Professor Jugal Kalita
ABSTRACT
Automatic Music Transcription is the process of producing an accurate musical representation from a polyphonic audio signal and continues to challenge state-of-the-art techniques. To produce reliable transcriptions, human transcribers must estimate the active notes within an audio sequence and apply a musical format, a task which requires extensive time and expertise. While automatic transcription techniques do not produce the same level of accuracy as human transcribers, modern machine learning techniques in the form of neural network models have recently shown improvement in the area. This thesis investigates the task of automatic polyphonic music transcription with the implementation and variation of supervised machine learning models trained on various partitions of the MAPS dataset. The idea is to generate a frame-based pitch activation matrix from an audio sequence which is subsequently classified into active or inactive note events. The results are presented as iterative improvements on network models, in terms of transcription accuracy, and include findings on the implementation of Bidirectional Long Short-Term Memory and Network Ensembling to significantly improve transcription results.
TABLE OF CONTENTS
CHAPTER
I Introduction
    1 Music
    2 Thesis Structure
II State of the Art in Polyphonic Piano Music Transcription
    1 Sigtia et al. 2016
    2 Wang et al. 2018
    3 Liu et al. 2018
    4 Valero-Mas et al. 2018
    5 Chapter Summary
III Developing an AMT Network
    1 AMT Network Model
    2 The MAPS Dataset
        2.1 Preprocessing
    3 Evaluation
    4 Chapter Summary
IV Preliminary AMT Network Investigation
    1 Notes and Chords
    2 Music and Chords
    3 Network Ensembling
    4 Chapter Summary
V Improving AMT Network Models
    1 LSTM Network Performance
    2 BLSTM Network Performance
    3 Ensemble Performance
VI Conclusion
    1 Future Work
REFERENCES
APPENDIX
    A LSTM Network Results
    B BLSTM Network Results
    C Ensemble Network Results
LIST OF FIGURES
1.1 Note: CQT and Label Comparison
1.2 Chord: CQT and Label Comparison
1.3 Music: CQT and Label Comparison
2.1 Network Structure: Sigtia et al. 2016
2.2 Network Structure: Wang et al. 2018
2.3 Network Structure: Liu et al. 2018
2.4 Network Structure: Valero-Mas et al. 2018
3.1 Basic LSTM AMT Network Structure
3.2 Label Generation Algorithm
4.1 Notes and Chords Ground Truth: First 2000 Frames
4.2 Notes and Chords Transcription: First 2000 Frames
4.3 Notes and Chords Transcription Error: First 2000 Frames
4.4 LSET Ground Truth: First 2000 Frames
4.5 LSET Transcription: First 2000 Frames
4.6 LSET Transcription Error: First 2000 Frames
4.7 Network Ensembling
4.8 3La 5It Lset Transcription Error: First 2000 Frames
4.9 Ens Miter Lset: First 2000 Frames
5.1 Frame Context Window
5.2 Two and Three Convolutional Layer Networks
5.3 Per File Network Training
5.4 Bidirectional Recurrent Neural Network
5.5 Transcription Error: Acoustic Dataset 1 - scn16 4
5.6 Comparison of Composition Complexity
5.7 Label and Transcription Comparison: scn16 4
LIST OF TABLES
3.1 MAPS: Instruments and Recording Conditions
3.2 MIDI Event File Format
4.1 Investigation Data Partitions
4.2 Evaluation Metrics: Notes and Chords
4.3 Evaluation Metrics: LSET
4.4 Single Network Model Descriptions
4.5 Single Network Model Results
4.6 Ensemble Network Contents
4.7 Ensemble Network Results
5.1 Context Window Results - Two Convolutional Layers
5.2 Context Window Results - Three Convolutional Layers
5.3 Test Set Partitions
5.4 LSTM Network Result Summary
5.5 BLSTM Network Result Summary
5.6 Ensemble Network Contents
5.7 Ensemble Network Result Summary
5.8 Ensemble Best and Worst File Results
5.9 Network Result Summary
6.1 State of the Art Comparison
A.1 Digital Results: Single LSTM - Dataset 1
A.2 Acoustic Results: Single LSTM - Dataset 1
A.3 Digital Results: Single LSTM - Dataset 2
A.4 Acoustic Results: Single LSTM - Dataset 2
A.5 Digital Results: Single LSTM - Dataset 3
A.6 Acoustic Results: Single LSTM - Dataset 3
A.7 Digital Results: Single LSTM - Dataset 4
A.8 Acoustic Results: Single LSTM - Dataset 4
A.9 Digital Results: Single LSTM - Dataset 5
A.10 Acoustic Results: Single LSTM - Dataset 5
B.1 Digital Results: Bidirectional LSTM - Dataset 1
B.2 Acoustic Results: Bidirectional LSTM - Dataset 1
B.3 Digital Results: Bidirectional LSTM - Dataset 2
B.4 Acoustic Results: Bidirectional LSTM - Dataset 2
B.5 Digital Results: Bidirectional LSTM - Dataset 3
B.6 Acoustic Results: Bidirectional LSTM - Dataset 3
B.7 Digital Results: Bidirectional LSTM - Dataset 4
B.8 Acoustic Results: Bidirectional LSTM - Dataset 4
B.9 Digital Results: Bidirectional LSTM - Dataset 5
B.10 Acoustic Results: Bidirectional LSTM - Dataset 5
C.1 Digital Results: Ensemble - Dataset 1
C.2 Acoustic Results: Ensemble - Dataset 1
C.3 Digital Results: Ensemble - Dataset 2
C.4 Acoustic Results: Ensemble - Dataset 2
C.5 Digital Results: Ensemble - Dataset 3
C.6 Acoustic Results: Ensemble - Dataset 3
C.7 Digital Results: Ensemble - Dataset 4
C.8 Acoustic Results: Ensemble - Dataset 4
C.9 Digital Results: Ensemble - Dataset 5
C.10 Acoustic Results: Ensemble - Dataset 5
CHAPTER I
INTRODUCTION
Automatic Music Transcription (AMT) is a process which converts an audio signal to a
general purpose high-level symbolic representation of the content. The task of
transcription is a subfield in Music Information Retrieval (MIR), which has studied AMT
extensively due to its applications in music preservation and annotation, music similarity
and retrieval, among others [1]. Producing accurate transcriptions requires extensive time
and expertise when performed by human transcribers who estimate the active notes within
an audio sequence and apply a musical format. To accomplish this, such a transcriber must
be able to accurately identify several aspects within the audio including pitch, tempo, and
rhythm. This often requires an audio segment to be reviewed many times. The complex
and time consuming nature of the transcription task presents opportunity for the utilization
of machine learning techniques as an alternative to human labor. The following sections
provide a background on the composition and complexity of music, then outline the results and content of this thesis.
1 Music
Piano music is composed of varying note and chord sequences, which represent melodic
and harmonic concepts. In terms of transcription, a note is represented as an individual
pitch with a beginning and an ending, referred to as onset and offset respectively; a chord is a group of notes which share onset and offset timings. A sequence of notes and chords,
known as a melody, is structured by tempo, the speed at which the music is played, and
time signature which represents the rhythm. A melody is further structured by restricting
notes and chords to a key, representing a group of related pitches, though composers may
include notes outside of the key for emotional effect. On the piano, each hand may
produce a melody independently, producing either monophonic or polyphonic sounds. Polyphony consists of two or more lines of melody played simultaneously, whereas
monophony consists of an isolated melody played for a length of time. While automatic
monophonic music transcription is considered a solved problem [2], polyphonic music
continues to be difficult for both human experts and proposed AMT approaches.
Polyphony is characterized by concurrent note overlap in the time domain, which
increases complexity in the frequency domain and the overall audio signal [3]. To
illustrate monophonic and polyphonic music complexity, Figures 1.1, 1.2, and 1.3 provide
comparison between an audio spectrogram and the ground truth transcription for a single
note, the C7 chord on C2, and forty-six seconds of classical music, respectively.
Figures 1.1 and 1.2 represent monophonic pieces of audio, where the note and chord
signals are uninterrupted, yet illustrate the string resonance, harmonics, and noise that
occur even when playing a single note. When visually compared to Figure 1.3, which represents
a polyphonic signal, the increased complexity is apparent. Thus, polyphonic AMT
presents a particularly difficult problem due to audio input signal variability, interaction
between concurrently sounding notes, and the harmonics that are produced by an
instrument such as a piano [4].
Figure 1.1: Note: CQT and Label Comparison
Figure 1.2: Chord: CQT and Label Comparison
Figure 1.3: Music: CQT and Label Comparison
2 Thesis Structure
The goal of this thesis is to develop an understanding of, and improve upon, neural network
models used in automatic transcription of piano music. Improvement is measured in terms
of transcription accuracy which is produced by a neural network and compared to the
ground truth transcription. The networks discussed in this thesis take a spectrogram as
input and produce a binary representation of active notes at a given time. Figure 1.3 shows
examples of both the spectrogram input and expected neural network output for a piece of
polyphonic piano music. At a high level, accuracy of the output is determined by
comparing it with the ground truth to determine accurate and inaccurate notes at each time step, otherwise referred to as a frame; this is discussed in further detail in Chapter 3. The
neural networks developed in this thesis surpassed the state-of-the-art accuracy score of 0.7476, with the best of these networks reporting a score of 0.8057, a 7.77% relative improvement.
The following chapters outline the development, investigation, and iteration process of the
AMT neural network which outperformed state-of-the-art results. Chapter 2 discusses
background research into AMT neural network structures and state of the art techniques.
Based on that research, Chapter 3 develops a network model, selects an appropriate
dataset, and outlines methods to evaluate neural network performance. Initial investigation
into the effects of dataset partitions and network structure follows in Chapter 4, serving as a basis for model improvement. In Chapter 5, augmentations to the network input and structure are explored, resulting in transcriptions that surpass state-of-the-art accuracy. Finally, the
thesis is concluded with a comparison of network results with reported state of the art
accuracy scores and a discussion of avenues for future work in AMT.
CHAPTER II
STATE OF THE ART IN POLYPHONIC PIANO MUSIC TRANSCRIPTION
Recent approaches to polyphonic AMT systems are composed of two components: an
initial Multipitch Estimation (MPE) stage which produces a non-binary two-dimensional
representation, referred to as a posteriorgram, that represents the active pitch probabilities
for each frame in a signal; and a Note Tracking (NT) stage that refines the MPE output,
acting as a correction step [1]. This two-stage strategy is analogous to speech recognition
methods with MPE strategies representing acoustic models and the NT stage as a Music
Language Model (MLM) which applies structural regularity to the MPE output [3].
The purpose of the MPE stage is to extract features from input data and produce pitch
probability on a per frame basis. There are a variety of common MPE strategies which
take a time-frequency representation, such as a spectrogram, as input. These include
Non-negative Matrix Factorization (NMF) [5], Probabilistic Latent Component Analysis
(PLCA) [6], and discriminative approaches which aim to classify features directly from
raw data or low level representations of audio signals. Recently neural networks have
shown excellent results in both PLCA and discriminative approaches to MPE [3][4][7].
Note tracking is a post-processing step which binarizes the posteriorgram provided by the
MPE stage. This is typically done by applying a threshold to the pitch activations, where
pitch probabilities less than the threshold are considered to be silence, and all others
assumed to be active. The result of this process can be considered a frame level
transcription, with the drawback of sensitivity to the MPE stage, where false positives and over-segmentation of long note events diminish transcription quality [1]. Other methods involve a rule-based system where notes shorter than a minimum length are pruned and small gaps between consecutive notes are filled; such rules must be crafted by hand, which is undesirable [8]. Similar to MPE stage implementations, neural network
based strategies for note tracking are providing improved transcription accuracy over
previous thresholding and rule-based methods [1][3][4].
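For reference, threshold-based note tracking amounts to a single comparison per frame and pitch; a minimal Python sketch follows (the 0.5 threshold is illustrative and not taken from any of the cited papers):

import numpy as np

def threshold_note_tracking(posteriorgram, threshold=0.5):
    # posteriorgram: array of shape (frames, 88) holding pitch activation
    # probabilities from the MPE stage; values at or above the threshold
    # become active notes, all others are treated as silence
    return (posteriorgram >= threshold).astype(np.uint8)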
Neural network based approaches to AMT are varied, but still retain the two-stage MPE
and NT structure. In 2016 Sigtia et al. proposed an end to end neural network with the
MPE and NT stages connected sequentially and trained in unison. Wang et al. furthered
this process with the addition of a note onset detection network in 2018. That same year
Liu et al. proposed an alternative network model, and Valero-Mas et al. investigated
several classifier and neural network based note tracking methods. In the following
sections, each of these approaches is discussed in turn.
1 Sigtia et al. 2016
Sigtia et al. implemented a two-stage Hidden Markov Model (HMM), a stochastic
statistical model where event probabilities are dependent on unobserved previous states,
for the AMT network which combined both the MPE and NT stages into a single
end-to-end solution. The approach, which continued Sigtia’s previous work in [9] and
[10], utilized a Convolutional Neural Network (CNN) and Recurrent Neural Network
(RNN) as the MPE and NT stages respectively, to improve transcription accuracy. A
visualization of the network structure is provided in Figure 2.1.
A CNN, used in the acoustic model, is a specialized neural network composed of
convolutional layers followed by pooling layers. In a convolutional layer, a set of weights is defined, referred to as a convolutional kernel, and multiplied by a region of input data
in a grid pattern to produce a feature map which preserves the spatial structure of the input
data. A subsequent pooling layer simplifies the feature map, reducing dimensionality by
Figure 2.1: Network Structure: Sigtia et al. 2016
taking values such as regional min, max, and average. The resulting network provides an
effective means of describing changes in local regions of the input data [7].
The RNN, used by Sigtia for note tracking, is a variant of standard feed-forward networks
designed to process sequential data. In addition to the vector input and output, the
recurrent network retains input and output states. These states provide a form of memory
allowing the network to apply previously learned information to each element as the
sequence is processed [11]. The memory provided by recurrent neural networks is
beneficial to improving transcription accuracy due to the highly correlated nature of
pitches in polyphonic music (harmonics, chords) [4].
To produce a binarized transcription, Sigtia et al. utilize a high dimensional hashed beam
search, a heuristic algorithm that traverses a limited number of promising nodes within a
graph, to select note combinations and sequences with the highest probability. This is in
contrast to thresholding methods which consider any note above a specified probability to
be active. The reported F-Measure of 0.7476 for the frame-based analysis of the end-to-end model outperformed existing PLCA and NMF models by 4%-10%, and represented an absolute improvement of 5% over existing neural network models. Sigtia et al.'s work in
2016 continues to be a basis and reference for current neural network AMT models.
2 Wang et al. 2018
Wang et al. augmented the network described in Sigtia et al.'s 2016 work with the addition of note onset detection information feeding into the MPE CNN. Onset detection
is performed by a CNN with the same structure as the MPE stage, which was augmented
to include additional convolutional layers, with the resulting high level network
architecture shown in Figure 2.2.
Figure 2.2: Network Structure: Wang et al. 2018
In addition to the onset detection network, the RNN in the NT stage was substituted with a Long Short-Term Memory (LSTM) layer. An LSTM cell is an augmented RNN unit which solves the vanishing and exploding gradient problems with the inclusion of a forget gate.
Vanishing and exploding gradients inhibit learning within RNNs and are caused by
previously learned dependencies being held too long. The introduction of the forget gate
allows the LSTM to learn what information to retain through memory during each time
step [12]. Finally, post-processing with a local beam search was compared against global
beam search and thresholding methods for transcription accuracy. Both beam search
binarization methods provided negligible improvement over thresholding with a best
reported F-Measure of 0.7451 for acoustic piano transcription [3].
3 Liu et al. 2018
In an alternate AMT model, Liu et al. proposed a two-channel CNN processing pitch estimation and note onset detection in parallel, as shown in Figure 2.3. Each CNN was specialized to its
purpose with the use of differing convolutional kernel and max pooling region sizes.
Figure 2.3: Network Structure: Liu et al. 2018
The outputs of these two networks were combined in the note tracking stage, which
differed from Sigtia and Wang, with the use of a Multilayer Perceptron (MLP) network rather than a type of RNN. An MLP network is a feed-forward network with multiple
internal layers; this type of network differs from RNNs as it does not maintain any
knowledge of previous inputs. The output of the MLP based note tracking stage was
processed with thresholding resulting in a reported F-Measure of 0.6502 for frame based
transcription of acoustic piano recordings [7].
4 Valero-Mas et al. 2018
To investigate note tracking methods, Valero-Mas et al. proposed a network model where
the acoustic model output consisted of three types. Similar to the model introduced by Liu et al., the acoustic model processed note onset and MPE in parallel, but also added a preliminary binarization of the MPE output.
Figure 2.4: Network Structure: Valero-Mas et al. 2018
This binarized MPE was produced using thresholding, with additional filter-based post-processing to reduce note segmentation and spurious notes. This variation of the acoustic
model was kept consistent throughout the testing of various NT stages using classifiers,
Support Vector Machines (SVM), and MLP networks. Direct comparison of transcription
F-Measure is conducted against an implementation of Sigtia et al.’s 2016 network which
achieved an F-Measure of 0.65 in testing. Of the note tracking strategies tested,
transcription results from the MLP based NT stage reported the highest F-Measure of 0.70 for frame-based evaluations, showing an improvement of 5% when perfectly accurate note onset information is provided to the NT stage.
5 Chapter Summary
The state-of-the-art methods discussed in this chapter share a basic two-stage structure
consisting of an MPE and NT stage respectively. Differences in the MPE stage of each
network include convolutional network organization, and the addition of note onset
information. In these state-of-the-art networks, a variety of NT stages are investigated
including RNN, LSTM, MLP, and SVM implementations. Of these networks, Sigtia et
al.’s has the simplest design and will be used as a basis for initial network development in
Chapter 3.
CHAPTER III
DEVELOPING AN AMT NETWORK
To cultivate an understanding of neural network based AMT, a basic model is considered
for initial investigation. The following sections cover the three high level requirements to
investigate and improve AMT networks. First, a network model is developed with
guidance from state-of-the-art methods. Next, an appropriate dataset is selected, with
which to train and test the network model. Finally, evaluation criteria are discussed by
which network transcription accuracy is measured.
1 AMT Network Model
The initial model for AMT investigation was implemented as a two-stage HMM, which
includes the MPE and NT stages in series, based on Sigtia et al.’s work in 2016. This
model was chosen for its simplicity of understanding and implementation, when
compared to deeper models requiring a separate note onset detection model. The MPE
stage, or acoustic model, consists of a single convolutional block taking a sequence of one
dimensional feature vectors as input. A convolutional block refers to a group of
computations, which includes a convolution, followed by a max pooling, and finally
dropout. This convolutional block is followed by flattening and batch normalization layers.
Each vector in the input sequence, discussed further in the following section, is a single
frame of preprocessed audio in the form of a spectrogram that has a frame length of 264.
Figure 3.1: Basic LSTM AMT Network Structure
The convolution portion of the convolutional block, as discussed in Chapter 2 Section 1, is
used to spatially map input features to detect and separate individual pitches from the
spectrogram. Sixty four filters with a convolutional kernel of length three, with a stride of
one were used to detect edges within the data. Max pooling was used by Wang et al. to
reduce the convolutional output dimensionality by selecting the maximum value in a
region. For this network a max pooling size of two was selected, reducing the convolution
output size by half [3]. Finishing the convolutional block, a process known as Dropout is
used as an effective measure to prevent overfitting, which occurs when the network output
corresponds too closely to training data. This renders the neural network unable to
accurately predict unseen data in the testing sets [13]. Dropout prevents overfitting of the
network by removing hidden nodes based on an implementation-specific probability and is
commonly used in state-of-the-art methods [1][3][4]. A dropout probability of 0.25 was
used within the convolutional block. The final two layers in the acoustic model consist of
a flattening stage, which restructures the convolutional output matrix into a one
dimensional vector, and a batch normalization layer added to reduce training time [14].
The resulting output of this MPE stage is a pitch activation probability matrix, or
posteriorgram, representing a sequence of activation probabilities for each of the 88 piano
pitches. This is directly input into the NT stage which acts as a smoothing function [4],
and is comprised of a single fully connected LSTM with a dropout value of 0.1 followed
by a dense layer. A fully connected, or dense, layer is one where each node in a layer has
connections with every node in the preceding layer. A diagram of the basic AMT network
showing all of the component layers is provided in Figure 3.1.
This network model, and all subsequent models discussed in future chapters, were
implemented with the Keras library, which is a high-level neural network API with
support for conducting network computations on a Graphics Processing Unit (GPU) [15].
The binary cross-entropy loss function, provided by Keras, was determined to be the most
appropriate as several output nodes can be active at a time. The Adam optimizer was
chosen to increase training performance [16]. Network models were trained on a Windows
10 machine with an NVIDIA GeForce GTX 980 Ti graphics card containing six gigabytes
of GDDR5 video memory.
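As a concrete illustration, a minimal Keras sketch of the network just described follows. The filter count, kernel length, pooling size, dropout values, loss, and optimizer are taken from the text; the activation functions and the LSTM width of 88 are assumptions.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Conv1D, MaxPooling1D, Dropout, Flatten,
                                     BatchNormalization, LSTM, Dense,
                                     TimeDistributed)

N_BINS, N_KEYS = 264, 88  # CQT frame length and piano pitch count

model = Sequential([
    # MPE stage: one convolutional block applied to each frame in the sequence
    TimeDistributed(Conv1D(64, kernel_size=3, strides=1, activation='relu'),
                    input_shape=(None, N_BINS, 1)),  # (sequence, frame, channel)
    TimeDistributed(MaxPooling1D(pool_size=2)),      # halves the convolution output
    TimeDistributed(Dropout(0.25)),                  # dropout probability from the text
    TimeDistributed(Flatten()),
    BatchNormalization(),
    # NT stage: fully connected LSTM with dropout 0.1, followed by a dense layer
    LSTM(N_KEYS, return_sequences=True, dropout=0.1),
    Dense(N_KEYS, activation='sigmoid'),             # 88 pitch activation probabilities
])
model.compile(optimizer='adam', loss='binary_crossentropy')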
2 The MAPS Dataset
To train and test the proposed network, the MIDI Aligned Piano Sounds (MAPS) dataset
[17] was selected for its completeness and diversity. While other sources of MIDI-aligned
piano audio such as LabROSA [18] or Batik [19] are available, they possess limited scope
containing 29 and 13 recordings respectively. Additionally, the MAPS dataset is the basis
for many reported state-of-the-art transcription results [1][3][4][7].
The MAPS dataset contains approximately 40 gigabytes, i.e. about 65 hours, of piano
recordings. This includes audio recordings in .wav format, Musical Instrument Digital
Interface (MIDI) ground truth files, and text file representations of the MIDI events. The
dataset is divided into nine coded subsets according to the instrument and location used to
record the audio as shown in Table 3.1. Each subset is further divided into groups with
each containing a comprehensive set of isolated notes (ISOL), random chords (RAND),
common chords (UCHO), and a limited selection of complete classical music
compositions (MUS).
Table 3.1: MAPS: Instruments and Recording Conditions
Subset Recording Conditions Real Instrument or Software
StbgTGd2 Software default The Grand 2 (Steinberg)
AkPnBsdf Church Acoustic Piano (Native Instruments)
AkPnBcht Concert Hall Acoustic Piano (Native Instruments)
AkPnCGdD Studio Acoustic Piano (Native Instruments)
AkPnStgb Jazz Club Acoustic Piano (Native Instruments)
SptkBGAm “Ambient” The Black Grand (Sampletekk)
SptkBGCl “Close” The Black Grand (Sampletekk)
ENSTDkAm “Ambient” Real Piano (Disklavier)
ENSTDkCl “Close” Real Piano (Disklavier)
Of these sets, ISOL is the most diverse, containing single, long, staccato, and repeated notes. The ISOL recordings also contain sets for chromatic scales and trills. Further, all samples in ISOL contain variations for loudness, P (piano), M (mezzo-forte), and F (forte), and
usage of the damper pedal, which sustains notes for longer than the key is held and also
increases harmonics within the instrument. RAND provides procedurally generated sets of
random chords containing 2-7 notes without musical knowledge applied, and provides
varying loudness and damper pedal application. The UCHO set provides usual chords
from Western music such as classical or jazz in the same manner as RAND. Finally, the
MUS set provides 270 pieces of classical and traditional music generated from standard
MIDI files available from Goto et al. under the Creative Commons license [20]. Each subset listed in Table 3.1 contains 30 randomly chosen pieces of music [17].
2.1 Preprocessing
To prepare the audio file for input into the neural network it must first be converted into a
spectrogram representation. The Constant Q Transform (CQT) has been shown to be
fundamentally better suited as a time-frequency representation because the frequencies of musical notes are geometrically spaced, making pitch linear on a logarithmic frequency axis [21]. Additionally, the CQT is frequently chosen in state
of the art methods due to its lower dimensional representation when compared to
Short-Time Fourier Transform (STFT) thus reducing the number of input parameters into
the network [3][4][7].
The librosa library is a Python API which provides audio processing functionality and was
used to compute the CQT for each audio file in the dataset [22]. Each file was sampled at
22.05 kHz and the CQT was computed for each key frequency from A0 to C8, 27.5 Hz to
4186.01 Hz, at 3 bins per key and a hop size of 512 samples. This resulted in a
264-dimensional input vector sequence with a period of 23.2 ms. The CQT can thus be represented as a matrix with shape (264, t), where t is the total number of samples in the audio input divided by the hop size; the transpose of the CQT matrix is then taken and saved for input into the network.
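A sketch of this preprocessing using librosa follows; the parameters mirror the text, while the decibel scaling is an assumption since the text does not state how magnitudes were scaled.

import numpy as np
import librosa

def compute_cqt(path, hop_length=512):
    y, sr = librosa.load(path, sr=22050)
    # 88 keys x 3 bins per key = 264 bins from A0 (27.5 Hz), i.e. 36 bins per octave
    cqt = librosa.cqt(y, sr=sr, hop_length=hop_length,
                      fmin=librosa.note_to_hz('A0'),
                      n_bins=264, bins_per_octave=36)
    db = librosa.amplitude_to_db(np.abs(cqt), ref=np.max)  # log-magnitude in dB
    return db.T  # transpose from (264, t) to (t, 264) for network input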
The ground truth labels also required preprocessing for compatibility with the neural
network. A custom function was written to map the MIDI events from a text file, with
format exemplified in Table 3.2, to a label matrix with shape (t, 88) representing the state of the 88 keys. A process is outlined by the algorithm in Figure 3.2 to convert the event file into
Table 3.2: MIDI Event File Format
Onset Offset MIDI Pitch
0.500004 2.500006 72
0.500004 2.500006 73
0.500004 3.500006 86
Onset and Offset values are represented in seconds. MIDI pitches range from 21 to 108 for piano.
a label representation.
Figure 3.2: Label Generation Algorithm
function GENERATELABELS(audioFile, eventFile, hopSize)
    numSamples ← GETSAMPLENUMBER(audioFile)
    timeStep ← 1 / GETSAMPLERATE(audioFile)
    labels ← zeros(numSamples, 88)
    for all line in eventFile do
        start ← line[0] / timeStep
        end ← line[1] / timeStep
        note ← line[2] − 21
        labels[start : end, note] ← 1
    end for
    labels ← DOWNSAMPLE(labels, hopSize)
    return labels
end function
First, the corresponding audio file is loaded and sampled at 22.05 kHz to retrieve the total
number of samples before downsampling. An empty matrix of shape (t, 88) is created to
represent the labels where t is the number of frames. Each line of the MIDI event file is
then read and all samples between the onset and offset times, inclusive, are set to active.
Finally, the label matrix is downsampled with a hop size of 512, rendering the number of
frames equivalent to that in the corresponding CQT. Examples of the CQT and label
output are shown in Figures 1.1, 1.2, and 1.3.
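A hedged Python rendering of Figure 3.2 follows; the structure matches the algorithm, while the text-file parsing details (whitespace-separated columns, one header row) are assumptions about the MAPS event files.

import numpy as np
import librosa

def generate_labels(audio_file, event_file, hop_size=512):
    y, sr = librosa.load(audio_file, sr=22050)   # sample to get the frame count
    labels = np.zeros((len(y), 88), dtype=np.uint8)
    with open(event_file) as f:
        next(f)  # assumed header row: OnsetTime OffsetTime MidiPitch
        for line in f:
            onset, offset, pitch = line.split()[:3]
            start = int(float(onset) * sr)       # seconds to sample index
            end = int(float(offset) * sr)
            note = int(pitch) - 21               # MIDI 21..108 to key index 0..87
            labels[start:end + 1, note] = 1      # active from onset to offset inclusive
    return labels[::hop_size]                    # downsample to one row per CQT frame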
3 Evaluation
Results produced by the network are evaluated for Accuracy, Precision, Recall, and
F-Measure with the latter being the preferred measure of transcription correctness [1].
F-Measure provides a balanced metric between Precision, a measure of how many
instances are classified correctly, and Recall, which is a measure of how many instances are missed. Precision and Recall provide better insight into model suitability than Accuracy for datasets with imbalanced classes, such as audio signals.
Accuracy = N_TP / (N_TP + N_FP + N_FN)

Precision = N_TP / (N_TP + N_FP)

Recall = N_TP / (N_TP + N_FN)

F-Measure = (2 · Precision · Recall) / (Precision + Recall)
where N_TP, N_FP, and N_FN are the number of true positives, false positives, and false negatives,
respectively. The Scikit-Learn library was used in all cases to calculate the evaluation
metrics [23].
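A minimal sketch of this frame-level evaluation follows, assuming binarized (0/1) label and prediction matrices of shape (frames, 88); the Accuracy term follows the definition above rather than Scikit-Learn's accuracy_score.

import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

def evaluate_frames(y_true, y_pred):
    yt, yp = y_true.ravel(), y_pred.ravel()
    n_tp = np.sum((yt == 1) & (yp == 1))
    n_fp = np.sum((yt == 0) & (yp == 1))
    n_fn = np.sum((yt == 1) & (yp == 0))
    return {
        'accuracy': n_tp / (n_tp + n_fp + n_fn),  # definition given above
        'precision': precision_score(yt, yp),
        'recall': recall_score(yt, yp),
        'f_measure': f1_score(yt, yp),
    }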
4 Chapter Summary
In this chapter, a basic AMT network was developed and discussed based on the two-stage
models used by state-of-the-art methods. The MAPS dataset was selected for training and
testing due to its robust selection of piano recordings. Finally, evaluation metrics were
discussed and selected for comparison against state-of-the-art results. With a basic AMT
network implementation, dataset, and evaluation metrics chosen, initial AMT investigation is conducted in Chapter 4.
CHAPTER IV
PRELIMINARY AMT NETWORK INVESTIGATION
With the network structure, dataset, and evaluation metrics determined in Chapter 3, the
effects of data partitions, training iterations (epochs), and network structure can be
explored. A data partition determines which portions of the dataset constitute the training,
validation, and testing data for the neural network experiment. Three data partitions,
detailed in following sections, were developed for investigation and are outlined in
Table 4.1.
Table 4.1: Investigation Data Partitions
Name                 MAPS Subsets          Training Audio Type      Testing Audio Type
Notes & Chords       SptkBGCl, StbgTGd2    ISOL, RAND, UCHO         MUS
Limited Set (LSET)   All                   RAND, UCHO, MUS          MUS
Complete Set (CSET)  All                   ISOL, RAND, UCHO, MUS    MUS
The basic AMT network described in Section 3.1 requires a fixed sequence of frames as
input for training and testing. A sequence length of 100 was selected, representing 2.36
seconds of audio, to capture music language features. For each data partition, a single
feature and label matrix pair was constructed by concatenating the respective input and
output data together. These matrices were then divided into sequences of length 100 for
input into the neural network. The following sections explore the effects on transcription
accuracy when training on only notes and chords, the addition of music and increased
training iterations, and the effects of network ensembling.
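A sketch of the sequence slicing just described follows; truncating the final partial sequence is an assumption, since the text does not say how leftover frames were handled.

import numpy as np

def make_sequences(features, labels, seq_len=100):
    # features: (t, 264) CQT frames; labels: (t, 88) piano-roll ground truth
    n = (features.shape[0] // seq_len) * seq_len  # drop any trailing partial sequence
    x = features[:n].reshape(-1, seq_len, features.shape[1])
    y = labels[:n].reshape(-1, seq_len, labels.shape[1])
    return x, y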
1 Notes and Chords
To test whether a network could successfully be trained on only notes and chords, a data
partition by the same name was compiled. Data selection was limited to two of the nine
MAPS subsets, SptkBGCl and StbgTGd2, to facilitate short training and testing times.
From these two groups, the training set included all the files from the ISOL, UCHO, and
RAND audio files. Construction of the validation and testing sets was completed using a
permutation of the list of MUS files in the chosen MAPS subsets. The first ten music
pieces were selected as the validation set, and the next ten were selected as the testing set.
Figure 4.1: Notes and Chords Ground Truth: First 2000 frames
The AMT network was trained for five iterations with a batch size of thirty-two, which required 368 seconds, or 6.1 minutes, to complete. Figure 4.1 shows the label values for
the first two thousand frames of the output, with Figure 4.2 visualizing the transcription
produced by the network. A brief visual inspection of Figure 4.2 reveals that the network
did not perform adequately, with very little resemblance to Figure 4.1.
Prediction metrics yielded an F-Measure slightly below 3% for this subset. The errors in
Figure 4.2: Notes and Chords Transcription: First 2000 frames
Figure 4.3: Notes and Chords Transcription Error: First 2000 frames
the transcription are visualized in Figure 4.3, with false negatives represented in blue, and
false positives in red. Table 4.2 records the prediction metrics calculated against the entire
output of the testing set. Several modifications to the network were made to improve these
Table 4.2: Evaluation Metrics: Notes and Chords
Iteration Accuracy Precision Recall F-Measure
5 0.0967 0.2022 0.0749 0.1015
metrics: the number and type of layers, iterations, and batch size were altered, with negligible improvement to transcription accuracy, indicating that changes to the dataset were required.
2 Music and Chords
The preliminary training sets were constructed on the premise that a network could be
successfully trained to transcribe music using only isolated notes, and several variations of
individual chords. This assumption was found to be false as melodies contained in music
are composed of sequences of chords and notes. As discussed in Chapter 1, these
sequences cause additional harmonics in the instrument resulting in significantly different
power spectra, as evidenced by comparing Figures 4.1 and 4.2. Additionally, musical
language information such as key, rhythm, and tempo are unavailable when training only
on notes and chords. This issue of polyphonic harmonics and addition of music language
information is easily resolved by training the network on files from the MUS set [4].
The datasets were reconstructed with the training set consisting of the complete UCHO
and RAND sets, and MUS files that were not included in the validation or testing sets. The
ISOL set was excluded from the training set as preliminary experiments, not discussed in
this thesis, demonstrated network improvement with its exclusion. This recompiled set
was referred to as the Limited Training Set (LSET) due to its exclusion of the ISOL set.
The network was trained using LSET twice for five and fifteen iterations. Each round of
training was conducted with a batch size of thirty-two, taking around 280 seconds per
iteration.
Figure 4.4 shows the label values for the first two thousand frames of the testing set
output. The transcription shown in Figure 4.5 bears a much stronger resemblance to the
expected output indicating significant improvement in network performance with the
addition of MUS data to the training set. The F-Measure for this slice of the output was
calculated to be 0.7770, which is twenty-six times greater than what was found on a similar slice of the previous network's output. Figure 4.6 represents the transcription errors and
illustrates that the network continues to have difficulty recognizing some notes. Further
inspection of the false positives shows that these errors commonly occur after an actual
note has played which indicates poor note offset detection.
Figure 4.4: LSET Ground Truth: First 2000 Frames
Table 4.3 compares the prediction metrics for the entire network’s output for both
instances of the AMT network trained on the LSET dataset. Increasing the number of
epochs yielded an increase of 0.01947 in the F-Measure, equaling a 2% increase in
accuracy. Training on the LSET dataset improved the F-Measure of the network to 6.75 times that of the initial network, demonstrating that training the network on actual music is
Figure 4.5: LSET Transcription: First 2000 Frames
Figure 4.6: LSET Transcription Error: First 2000 Frames
required to get reasonable transcription accuracy.
Table 4.3: Evaluation Metrics: LSET
Iteration Accuracy Precision Recall F-Measure
5 0.1976 0.7703 0.6429 0.6662
15 0.2284 0.7511 0.6899 0.6857
3 Network Ensembling
An ensemble of neural networks has been shown to increase general network performance
by reducing the impact of local minima caused by weight initialization in similar
networks. Network improvement is achieved by reconciling the output of two or more
networks via methods such as voting or averaging [24]. To create a network ensemble, two
or more trained models are loaded and supplied input in parallel, with the output of each
model being fed into a merging layer which provides the ensemble output. A high level
representation of this idea is presented in Figure 4.7. This idea was extended to different
AMT models trained for a variable number of epochs on differing data sets.
Figure 4.7: Network Ensembling
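A hedged Keras sketch of this construction follows; the model file names are placeholders, and the 0.4 threshold matches the value used in the experiments below.

import numpy as np
from tensorflow.keras.models import Model, load_model
from tensorflow.keras.layers import Input, Average

def build_ensemble(model_paths):
    members = [load_model(p) for p in model_paths]
    inp = Input(shape=members[0].input_shape[1:])     # shared input tensor
    merged = Average()([m(inp) for m in members])     # average the posteriorgrams
    return Model(inputs=inp, outputs=merged)

# usage sketch: average two trained models and binarize at the 0.4 threshold
# ensemble = build_ensemble(['1La_5It_Lset.h5', '3La_5It_Lset.h5'])
# transcription = (ensemble.predict(x_test) >= 0.4).astype(np.uint8)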
The time distributed polyphonic AMT network described in Figure 3.1 was selected for
this experiment. Based on the networks by Wang et al. and Liu et al., a deeper MPE stage
consisting of multiple convolutional blocks is investigated for improvement in
transcription accuracy [3][7]. The basic AMT network was copied and extended, increasing the number of convolutional blocks within the MPE stage. In addition to
increasing the depth of the network model, the LSET data partition was augmented with
the ISOL files completing the MAPS dataset. This data partition is referred to as the
Complete Set (CSET). Eight network models are described in Table 4.4 for use in the
network ensemble with networks results enumerated in Table 4.5.
Table 4.4: Single Network Model Descriptions
Name            Convolutional Blocks    Iterations    Data Set
1La 5It Lset 1 5 LSET
1La 5It Cset 1 5 CSET
1La 10It Lset 1 10 LSET
1La 10It Cset 1 10 CSET
3La 5It Lset 3 5 LSET
3La 5It Cset 3 5 CSET
3La 10It Lset 3 10 LSET
3La 10It Cset 3 10 CSET
Table 4.5: Single Network Model Results
Name Precision Recall F-Measure
1La 5It Lset 0.7709 0.6404 0.6646
1La 5It Cset 0.7564 0.6331 0.6482
1La 10It Lset 0.7652 0.6591 0.6743
1La 10It Cset 0.7311 0.6855 0.6689
3La 5It Lset 0.7061 0.7336 0.6871
3La 5It Cset 0.7247 0.6104 0.6241
3La 10It Lset 0.7005 0.7055 0.6714
3La 10It Cset 0.7098 0.6568 0.6472
Of the eight network models tested, the 3La 5It Lset network, which consists of three
convolutional layers trained for five epochs on the LSET dataset, performed the best with
an F-Measure of 0.6871. Figure 4.8 illustrates the transcription error in the first 2000
frames of the testing set for this network.
Figure 4.8: 3La 5It Lset Transcription Error: First 2000 Frames
To merge the individual network output in the AMT ensemble, an averaging layer was
used. A post-processing threshold of 0.4, used in previous tests, is maintained to produce a
final transcription. Table 4.6 describes four ensemble networks with their component
models, which were evaluated for general network improvement.
Table 4.6: Ensemble Network Contents
Name Models
Ens 5It Lset 1La 5It Lset 3La 5It Lset – –
Ens 5It Cset 1La 5It Cset 3La 5It Cset – –
Ens 5It Mset 1La 5It Lset 1La 5It Cset 3La 5It Lset 3La 5It Cset
Ens Miter Lset 1La 5It Lset 3La 5It Lset 1La 10It Lset 3La 10It Lset
Table 4.7 compares the output of the four ensemble networks, which, excluding the CSET-only model, performed better than individual networks in terms of F-Measure. The Ens 5It Cset results can be dismissed, as the component models generally performed the worst out of all individual networks. The Ens Miter Lset ensemble, consisting of varied
Table 4.7: Ensemble Network Results
Name Precision Recall F-Measure
Ens 5It Lset 0.7711 0.6996 0.7022
Ens 5It Cset 0.7746 0.6461 0.6675
Ens 5It Mset 0.7902 0.6831 0.6998
Ens Miter Lset 0.7857 0.7178 0.7230
models and training times on the LSET dataset performed the best with an improvement
of 0.0359 in F-Measure when compared to the best single network. Figure 4.9 illustrates
the transcription error for the first 2000 frames of the Ens Miter Lset ensemble. The
ensembled networks consistently scored better in precision by reducing the number of
false positives (red), but suffered a relatively smaller decrease in recall as false negatives (blue) increased, resulting in a greater number of missed notes.
Figure 4.9: Ens Miter Lset: First 2000 Frames
4 Chapter Summary
Preliminary investigation into AMT revealed several key concepts to consider as the
neural networks are improved further. First, training the network on music allows for the
learning of musical structure and harmonic patterns, which is required to achieve adequate
transcription accuracy. Next, the addition of multiple convolutional blocks within the
MPE stage improved transcription accuracy scores for individual networks. Finally, the
application of AMT network ensembling further improved final transcription accuracy and
should be considered in future experiments.
CHAPTER V
IMPROVING AMT NETWORK MODELS
Evaluation of the transcription errors generated by the preliminary networks demonstrates
there is considerable room for improvement in transcription accuracy. Note onset and
offset detection remains an issue, but initial results are promising. A frame context
window is an improvement to the preliminary networks that augments the frame-by-frame input to a two-dimensional matrix [3]. Figure 5.1 illustrates the enhancement, where a window of 2k + 1 frames, centered on time t, is used to determine the label for time t. A k value of four was chosen for the remaining experiments, resulting in an input matrix with a shape of 9 x 264 at each time slice.
Figure 5.1: Frame Context Window
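A sketch of the context window construction follows; padding the edges of the piece with -80 dB "silence" frames is an assumption, made so that every frame, including the first and last, receives a full window.

import numpy as np

def context_windows(features, k=4):
    # features: (t, 264) CQT frames; output: (t, 2k+1, 264) windows centered on t
    padded = np.pad(features, ((k, k), (0, 0)),
                    mode='constant', constant_values=-80.0)
    return np.stack([padded[t:t + 2 * k + 1]
                     for t in range(features.shape[0])])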
To accommodate the new two-dimensional input shape and the increased network
parameters, multiple modifications were made to the existing network models. Altering
the convolutional layers from one-dimensional to two-dimensional input represents the
primary change, with the number of filters on each convolutional layer being reduced to
enable the network to fit within GPU memory. The two-dimensional input’s effect on
Figure 5.2: Two and Three Convolutional Layer Networks
transcription accuracy was evaluated by implementing two distinct networks with two and
three convolutional layers respectively, which are outlined in Figure 5.2. The MPE stage
in each network is represented in the subnetwork of convolutional layers through the batch
normalization layer, with note tracking represented by the LSTM and dense output layers.
Of note in the three convolutional layer network is the lack of a max pooling layer after
the first convolutional layer, and the second max pooling layer’s pool size is reduced from
(2,2) to (1,2). These differences ensure proper tensor shape as data is processed by the
network layers; this mitigates data loss by ensuring a many-to-one or one-to-one
relationship from the hidden units of the final convolutional layer to those in the LSTM
layer.
Figure 5.3: Per File Network Training
function TRAINNETWORK(dataset, epochs)
    files ← GETDATASETFILES(dataset)
    iteration ← 1
    while iteration ≤ epochs do
        SHUFFLE(files)
        for all file in files do
            data ← GETDATA(file)
            labels ← GETLABELS(file)
            KERASTRAIN(data, labels)
        end for
        iteration ← iteration + 1
    end while
end function
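A hedged Python rendering of Figure 5.3 follows; get_data and get_labels are hypothetical stand-ins for the preprocessing routines of Chapter 3, and the batch size of thirty-two matches the earlier experiments.

import random

def train_network(model, files, epochs, get_data, get_labels):
    # get_data/get_labels: hypothetical loaders returning padded sequence
    # batches of shape (n_seq, 100, ...) and (n_seq, 100, 88) respectively
    for _ in range(epochs):
        random.shuffle(files)                  # new file order each epoch
        for f in files:
            model.fit(get_data(f), get_labels(f),
                      batch_size=32, epochs=1, verbose=0)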
In addition to the frame context window, the network training and testing method was
improved for finer control. Whereas the preliminary networks were trained and tested by
concatenating the files of the respective datasets together before slicing into 100 sample
sequences, the algorithm in Figure 5.3 allows for per file control during training and
testing. This method requires that the entire audio sequence be padded with frames of -80dB,
representing silence, until it is divisible by the sequence length. The padding frames are
prevented from altering network training with the addition of a masking layer on the input,
which skips frames whose values all equal the masking value of -80dB.
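A sketch of the padding and masking scheme follows; the padding routine is an assumption consistent with the description, and Masking is the standard Keras layer serving the role described in the text.

import numpy as np
from tensorflow.keras.layers import Masking

PAD_VALUE = -80.0  # dB value representing silence

def pad_to_sequences(features, seq_len=100):
    remainder = features.shape[0] % seq_len
    if remainder:
        pad = np.full((seq_len - remainder, features.shape[1]), PAD_VALUE)
        features = np.vstack([features, pad])
    return features.reshape(-1, seq_len, features.shape[1])

# first network layer: skip any frame whose values all equal the padding value
mask = Masking(mask_value=PAD_VALUE)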
An evaluation dataset for the two and three convolutional layer networks was developed
by selecting the StbgTGd2 and ENSTDkCl subsets for digital and acoustic representations
respectively, following the data partitioning examples in Wang et al. [3]. Fifteen pieces of
music from each of these two subsets were randomly selected to represent the testing set,
with the remaining files added to the training set which consists of all the MUS data from
the other seven MAPS subsets. Comparison of the results in Tables 5.1 and 5.2 shows the
three convolutional layer network outperforming the two convolutional layer network with
F-Measures of 0.7210 and 0.7125 for digital and acoustic music, respectively. These
results are promising for an individual network as the metrics approach those of the
ensemble networks in Table 4.7.
Table 5.1: Context Window Results - Two Convolutional Layers
Iteration Accuracy Precision Recall F-Measure
Digital
10 0.1787 0.7101 0.6223 0.6247
20 0.2109 0.7187 0.7017 0.6758
30 0.2158 0.7148 0.7184 0.6835
40 0.2328 0.7302 0.7156 0.6904
50 0.1780 0.6757 0.7393 0.6710
Acoustic
10 0.1535 0.6971 0.6021 0.6099
20 0.2088 0.7287 0.7081 0.6866
30 0.2005 0.7069 0.7334 0.6889
40 0.2281 0.7310 0.7234 0.6967
50 0.1796 0.6765 0.7434 0.6765
Table 5.2: Context Window Results - Three Convolutional Layers
Iteration Accuracy Precision Recall F-Measure
Digital
10 0.1643 0.6931 0.7053 0.6612
20 0.2352 0.7479 0.6418 0.6560
30 0.2633 0.7388 0.7458 0.7135
40 0.3334 0.8070 0.6787 0.7104
50 0.2999 0.7614 0.7354 0.7210
Acoustic
10 0.1746 0.7205 0.6963 0.6775
20 0.1914 0.7410 0.6101 0.6357
30 0.2372 0.7413 0.7292 0.7064
40 0.2680 0.8011 0.6511 0.6873
50 0.2610 0.7572 0.7262 0.7125
1 LSTM Network Performance
To rigorously test the three convolutional layer model, referred to as the LSTM network, five datasets were developed for final testing. Maintaining a similar standard to the initial
network test, the StbgTGd2 and ENSTDkCl subsets were used for digital and acoustic
testing. These two subsets share the same thirty pieces of music recorded on different
instruments and in different conditions. This condition was leveraged by selecting a
random subset of half the thirty pieces of music and including both the StbgTGd2 and ENSTDkCl recordings of each piece within the testing set, for a total testing set size of thirty files, or 11% of the total MAPS music corpus. The five test set partitions are described in
Table 5.3.
The files within the StbgTGd2 and ENSTDkCl subsets not included in the testing set were
combined with the music files from the remaining seven MAPS subsets to create the
training set. A validation set was not created for any dataset due to the limited number of
total music pieces available for training. Instead, each network was trained for 100 iterations without the use of early stopping. A snapshot was taken of the network model
Table 5.3: Test Set Partitions
Dataset 1
bor ps6 chpn-p19 deb clai deb menu liz et6
liz et trans5 mz 331 2 mz 331 3 mz 333 2 mz 545 3
schuim-1 scn15 11 scn16 3 scn16 4 ty mai
Dataset 2
alb se2 bk xmas5 chpn-p19 grieg butterfly liz rhap09
mz 331 3 mz 332 2 mz 333 2 mz 333 3 mz 545 3
mz 570 1 pathetique 1 schu 143 3 schuim-1 scn15 11
Dataset 3
alb se2 bk xmas4 bor ps6 deb clai liz et trans5
liz rhap09 mz 545 3 mz 570 1 pathetique 1 scn15 11
scn15 12 scn16 3 scn16 4 ty maerz ty mai
Dataset 4
alb se2 bk xmas1 bk xmas4 bor ps6 chpn-p19
deb clai deb menu liz et6 mz 332 2 mz 333 2
mz 333 3 mz 545 3 mz 570 1 schuim-1 ty maerz
Dataset 5
bk xmas1 bk xmas4 bor ps6 chpn-e01 deb menu
grieg butterfly liz et trans5 liz rhap09 mz 333 3 mz 545 3
pathetique 1 schuim-1 scn15 11 scn15 12 ty maerz
before training, and after every five iterations until training was completed. Transcriptions
were determined by a threshold of 0.5. Complete testing results for the LSTM network are
provided in Appendix A, with Table 5.4 summarizing the network iterations with the
highest F-Measure for each dataset.
Table 5.4: LSTM Network Result Summary
Dataset Iteration Accuracy Precision Recall F-Measure
Digital
1 100 0.3568 0.7797 0.7958 0.7628
2 60 0.3428 0.7702 0.7908 0.7556
3 100 0.3715 0.8027 0.6945 0.7195
4 90 0.3176 0.7693 0.8431 0.7793
5 85 0.3306 0.7672 0.7662 0.7427
Acoustic
1 90 0.2917 0.7341 0.7784 0.7275
2 60 0.2955 0.7458 0.7331 0.7113
3 95 0.2538 0.7044 0.7406 0.6935
4 90 0.3020 0.7531 0.8248 0.7614
5 85 0.3090 0.7423 0.7233 0.7060
2 BLSTM Network Performance
The final improvement investigated for single AMT neural networks is the
implementation of Bidirectional LSTM (BLSTM) layers within the Music Language
Model. BLSTMs are a form of Bidirectional Recurrent Neural Network (BRNN) shown to
improve speech recognition and machine translation tasks which are similar in nature to
automatic music transcription [25][26]. BRNNs allow networks to learn representations
from both the past and future of the sequence by implementing two RNN layers in parallel
where the first RNN processes the input sequence in order and the second RNN processes the input in reverse order, as shown in Figure 5.4.
Figure 5.4: Bidirectional Recurrent Neural Network
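As a concrete illustration, swapping the NT-stage LSTM for a BLSTM uses the Bidirectional wrapper provided by Keras, mentioned below; the layer width of 88 is an assumption carried over from the earlier sketches.

from tensorflow.keras.layers import LSTM, Bidirectional, Dense

# replaces the single forward LSTM in the Music Language Model; the wrapper
# runs one LSTM over the sequence in order and a second in reverse, then
# combines their outputs at each time step
blstm = Bidirectional(LSTM(88, return_sequences=True, dropout=0.1))
output_layer = Dense(88, activation='sigmoid')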
Forward propagation in BRNNs is conducted in two stages where the input is processed in
its entirety from the initial frame until the end of the sequence is reached, after which the
process is repeated in reverse starting at the end of the sequence and progressing towards
the beginning. After both stages are complete, the output is computed and
backpropagation is conducted [27]. The bidirectional implementation provided by the
Keras library was utilized to replace the LSTM in the Music Language Model with a
BLSTM. The resultant BLSTM network model was trained on the dataset partitions in
Table 5.3 with complete testing results available in Appendix B. Table 5.5 lists the model
iteration with the highest F-Measure for each dataset.
Table 5.5: BLSTM Network Result Summary
Dataset Iteration Accuracy Precision Recall F-Measure
Digital
1 100 0.4608 0.8248 0.8124 0.7990
2 80 0.4537 0.8110 0.8085 0.7901
3 95 0.4168 0.8022 0.7664 0.7626
4 100 0.3834 0.7946 0.8628 0.8055
5 95 0.3667 0.7813 0.7898 0.7635
Acoustic
1 95 0.3982 0.7891 0.7906 0.7670
2 75 0.3957 0.7820 0.7703 0.7537
3 70 0.3358 0.7621 0.7536 0.7338
4 100 0.3501 0.7678 0.8494 0.7838
5 95 0.3695 0.7598 0.7671 0.7404
Of particular note in Table 5.5 is that nearly all of the F-Measures reported for the digital and acoustic datasets outperformed the F-Measure of 0.7476 reported by Sigtia et al. [4].
Additionally, the acoustic F-Measures outperformed the 0.7451 on the ENSTDkCl subset
reported by Wang et al. [3].
3 Ensemble Performance
To achieve the highest possible frame-by-frame F-Measure, network ensembles were created by selecting the two highest performing model iterations from each of the LSTM and BLSTM networks. Keras was used to load the trained models, which were given input data in parallel, with the network outputs being processed by an averaging layer. Table 5.6
outlines the ensemble network’s structure, listing the per model iteration snapshots used
for each dataset. A complete report of per file and overall evaluation metrics is provided in
Appendix C, with Table 5.7 summarizing the overall results for each dataset.
The network ensemble results show overall improvement in F-Measure over the BLSTM networks, between 0.0099 and 0.0219 for the digital testing sets and between 0.0015 and 0.0219 for the acoustic testing sets,
Table 5.6: Ensemble Network Contents
Dataset LSTM Iterations BLSTM Iterations
1 100 90 100 90
2 95 100 80 75
3 100 85 95 70
4 90 85 100 95
5 85 60 95 75
Table 5.7: Ensemble Network Result Summary
Dataset Accuracy Precision Recall F-Measure
Digital
1 0.4899 0.8372 0.8282 0.8135
2 0.4930 0.8287 0.8118 0.8000
3 0.4612 0.8306 0.7633 0.7745
4 0.4690 0.8457 0.8490 0.8274
5 0.4176 0.8132 0.7844 0.7779
Acoustic
1 0.4291 0.8135 0.7910 0.7804
2 0.4296 0.8067 0.7615 0.7618
3 0.3907 0.8081 0.7154 0.7353
4 0.4296 0.8246 0.8284 0.8057
5 0.4093 0.7960 0.7501 0.7501
Analyzing the individual file performance in Appendix C reveals wide variance across
files, outlined in Table 5.8. The general consistency in how files performed across
datasets and the ensemble networks implies that the compositional complexity shown in
Figure 5.6 could contribute significantly to transcription accuracy, with datasets
containing a higher number of complex pieces seeing reduced overall accuracy.
Table 5.8: Ensemble Best and Worst File Results
Dataset Worst File F-Measure Best File F-Measure
Digital
1 mz 545 3 0.6829 scn16 4 0.9131
2 mz 545 3 0.6826 mz 333 2 0.9178
3 liz rhap09 0.6256 scn16 4 0.9004
4 mz 545 3 0.6792 mz 333 2 0.9199
5 liz rhap09 0.6433 ty maerz 0.8921
Acoustic
1 liz et trans5 0.6006 scn16 4 0.9050
2 liz rhap09 0.5932 mz 333 2 0.8906
3 liz et trans5 0.5406 scn16 4 0.8997
4 mz 545 3 0.6731 mz 333 2 0.8919
5 liz et trans5 0.5866 bk xmas1 0.8756
Figure 5.5: Transcription Error: Acoustic Dataset 1 - scn16 4
Figure 5.7 provides a comparison between the ground truth and the acoustic transcription
of the first minute of ’scn16 4’ produced by the ensemble trained on Dataset 1, which
reported an overall F-Measure of 0.9050. While the transcription closely resembles the
ground truth, visual analysis identifies several entirely false positive notes, spurious notes,
and long notes that have been incorrectly segmented. Figure 5.5 presents a visualization
of the errors in the transcription of Figure 5.7, with false positives in red and false
negatives in blue. Most of the false positives in this example can be classified as spurious
notes or incorrect offset determination, where notes were held too long; still, some
extended false positive notes exist and indicate a failure in the MPE stage. The false
negatives affecting the recall metric fall into two categories: incorrect offset
determination, where notes were cut off too early, and long-note segmentation errors,
where sustained notes are broken into several smaller notes, with the former accounting
for the sustained false negatives.
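An error visualization in the style of Figure 5.5 can be generated directly from the
binarized transcription and ground-truth matrices; the following sketch is illustrative,
with all names hypothetical:

import numpy as np
import matplotlib.pyplot as plt

def plot_errors(pred, truth):
    # pred, truth: binary (frames, notes) matrices for one piece.
    pred, truth = pred.astype(bool), truth.astype(bool)
    img = np.ones(pred.shape + (3,))        # white background
    img[pred & ~truth] = [1.0, 0.0, 0.0]    # false positives in red
    img[~pred & truth] = [0.0, 0.0, 1.0]    # false negatives in blue
    plt.imshow(img.transpose(1, 0, 2), origin='lower', aspect='auto')
    plt.xlabel('Frame')
    plt.ylabel('Note index')
    plt.show()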
The investigation and implementation of AMT neural networks has demonstrated steady
improvement, culminating in the LSTM/BLSTM Ensemble, which outperforms
state-of-the-art techniques. Table 5.9 outlines the best results for the network model and
data partition combinations in order of development. The features that most advanced
transcription accuracy were the inclusion of music within the training data, deeper
convolutional networks in the MPE stage, network ensembling, and the implementation
of a Bidirectional LSTM in the NT stage.
Table 5.9: Network Result Summary
Network Model Data Partition F-Measure
Basic AMT Notes & Chords 0.1015
Basic AMT LSET (Chords and Music) 0.6857
3La 5It Lset LSET (Chords and Music) 0.6871
Ens Miter Lset LSET (Chords and Music) 0.7230
Two Convolutional Layers Frame Context Window 0.6967
LSTM (Three Convolutional Layers) Frame Context Window 0.7125
LSTM (Three Convolutional Layers) Dataset 4 0.7614
BLSTM Dataset 4 0.7838
LSTM/BLSTM Ensemble Dataset 4 0.8057
Figure 5.6: Comparison of Composition Complexity
Figure 5.7: Label and Transcription Comparison: scn16 4
CHAPTER VI
CONCLUSION
This thesis has presented a background on Automatic Music Transcription of polyphonic
piano music, which remains an open problem in the field of MIR. The goal of this task is
to convert an audio signal into an accurate high-level representation of the music being
played. Polyphonic AMT is challenging due to the variable nature of audio input and the
interaction of notes within the time-frequency domain. Several network and dataset
configurations based on Sigtia et al. and Wang et al. were investigated [4][3]. The LSTM,
BLSTM, and Ensemble networks developed within this thesis are proposed as
improvements over existing methods for the polyphonic AMT task.
Table 6.1: State of the Art Comparison
Precision Recall F-Measure
Sigtia et al. 2016 [4] 0.7270 0.7694 0.7476
Wang et al. 2018 [3] 0.7009 0.7952 0.7451
Liu et al. 2018 [7] – – 0.6502
Valero et al. 2018 [1] – – 0.70
LSTM 0.7531 0.8248 0.7614
BLSTM 0.7678 0.8494 0.7838
Ensemble 0.8246 0.8284 0.8057
Table 6.1 compares frame-based state-of-the-art transcription metrics for acoustic piano
pieces. Comparison of acoustic results is more applicable to real-world situations, where
instrument sound and quality are variable. When compared against the reported
measurements for Sigtia and Wang, on which the networks were based, the proposed
LSTM/BLSTM Ensemble yields an improvement in F-Measure of 0.0581, with each of
the component networks also outperforming them to a lesser extent. This improvement
can be attributed to the increase in MPE stage depth with additional convolutional blocks,
the substitution of a BLSTM for the LSTM, which provided knowledge of future
sequence information at each frame, and finally network ensembling.
1 Future Work
After review of the results obtained, significant room for network improvement remains
offering several opportunities for future work. Inclusion of note onset and offset
information into NT stage has been shown to improve transcription accuracy in
state-of-the-art methods presenting an opportunity to further improve upon the LSTM and
BLSTM designs [1][3][7]. Additionally, application of post-processing filters used in the
binarization process, to remove spurious and over segmented notes, by Valero-Mas et al.
could be investigated. Review of the transcription results produced by LSTM and BLSTM
ensemble revealed erroneous notes indicating room for improvement in the MPE stage.
Research into new network models, application of machine translation and speech
recognition techniques may provide possible solutions. When developing new network
models, Attention Networks (ANs) should be considered as such networks have shown
promising results in sequence to sequence models for speech recognition [28]. To the best
of the author’s knowledge, these network models have not been applied to the AMT task
at the time of writing. Finally, while the MAPS dataset is robust it contains a limited
number of music pieces isolated to the classical music genre; the development of a more
robust dataset may contribute to improvements polyphonic AMT accuracy.
REFERENCES
[1] J. J. Valero-Mas, E. Benetos, and J. M. Inesta, “A supervised classification approach
for note tracking in polyphonic piano transcription,” Journal of New Music Research,
pp. 1–15, 2018.
[2] A. P. Klapuri, “Automatic music transcription as we know it today,” Journal of New
Music Research, vol. 33, no. 3, pp. 269–282, 2004.
[3] Q. Wang, R. Zhou, and Y. Yan, “Polyphonic piano transcription with a note-based
music language model,” Applied Sciences, vol. 8, no. 3, p. 470, 2018.
[4] S. Sigtia, E. Benetos, and S. Dixon, “An end-to-end neural network for polyphonic
piano music transcription,” IEEE/ACM Transactions on Audio, Speech, and
Language Processing, vol. 24, no. 5, pp. 927–939, 2016.
[5] C. O’Brien and M. D. Plumbley, “Automatic music transcription using low rank
non-negative matrix decomposition,” in 2017 25th European Signal Processing
Conference (EUSIPCO), pp. 1848–1852, IEEE, 2017.
[6] P. Smaragdis, B. Raj, and M. Shashanka, “A probabilistic latent variable model for
acoustic modeling,” Advances in models for acoustic processing, NIPS, vol. 148,
pp. 8–1, 2006.
[7] S. Liu, L. Guo, G. A. Wiggins, et al., “A parallel fusion approach to piano music
transcription based on convolutional neural network,” in 2018 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 391–395,
IEEE, 2018.
[8] E. Benetos, T. Weyde, et al., “An efficient temporally-constrained probabilistic
model for multiple-instrument music transcription,” International Society for Music
Information Retrieval, 2015.
[9] S. Sigtia, E. Benetos, S. Cherla, T. Weyde, A. Garcez, and S. Dixon, “Rnn-based
music language models for improving automatic music transcription,” International
Society for Music Information Retrieval, 2014.
[10] S. Sigtia, E. Benetos, N. Boulanger-Lewandowski, T. Weyde, A. S. d. Garcez, and
S. Dixon, “A hybrid recurrent neural network for music transcription,” in 2015 IEEE
International Conference on Acoustics, Speech and Signal Processing (ICASSP),
pp. 2061–2065, IEEE, 2015.
[11] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by
back-propagating errors,” Nature, vol. 323, no. 6088, p. 533, 1986.
[12] A. Graves, “Generating sequences with recurrent neural networks,” arXiv preprint
arXiv:1308.0850, 2013.
[13] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov,
“Dropout: a simple way to prevent neural networks from overfitting,” The Journal of
Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[14] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training
by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
[15] F. Chollet et al., “Keras.” https://keras.io, 2015.
[16] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv
preprint arXiv:1412.6980, 2014.
[17] V. Emiya, R. Badeau, and B. David, “Multipitch estimation of piano sounds using a
new probabilistic spectral smoothness principle,” IEEE Transactions on Audio,
Speech, and Language Processing, vol. 18, no. 6, pp. 1643–1654, 2010.
[18] G. E. Poliner and D. P. Ellis, “A discriminative model for polyphonic piano
transcription,” EURASIP Journal on Advances in Signal Processing, vol. 2007, no. 1,
p. 048317, 2006.
[19] S. Böck and M. Schedl, “Polyphonic piano note transcription with recurrent neural
networks,” in ICASSP, pp. 121–124, 2012.
[20] M. Goto, H. Hashiguchi, T. Nishimura, and R. Oka, “RWC music database: Music
genre database and musical instrument sound database,” Johns Hopkins University,
2003.
[21] J. C. Brown, “Calculation of a constant q spectral transform,” The Journal of the
Acoustical Society of America, vol. 89, no. 1, pp. 425–434, 1991.
[22] B. McFee, C. Raffel, D. Liang, D. P. Ellis, M. McVicar, E. Battenberg, and O. Nieto,
“librosa: Audio and music signal analysis in Python,” in Proceedings of the 14th
Python in Science Conference, pp. 18–25, 2015.
[23] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel,
M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al., “Scikit-learn: Machine
learning in Python,” Journal of Machine Learning Research, vol. 12, no. Oct,
pp. 2825–2830, 2011.
[24] L. K. Hansen and P. Salamon, “Neural network ensembles,” IEEE Transactions on
Pattern Analysis and Machine Intelligence, vol. 12, no. 10, pp. 993–1001, 1990.
[25] A. Graves, A.-r. Mohamed, and G. Hinton, “Speech recognition with deep recurrent
neural networks,” in IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP), 2013, pp. 6645–6649, IEEE, 2013.
[26] M. Sundermeyer, T. Alkhouli, J. Wuebker, and H. Ney, “Translation modeling with
bidirectional recurrent neural networks,” in Proceedings of the 2014 Conference on
Empirical Methods in Natural Language Processing (EMNLP), pp. 14–25, 2014.
[27] M. Schuster and K. K. Paliwal, “Bidirectional recurrent neural networks,” IEEE
Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.
[28] C.-C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan,
R. J. Weiss, K. Rao, E. Gonina, et al., “State-of-the-art speech recognition with
sequence-to-sequence models,” in 2018 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP), pp. 4774–4778, IEEE, 2018.
APPENDIX A
LSTM NETWORK RESULTS
Table A.1: Digital Results: Single LSTM - Dataset 1
Iteration Accuracy Precision Recall F-Measure
0 0 0.0442 0.4671 0.0789
5 0.1748 0.7183 0.6734 0.6575
10 0.1426 0.6488 0.7333 0.6536
15 0.1244 0.6277 0.7497 0.649
20 0.2633 0.7533 0.7385 0.7165
25 0.2342 0.7209 0.7674 0.7124
30 0.3598 0.8162 0.753 0.7567
35 0.3339 0.8047 0.7397 0.7437
40 0.2901 0.7645 0.7634 0.7353
45 0.2842 0.7565 0.7571 0.7274
50 0.3389 0.7951 0.7714 0.7575
55 0.274 0.7437 0.7763 0.7307
60 0.3242 0.7774 0.7715 0.7476
65 0.3454 0.8061 0.739 0.7446
70 0.362 0.7966 0.7654 0.7557
75 0.2868 0.7387 0.7744 0.7252
80 0.3388 0.786 0.7666 0.7489
85 0.3385 0.784 0.7755 0.7526
90 0.3173 0.7593 0.8077 0.7576
95 0.2263 0.6913 0.7925 0.7068
100 0.3568 0.7797 0.7958 0.7628
Table A.2: Acoustic Results: Single LSTM - Dataset 1
Iteration Accuracy Precision Recall F-Measure
0 0 0.0437 0.4903 0.0786
5 0.1628 0.6898 0.6353 0.6268
10 0.1531 0.6387 0.6915 0.6311
15 0.144 0.6274 0.7094 0.6332
20 0.2475 0.7245 0.6811 0.6712
25 0.2297 0.7051 0.7159 0.6805
30 0.2863 0.7772 0.6662 0.6859
35 0.3053 0.7857 0.6819 0.7003
40 0.26 0.7371 0.7216 0.7001
45 0.2673 0.7347 0.7223 0.6982
50 0.332 0.7823 0.7437 0.7341
55 0.2651 0.723 0.7479 0.7046
60 0.3017 0.7591 0.7333 0.718
65 0.3268 0.7902 0.7044 0.7159
70 0.3028 0.7597 0.7213 0.7096
75 0.2301 0.6911 0.7345 0.6755
80 0.2921 0.7596 0.715 0.707
85 0.3019 0.7513 0.7437 0.718
90 0.2917 0.7341 0.7784 0.7275
95 0.2104 0.6643 0.7588 0.6771
100 0.306 0.7475 0.7599 0.7257
Table A.3: Digital Results: Single LSTM - Dataset 2
Iteration Accuracy Precision Recall F-Measure
0 0 0.04 0.3976 0.0704
5 0.2056 0.7452 0.6694 0.67
10 0.2277 0.7486 0.6579 0.6668
15 0.2456 0.7353 0.7498 0.7135
20 0.1953 0.6872 0.7575 0.6879
25 0.1585 0.6307 0.7934 0.6715
30 0.2697 0.7812 0.5402 0.6076
35 0.1773 0.6608 0.7811 0.6836
40 0.125 0.6186 0.7776 0.6546
45 0.1874 0.6701 0.7814 0.6891
50 0.2237 0.6834 0.8045 0.7084
55 0.2207 0.6845 0.8028 0.7078
60 0.3428 0.7702 0.7908 0.7556
65 0.1692 0.6403 0.8118 0.684
70 0.2079 0.679 0.7903 0.6999
75 0.3597 0.7924 0.7322 0.7349
80 0.1611 0.6358 0.8039 0.6763
85 0.2286 0.6935 0.7848 0.7063
90 0.3578 0.8016 0.7158 0.7298
95 0.3365 0.7653 0.7751 0.7448
100 0.3465 0.7736 0.7782 0.7505
Table A.4: Acoustic Results: Single LSTM - Dataset 2
Iteration Accuracy Precision Recall F-Measure
0 0 0.0402 0.4058 0.0708
5 0.1968 0.7238 0.6186 0.6347
10 0.247 0.7378 0.6053 0.633
15 0.247 0.7149 0.7034 0.6801
20 0.2112 0.6823 0.7125 0.667
25 0.1813 0.6223 0.7597 0.6541
30 0.2201 0.7134 0.4489 0.5196
35 0.2136 0.6717 0.7391 0.674
40 0.1525 0.6125 0.7275 0.6328
45 0.2232 0.6791 0.7466 0.6829
50 0.2279 0.682 0.7531 0.6862
55 0.2086 0.6588 0.7543 0.6718
60 0.2955 0.7458 0.7331 0.7113
65 0.189 0.629 0.7816 0.6654
70 0.2353 0.6735 0.7508 0.6789
75 0.3046 0.7593 0.6642 0.6783
80 0.1926 0.6436 0.7718 0.6707
85 0.2608 0.6961 0.7446 0.6903
90 0.3132 0.7691 0.6473 0.6739
95 0.3074 0.746 0.7181 0.7033
100 0.3104 0.7468 0.7231 0.7069
Table A.5: Digital Results: Single LSTM - Dataset 3
Iteration Accuracy Precision Recall F-Measure
0 0 0.047 0.4481 0.0827
5 0.1345 0.6451 0.6829 0.6282
10 0.2187 0.738 0.6677 0.6674
15 0.2172 0.7169 0.707 0.6812
20 0.2428 0.7359 0.721 0.6978
25 0.3118 0.8022 0.6597 0.6945
30 0.2977 0.7762 0.7044 0.7109
35 0.2605 0.7455 0.7218 0.7036
40 0.1678 0.6335 0.7658 0.6601
45 0.3379 0.8 0.6882 0.7128
50 0.2507 0.719 0.7309 0.6952
55 0.2135 0.6854 0.7531 0.686
60 0.3346 0.7934 0.7008 0.7174
65 0.3362 0.7993 0.6322 0.6781
70 0.2005 0.6673 0.7542 0.6765
75 0.3291 0.7979 0.6574 0.6928
80 0.2173 0.681 0.7431 0.6783
85 0.3262 0.7816 0.7142 0.7191
90 0.1928 0.6608 0.7384 0.6636
95 0.2603 0.7089 0.7817 0.7146
100 0.3715 0.8027 0.6945 0.7195
Table A.6: Acoustic Results: Single LSTM - Dataset 3
Iteration Accuracy Precision Recall F-Measure
0 0 0.0463 0.4329 0.0812
5 0.1365 0.6237 0.6641 0.6109
10 0.2273 0.7255 0.6449 0.6519
15 0.2173 0.71 0.6564 0.652
20 0.2300 0.7172 0.6681 0.6626
25 0.2760 0.7804 0.6175 0.6591
30 0.2717 0.7564 0.6474 0.6688
35 0.2628 0.7449 0.6787 0.6815
40 0.1650 0.6175 0.7326 0.6389
45 0.2895 0.7769 0.6306 0.6668
50 0.2140 0.6856 0.6803 0.6526
55 0.1923 0.6607 0.7079 0.6523
60 0.2891 0.7648 0.6548 0.6761
65 0.2743 0.7619 0.573 0.6234
70 0.1974 0.659 0.7199 0.6564
75 0.2716 0.7579 0.5951 0.6359
80 0.1913 0.6438 0.7029 0.6392
85 0.2767 0.7552 0.6609 0.6749
90 0.1800 0.6498 0.6594 0.6205
95 0.2538 0.7044 0.7406 0.6935
100 0.2972 0.7682 0.6289 0.6621
Table A.7: Digital Results: Single LSTM - Dataset 4
Iteration Accuracy Precision Recall F-Measure
0 0 0.0448 0.4601 0.0793
5 0.1382 0.6719 0.7418 0.6698
10 0.2538 0.7886 0.7397 0.7308
15 0.3017 0.8271 0.6774 0.7138
20 0.1204 0.6299 0.8364 0.6826
25 0.2403 0.7438 0.7857 0.7324
30 0.3397 0.843 0.6569 0.7097
35 0.2678 0.7556 0.8021 0.7486
40 0.1562 0.6524 0.8032 0.6831
45 0.3248 0.8255 0.7187 0.7386
50 0.2149 0.7044 0.8385 0.7343
55 0.3758 0.8358 0.7601 0.7705
60 0.1825 0.6745 0.833 0.7128
65 0.3218 0.8039 0.78 0.7631
70 0.292 0.759 0.8241 0.7621
75 0.3061 0.8227 0.6116 0.6698
80 0.348 0.8239 0.7627 0.7648
85 0.3766 0.8265 0.7819 0.7788
90 0.3176 0.7693 0.8431 0.7793
95 0.1685 0.6628 0.8361 0.7041
100 0.4223 0.8557 0.7549 0.7771
Table A.8: Acoustic Results: Single LSTM - Dataset 4
Iteration Accuracy Precision Recall F-Measure
0 0 0.0459 0.4790 0.0815
5 0.1480 0.6579 0.7191 0.6552
10 0.2418 0.7769 0.7046 0.7078
15 0.2673 0.7953 0.6362 0.6756
20 0.1467 0.6327 0.8196 0.6833
25 0.2638 0.7482 0.7616 0.7257
30 0.3068 0.8192 0.6368 0.6877
35 0.2643 0.7499 0.7689 0.7316
40 0.1904 0.6668 0.7765 0.6859
45 0.3111 0.807 0.6914 0.7164
50 0.2423 0.7105 0.8102 0.7287
55 0.3436 0.8212 0.7264 0.7439
60 0.2131 0.6802 0.8244 0.7168
65 0.3387 0.8109 0.7571 0.7567
70 0.3004 0.758 0.7908 0.7473
75 0.2392 0.7822 0.5281 0.5975
80 0.3340 0.8077 0.7349 0.7422
85 0.3410 0.8017 0.7517 0.7498
90 0.3020 0.7531 0.8248 0.7614
95 0.2103 0.6855 0.8179 0.7156
100 0.3584 0.8298 0.7064 0.7354
Table A.9: Digital Results: Single LSTM - Dataset 5
Iteration Accuracy Precision Recall F-Measure
0 0 0.0547 0.5077 0.0954
5 0.1646 0.7304 0.4480 0.5175
10 0.1864 0.7040 0.4302 0.5011
15 0.129 0.6200 0.7467 0.6433
20 0.1453 0.6283 0.7545 0.6522
25 0.1063 0.5961 0.7429 0.6246
30 0.2800 0.7726 0.6385 0.6691
35 0.1945 0.6798 0.7550 0.6826
40 0.1455 0.6228 0.7733 0.6568
45 0.2958 0.7678 0.7240 0.7180
50 0.2581 0.7306 0.7407 0.7058
55 0.2172 0.6863 0.7802 0.6992
60 0.3100 0.7742 0.7308 0.7247
65 0.1593 0.6470 0.7536 0.6615
70 0.2609 0.7504 0.6907 0.6875
75 0.2295 0.7043 0.7758 0.7082
80 0.1550 0.6314 0.7443 0.6472
85 0.3306 0.7672 0.7662 0.7427
90 0.3472 0.7957 0.7080 0.7235
95 0.2795 0.7503 0.7309 0.7117
100 0.2717 0.7230 0.7776 0.7222
Table A.10: Acoustic Results: Single LSTM - Dataset 5
Iteration Accuracy Precision Recall F-Measure
0 0 0.0541 0.5089 0.0946
5 0.1723 0.6986 0.4062 0.4803
10 0.1493 0.6266 0.3428 0.4143
15 0.1571 0.6245 0.7236 0.6390
20 0.1739 0.6198 0.7235 0.6349
25 0.1306 0.5961 0.7026 0.6114
30 0.2840 0.7485 0.6066 0.6397
35 0.2163 0.6762 0.7169 0.6655
40 0.1810 0.6365 0.7357 0.6507
45 0.2906 0.7514 0.6783 0.6843
50 0.2436 0.7043 0.6954 0.6698
55 0.2378 0.6889 0.7276 0.6785
60 0.2898 0.7492 0.6789 0.6845
65 0.1861 0.6468 0.7106 0.6455
70 0.2563 0.7362 0.6583 0.6637
75 0.2602 0.7114 0.7314 0.6928
80 0.1974 0.6497 0.7087 0.6456
85 0.3090 0.7423 0.7233 0.7060
90 0.3095 0.7639 0.6642 0.6815
95 0.2889 0.7410 0.6868 0.6829
100 0.2642 0.7120 0.7272 0.6912
APPENDIX B
BLSTM NETWORK RESULTS
Table B.1: Digital Results: Bidirectional LSTM - Dataset 1
Iteration Accuracy Precision Recall F-Measure
0 0 0.0449 0.3929 0.0788
5 0.2046 0.7209 0.4798 0.5435
10 0.2256 0.7022 0.4807 0.5397
15 0.3467 0.7865 0.7506 0.7421
20 0.3816 0.8190 0.7313 0.7471
25 0.3548 0.7844 0.7694 0.7520
30 0.3585 0.7751 0.7901 0.7577
35 0.3653 0.7838 0.7656 0.7490
40 0.3795 0.7942 0.7709 0.7584
45 0.3898 0.7927 0.7826 0.7646
50 0.2345 0.6867 0.8371 0.7251
55 0.4172 0.8059 0.7909 0.7762
60 0.4417 0.8249 0.7974 0.7899
65 0.4376 0.8257 0.7791 0.7805
70 0.3662 0.7724 0.8308 0.7764
75 0.4369 0.8208 0.7804 0.7793
80 0.3937 0.7855 0.8357 0.7878
85 0.4466 0.8302 0.7688 0.7770
90 0.4313 0.8084 0.8261 0.7962
95 0.4370 0.8076 0.8174 0.7913
100 0.4608 0.8248 0.8124 0.7990
Table B.2: Acoustic Results: Bidirectional LSTM - Dataset 1
Iteration Accuracy Precision Recall F-Measure
0 0 0.0444 0.3803 0.0774
5 0.1531 0.639 0.3878 0.4521
10 0.1735 0.6066 0.3952 0.4479
15 0.2918 0.7467 0.7069 0.6968
20 0.3265 0.7818 0.6980 0.7079
25 0.3234 0.7546 0.7332 0.7154
30 0.3128 0.7459 0.7417 0.7155
35 0.3285 0.7567 0.7327 0.7169
40 0.3216 0.7526 0.7453 0.7215
45 0.3339 0.7531 0.7539 0.7273
50 0.2502 0.6905 0.8032 0.7138
55 0.3497 0.7629 0.7581 0.7350
60 0.3769 0.7912 0.7687 0.7556
65 0.3817 0.7934 0.7441 0.7433
70 0.3308 0.7503 0.8035 0.7506
75 0.3721 0.7905 0.7376 0.7390
80 0.3671 0.7732 0.8048 0.7659
85 0.3768 0.7908 0.7308 0.7338
90 0.3717 0.7842 0.7927 0.7653
95 0.3982 0.7891 0.7906 0.7670
100 0.3958 0.7913 0.7765 0.7611
Table B.3: Digital Results: Bidirectional LSTM - Dataset 2
Iteration Accuracy Precision Recall F-Measure
0 0 0.0378 0.3675 0.0663
5 0.2047 0.7117 0.6273 0.6344
10 0.2446 0.7412 0.5613 0.6081
15 0.251 0.7066 0.7612 0.7041
20 0.2973 0.7433 0.7238 0.7052
25 0.2619 0.6982 0.8215 0.7275
30 0.3051 0.7419 0.7758 0.7313
35 0.3491 0.7748 0.744 0.7335
40 0.3023 0.7207 0.8105 0.7378
45 0.3891 0.7894 0.7125 0.725
50 0.4036 0.7947 0.7539 0.7508
55 0.3959 0.7893 0.7652 0.7543
60 0.3547 0.7637 0.77 0.7418
65 0.3583 0.7598 0.8118 0.7611
70 0.4073 0.8025 0.7636 0.7597
75 0.4379 0.7966 0.8099 0.7832
80 0.4537 0.811 0.8085 0.7901
85 0.4271 0.795 0.8045 0.7788
90 0.4415 0.8022 0.8038 0.7825
95 0.4215 0.7954 0.7943 0.7732
100 0.4007 0.7823 0.8137 0.7754
Table B.4: Acoustic Results: Bidirectional LSTM - Dataset 2
Iteration Accuracy Precision Recall F-Measure
0 0 0.035 0.3514 0.0616
5 0.2057 0.6761 0.6114 0.6122
10 0.2564 0.7138 0.5723 0.6045
15 0.2526 0.6837 0.7181 0.6712
20 0.2761 0.7107 0.6882 0.6706
25 0.2433 0.6663 0.7969 0.6982
30 0.3014 0.7259 0.7454 0.7088
35 0.3338 0.7533 0.7214 0.7112
40 0.2967 0.6976 0.7905 0.7142
45 0.3509 0.757 0.7006 0.7021
50 0.3638 0.7673 0.7156 0.7158
55 0.3465 0.7585 0.7206 0.7132
60 0.3224 0.7329 0.7448 0.7126
65 0.3562 0.7563 0.7852 0.7471
70 0.3539 0.7701 0.7241 0.7213
75 0.3957 0.7820 0.7703 0.7537
80 0.3821 0.7759 0.7684 0.7497
85 0.3847 0.7719 0.7647 0.7453
90 0.392 0.7733 0.7613 0.7446
95 0.3798 0.7661 0.7623 0.7408
100 0.3832 0.7675 0.7813 0.7515
Table B.5: Digital Results: Bidirectional LSTM - Dataset 3
Iteration Accuracy Precision Recall F-Measure
0 0 0.0469 0.4432 0.0824
5 0.1905 0.6878 0.6982 0.6614
10 0.1043 0.2817 0.1106 0.1457
15 0.2805 0.7435 0.728 0.7078
20 0.2937 0.7525 0.7289 0.7134
25 0.3102 0.7497 0.7564 0.7273
30 0.2734 0.7251 0.7653 0.7162
35 0.3583 0.7754 0.7646 0.7464
40 0.2476 0.6993 0.7053 0.6732
45 0.3718 0.7832 0.7632 0.7510
50 0.3504 0.7658 0.7759 0.7464
55 0.3725 0.7905 0.7451 0.7432
60 0.3322 0.7481 0.7747 0.7362
65 0.3912 0.799 0.7129 0.7298
70 0.38 0.7799 0.7899 0.7623
75 0.3893 0.7829 0.7718 0.7545
80 0.4058 0.7929 0.7628 0.7553
85 0.3916 0.7756 0.7909 0.7616
90 0.3703 0.7633 0.8023 0.7601
95 0.4168 0.8022 0.7664 0.7626
100 0.4225 0.7978 0.7669 0.7605
Table B.6: Acoustic Results: Bidirectional LSTM - Dataset 3
Iteration Accuracy Precision Recall F-Measure
0 0 0.0468 0.4515 0.0825
5 0.1793 0.6564 0.6606 0.6278
10 0.0895 0.1947 0.0746 0.099
15 0.2364 0.7026 0.6979 0.6727
20 0.2685 0.728 0.7046 0.6891
25 0.2721 0.7171 0.7263 0.6951
30 0.2654 0.7125 0.742 0.7003
35 0.3155 0.7466 0.7437 0.7195
40 0.1862 0.6244 0.659 0.6099
45 0.3321 0.7555 0.7369 0.7219
50 0.3101 0.7474 0.7369 0.7172
55 0.3443 0.7726 0.7218 0.7216
60 0.3056 0.7344 0.7528 0.7184
65 0.3363 0.771 0.6876 0.7007
70 0.3358 0.7621 0.7536 0.7338
75 0.3406 0.7624 0.7429 0.7273
80 0.3413 0.7663 0.7231 0.7193
85 0.3367 0.7582 0.7478 0.7287
90 0.3235 0.7369 0.7771 0.7326
95 0.3581 0.774 0.7402 0.7332
100 0.3496 0.7656 0.7344 0.7256
Table B.7: Digital Results: Bidirectional LSTM - Dataset 4
Iteration Accuracy Precision Recall F-Measure
0 0 0.05 0.4985 0.0884
5 0.2307 0.783 0.6188 0.6594
10 0.269 0.7751 0.7379 0.7257
15 0.2396 0.7361 0.8024 0.7374
20 0.3229 0.8074 0.7527 0.7514
25 0.3419 0.8053 0.7846 0.7688
30 0.3143 0.7863 0.7827 0.7582
35 0.3779 0.8271 0.7504 0.7625
40 0.3543 0.8078 0.8103 0.7834
45 0.4074 0.8352 0.7905 0.7899
50 0.3861 0.8172 0.8033 0.7868
55 0.389 0.814 0.8161 0.7914
60 0.4211 0.8453 0.7921 0.7953
65 0.3773 0.8124 0.8226 0.7934
70 0.4277 0.8392 0.8092 0.8021
75 0.3683 0.8022 0.8356 0.7947
80 0.3261 0.7789 0.831 0.7771
85 0.4553 0.8609 0.7853 0.8008
90 0.3914 0.8057 0.8465 0.8023
95 0.4139 0.8212 0.8341 0.8052
100 0.3834 0.7946 0.8628 0.8055
Table B.8: Acoustic Results: Bidirectional LSTM - Dataset 4
Iteration Accuracy Precision Recall F-Measure
0 0 0.048 0.4932 0.0849
5 0.2321 0.748 0.6196 0.6476
10 0.2677 0.7515 0.73 0.7129
15 0.2411 0.723 0.7759 0.7205
20 0.308 0.7739 0.7461 0.7333
25 0.3122 0.7791 0.7557 0.7413
30 0.2992 0.7592 0.7628 0.7342
35 0.3619 0.8097 0.7315 0.7432
40 0.3611 0.8018 0.7994 0.7768
45 0.3702 0.8085 0.7663 0.762
50 0.3496 0.7953 0.7704 0.7586
55 0.3454 0.7877 0.7927 0.7658
60 0.4012 0.831 0.7656 0.7731
65 0.3432 0.7877 0.8027 0.7709
70 0.3758 0.8082 0.7842 0.7729
75 0.3687 0.7935 0.8171 0.7821
80 0.3328 0.7773 0.8106 0.7686
85 0.405 0.8347 0.7631 0.7742
90 0.3565 0.7852 0.82 0.7788
95 0.3661 0.7949 0.8121 0.7804
100 0.3501 0.7678 0.8494 0.7838
Table B.9: Digital Results: Bidirectional LSTM - Dataset 5
Iteration Accuracy Precision Recall F-Measure
0 0 0.0468 0.4954 0.0829
5 0.1967 0.7341 0.5931 0.6202
10 0.1931 0.6905 0.7486 0.6871
15 0.1932 0.6908 0.7219 0.6721
20 0.2504 0.7178 0.7573 0.7076
25 0.2232 0.6895 0.7759 0.6968
30 0.3004 0.7543 0.7621 0.7319
35 0.3459 0.7763 0.765 0.7467
40 0.3308 0.7738 0.765 0.7443
45 0.3317 0.7618 0.773 0.7429
50 0.295 0.7375 0.7797 0.7308
55 0.3104 0.7535 0.7723 0.7368
60 0.3691 0.7875 0.7455 0.7418
65 0.3537 0.7716 0.7824 0.7536
70 0.3187 0.7588 0.766 0.7370
75 0.3409 0.7703 0.7748 0.7480
80 0.2921 0.7304 0.7961 0.7342
85 0.3482 0.7736 0.7575 0.7413
90 0.3452 0.7786 0.7586 0.7439
95 0.3667 0.7813 0.7898 0.7635
100 0.3648 0.7796 0.7586 0.7455
Table B.10: Acoustic Results: Bidirectional LSTM - Dataset 5
Iteration Accuracy Precision Recall F-Measure
0 0 0.0446 0.4714 0.0790
5 0.2032 0.6977 0.5573 0.5875
10 0.2227 0.6841 0.7267 0.6763
15 0.2113 0.6809 0.6999 0.6607
20 0.2647 0.7076 0.7225 0.6873
25 0.2508 0.6848 0.7582 0.6905
30 0.2981 0.7261 0.7328 0.7034
35 0.3193 0.7521 0.7171 0.7081
40 0.3302 0.7563 0.7383 0.7220
45 0.3148 0.7321 0.7415 0.7111
50 0.3099 0.7243 0.7433 0.7075
55 0.3332 0.7475 0.7437 0.7210
60 0.3382 0.7588 0.7008 0.7029
65 0.3506 0.7523 0.7461 0.7257
70 0.3256 0.7461 0.7300 0.7131
75 0.3343 0.7452 0.7501 0.7229
80 0.3212 0.7280 0.7571 0.7165
85 0.3421 0.7497 0.7270 0.7126
90 0.3532 0.7585 0.7282 0.7187
95 0.3695 0.7598 0.7671 0.7404
100 0.3393 0.7549 0.7193 0.7111
APPENDIX C
ENSEMBLE NETWORK RESULTS
Table C.1: Digital Results: Ensemble - Dataset 1
File Accuracy Precision Recall F-Measure
bor ps6 0.5590 0.8535 0.9240 0.8711
chpn-p19 0.2673 0.8162 0.7746 0.7705
deb clai 0.2243 0.8897 0.7574 0.7922
deb menu 0.3356 0.7675 0.7513 0.7402
liz et6 0.3170 0.7718 0.6905 0.6990
liz et trans5 0.2074 0.7580 0.6803 0.6844
mz 331 2 0.6984 0.8951 0.9079 0.8906
mz 331 3 0.4805 0.7182 0.7887 0.7297
mz 333 2 0.7090 0.9120 0.9525 0.9217
mz 545 3 0.6609 0.6781 0.7127 0.6829
schuim-1 0.4335 0.8543 0.8482 0.8315
scn15 11 0.4565 0.8324 0.8526 0.8222
scn16 3 0.2428 0.7733 0.7716 0.7476
scn16 4 0.7670 0.9251 0.9139 0.9131
ty mai 0.7521 0.8947 0.9072 0.8930
Overall 0.4899 0.8372 0.8282 0.8135
Table C.2: Acoustic Results: Ensemble - Dataset 1
File Accuracy Precision Recall F-Measure
bor ps6 0.4026 0.8202 0.9157 0.8512
chpn-p19 0.2200 0.7794 0.7291 0.7212
deb clai 0.2030 0.8402 0.7810 0.7797
deb menu 0.3281 0.7506 0.7355 0.7232
liz et6 0.2985 0.7528 0.6653 0.6798
liz et trans5 0.1513 0.7173 0.5638 0.6006
mz 331 2 0.5432 0.8446 0.8751 0.8433
mz 331 3 0.4176 0.6675 0.7361 0.6733
mz 333 2 0.6384 0.8987 0.9171 0.8954
mz 545 3 0.6630 0.6730 0.6961 0.6726
schuim-1 0.3884 0.8442 0.7565 0.7704
scn15 11 0.4529 0.8503 0.7967 0.8009
scn16 3 0.2568 0.7669 0.7527 0.7378
scn16 4 0.6281 0.9141 0.9116 0.9050
ty mai 0.6451 0.8688 0.8757 0.8610
Overall 0.4291 0.8135 0.7910 0.7804
Table C.3: Digital Results: Ensemble - Dataset 2
File Accuracy Precision Recall F-Measure
alb se2 0.4082 0.9261 0.8583 0.8760
bk xmas5 0.4568 0.9100 0.8629 0.8737
chpn-p19 0.2653 0.8415 0.7270 0.7556
grieg butterfly 0.5977 0.8724 0.8487 0.8434
liz rhap09 0.3869 0.7045 0.6112 0.6311
mz 331 3 0.4714 0.7107 0.7725 0.7184
mz 332 2 0.6380 0.9181 0.9162 0.9066
mz 333 2 0.6966 0.9102 0.9459 0.9178
mz 333 3 0.5761 0.8083 0.8978 0.8321
mz 545 3 0.6828 0.6785 0.7116 0.6826
mz 570 1 0.5848 0.8438 0.8358 0.8194
pathetique 1 0.4236 0.7938 0.8075 0.7799
schu 143 3 0.3244 0.8222 0.7152 0.7422
schuim-1 0.4172 0.8505 0.8277 0.8187
scn15 11 0.5138 0.8531 0.8573 0.8337
Overall 0.4930 0.8287 0.8118 0.8000
Table C.4: Acoustic Results: Ensemble - Dataset 2
File Accuracy Precision Recall F-Measure
alb se2 0.2901 0.8721 0.8522 0.8460
bk xmas5 0.4187 0.8938 0.8762 0.8746
chpn-p19 0.2283 0.7961 0.6871 0.7049
grieg butterfly 0.5291 0.8566 0.8387 0.8305
liz rhap09 0.3582 0.6705 0.5692 0.5932
mz 331 3 0.4178 0.6420 0.7029 0.6396
mz 332 2 0.4128 0.8717 0.8301 0.8312
mz 333 2 0.6342 0.9030 0.9039 0.8906
mz 333 3 0.5685 0.8038 0.8628 0.8155
mz 545 3 0.6547 0.6668 0.6794 0.6604
mz 570 1 0.4811 0.8248 0.7706 0.7727
pathetique 1 0.3693 0.7857 0.7283 0.7333
schu 143 3 0.2733 0.7947 0.6821 0.7114
schuim-1 0.3722 0.8399 0.7249 0.7487
scn15 11 0.4374 0.8543 0.7791 0.7936
Overall 0.4296 0.8067 0.7615 0.7618
Table C.5: Digital Results: Ensemble - Dataset 3
File Accuracy Precision Recall F-Measure
alb se2 0.4196 0.9291 0.8498 0.8721
bk xmas4 0.4395 0.8833 0.8049 0.8196
bor ps6 0.4790 0.8561 0.8974 0.8590
deb clai 0.2217 0.9128 0.7122 0.7748
liz et trans5 0.1922 0.7755 0.5842 0.6320
liz rhap09 0.3947 0.7120 0.5979 0.6256
mz 545 3 0.7217 0.6932 0.6964 0.6845
mz 570 1 0.6004 0.8529 0.8225 0.8184
pathetique 1 0.4419 0.8066 0.7937 0.7814
scn15 11 0.5118 0.8676 0.8254 0.8283
scn15 12 0.2749 0.8980 0.6989 0.7716
scn16 3 0.2448 0.7948 0.7409 0.7436
scn16 4 0.7486 0.9247 0.8959 0.9004
ty maerz 0.6585 0.9089 0.9088 0.8993
ty mai 0.7864 0.9042 0.9044 0.8970
Overall 0.4612 0.8306 0.7633 0.7745
Table C.6: Acoustic Results: Ensemble - Dataset 3
File Accuracy Precision Recall F-Measure
alb se2 0.3099 0.8902 0.8361 0.8443
bk xmas4 0.4307 0.8668 0.8257 0.8262
bor ps6 0.4726 0.8538 0.8909 0.8578
deb clai 0.1924 0.8695 0.7403 0.7685
liz et trans5 0.1441 0.7336 0.4674 0.5406
liz rhap09 0.3518 0.6782 0.5394 0.5762
mz 545 3 0.6764 0.6764 0.6642 0.6578
mz 570 1 0.4871 0.8341 0.7549 0.7682
pathetique 1 0.3769 0.7930 0.6939 0.7169
scn15 11 0.4247 0.8561 0.7598 0.7830
scn15 12 0.1686 0.8575 0.6151 0.6911
scn16 3 0.245 0.7883 0.7110 0.7249
scn16 4 0.6304 0.9296 0.8903 0.8997
ty maerz 0.4690 0.8915 0.8648 0.8601
ty mai 0.6337 0.8806 0.8585 0.8586
Overall 0.3907 0.8081 0.7154 0.7353
Table C.7: Digital Results: Ensemble - Dataset 4
File Accuracy Precision Recall F-Measure
alb se2 0.3894 0.9050 0.8816 0.8782
bk xmas1 0.3643 0.9011 0.8349 0.8490
bk xmas4 0.4097 0.8589 0.8305 0.8220
bor ps6 0.5238 0.8589 0.9104 0.8685
chpn-p19 0.2547 0.8345 0.7659 0.7736
deb clai 0.1855 0.8920 0.7220 0.7708
deb menu 0.3163 0.7571 0.7674 0.7426
liz et6 0.3282 0.7710 0.6960 0.7005
mz 332 2 0.6123 0.9042 0.9251 0.9032
mz 333 2 0.6905 0.9023 0.9601 0.9199
mz 333 3 0.5134 0.7769 0.9135 0.8195
mz 545 3 0.6517 0.6624 0.7235 0.6792
mz 570 1 0.5719 0.8324 0.8502 0.8221
schuim-1 0.4224 0.8371 0.8585 0.8293
ty maerz 0.6140 0.8808 0.9257 0.8917
Overall 0.4690 0.8457 0.8490 0.8274
Table C.8: Acoustic Results: Ensemble - Dataset 4
File Accuracy Precision Recall F-Measure
alb se2 0.2948 0.8404 0.8796 0.8429
bk xmas1 0.4623 0.8716 0.9058 0.8757
bk xmas4 0.3734 0.8301 0.8654 0.8297
bor ps6 0.5106 0.8424 0.9047 0.8579
chpn-p19 0.2220 0.7870 0.7259 0.7236
deb clai 0.2030 0.8569 0.7722 0.7839
deb menu 0.3293 0.7287 0.7642 0.7271
liz et6 0.2955 0.7424 0.6881 0.6882
mz 332 2 0.4390 0.8642 0.8475 0.8372
mz 333 2 0.6280 0.8927 0.9169 0.8919
mz 333 3 0.5460 0.7889 0.8879 0.8176
mz 545 3 0.6538 0.6628 0.7109 0.6731
mz 570 1 0.4492 0.8072 0.7890 0.7734
schuim-1 0.3981 0.8464 0.7678 0.7780
ty maerz 0.4107 0.8414 0.9108 0.8558
Overall 0.4296 0.8246 0.8284 0.8057
Table C.9: Digital Results: Ensemble - Dataset 5
File Accuracy Precision Recall F-Measure
bk xmas1 0.3617 0.9155 0.8128 0.8441
bk xmas4 0.4477 0.8856 0.8521 0.8559
bor ps6 0.5504 0.8700 0.9056 0.8720
chpn-e01 0.1296 0.8763 0.7338 0.7744
deb menu 0.3262 0.7685 0.7405 0.7365
grieg butterfly 0.5707 0.8735 0.8415 0.8389
liz et trans5 0.1941 0.7705 0.6345 0.6623
liz rhap09 0.3742 0.7008 0.6354 0.6433
mz 333 3 0.5783 0.8138 0.9066 0.8390
mz 545 3 0.6709 0.6801 0.7042 0.6807
pathetique 1 0.4406 0.7940 0.8225 0.7894
schuim-1 0.4137 0.8567 0.8347 0.8252
scn15 11 0.4491 0.8457 0.8371 0.8183
scn15 12 0.2678 0.8805 0.7168 0.7743
ty maerz 0.6070 0.8942 0.9099 0.8921
Overall 0.4176 0.8132 0.7844 0.7779
Table C.10: Acoustic Results: Ensemble - Dataset 5
File Accuracy Precision Recall F-Measure
bk xmas1 0.4778 0.8888 0.8884 0.8756
bk xmas4 0.4048 0.8581 0.8562 0.8421
bor ps6 0.5192 0.8611 0.9007 0.8656
chpn-e01 0.1029 0.7778 0.7696 0.7502
deb menu 0.3496 0.7498 0.7424 0.7270
grieg butterfly 0.5625 0.8642 0.8391 0.8355
liz et trans5 0.1493 0.7422 0.5302 0.5866
liz rhap09 0.3685 0.6724 0.5969 0.6089
mz 333 3 0.6164 0.8269 0.8814 0.8371
mz 545 3 0.6632 0.6726 0.6908 0.6696
pathetique 1 0.4090 0.7875 0.7279 0.7345
schuim-1 0.3814 0.8539 0.7476 0.7704
scn15 11 0.4435 0.8512 0.7917 0.8011
scn15 12 0.1943 0.8389 0.6361 0.7010
ty maerz 0.4246 0.8729 0.8684 0.8504
Overall 0.4093 0.7960 0.7501 0.7501