
OMR Evaluation and Prospects for Improved OMR via Multiple Recognizers

Donald Byrd, School of Informatics and School of Music

William Guerin, School of Music

Indiana University

Megan Schindele ??WHERE NOW?

Ian Knopke ??WHERE NOW?

29 June 2009

Abstract

OMR (Optical Music Recognition) programs have been available for many years, but—depending on the complexity of the music, among other things—they still leave much to be desired in terms of accuracy. We studied the feasibility of achieving substantially better accuracy by using the output of several programs to “triangulate” and get better results than any of the individual programs; this multiple-recognizer approach has had some success with other media but, to our knowledge, had never been tried for music until recently. A major obstacle is that the complexity of music notation is such that comparing two versions of a page of music in a meaningful way is extremely difficult for any but the simplest music; this is a problem both for developing a multiple-recognizer OMR system, and for evaluating the accuracy of any OMR system. Existing programs have serious enough limitations that the multiple-recognizer approach is promising, but substantial progress on OMR evaluation is probably essential before much progress on OMR can be made, regardless of the approach; it is certainly essential before anyone can know that progress has been made.

Keywords: Optical Music Recognition, OMR, classifier, recognizer, evaluation

Introduction

This report describes research, part of the MeTAMuSE (Methodologies and Technologies for Advanced Musical Score Encoding) project, on a multiple-recognizer approach to improved OMR (Optical Music Recognition). To our knowledge, such an approach had never been tried with music before MeTAMuSE, though it has been in use for some time in other domains, in particular OCR (Optical Character Recognition). However, the most important result of the project is something that applies equally to all OMR, not just the exotic kind we were concerned with: the state of the art of evaluating OMR is so bad that, in general, there is no way to say how well a given system does, either in absolute terms or compared to another system. Substantial progress on evaluation is probably essential before much progress on OMR can be made, regardless of the approach; it is certainly essential before anyone can know that progress has been made.

To some extent, this unfortunate situation is an outcome of the enormous complexity of music notation; but to a considerable extent, it results simply from a near-total absence of standards. No standard test databases exist, and no standards exist for judging the difficulty for OMR purposes of an image of some music, either in terms of graphic quality or of the complexity of the music itself. Standard ways of counting errors are also nonexistent, though this is undoubtedly due in part to the complexity of notation. There is no easy way to make music notation less complex, but standards are another matter, and we make some proposals in that direction.

1. OMR: Recognizers and Multiple Recognizers in Text and Music

The basis of an optical symbol-recognition system of any type is a recognizer, an algorithm that takes an image that the system suspects represents one or more symbols and decides which, if any, of the possible symbols to be recognized the image contains. The recognizer works by first segmenting the image into subimages, then applying a classifier, which decides for each subimage on a single symbol or none. The fundamental idea of a multiple-recognizer system is to take advantage of several pre-existing but imperfect systems by comparing their results to “triangulate” and get substantially higher accuracy than any of the individual systems. This is clearly a form of N-version programming, and it has been implemented for OCR by Prime Recognition. Its creators reported a very substantial increase in accuracy (Prime Recognition, 2005); they gave no supporting evidence, but the idea of improving accuracy this way is certainly plausible. The point of multiple-recognizer OMR (henceforth “MR” OMR) is, of course, to do the same with music notation. (Of the many forms of music notation in existence, most OMR work focuses on Conventional Western Music Notation, and the current research considers only that form.) The basic question for an MROMR system is how best to merge the results of the constituent single-recognizer (henceforth “SR” OMR) systems, i.e., how to resolve disagreements among them in the way that increases accuracy the most.

The simplest merging algorithm for an MR system is to take a “vote” on each symbol or sequence of symbols and assume that the one that gets the most votes is correct. (Under ordinary circumstances—at least with text—a unanimous vote is likely on most symbols.) This appears to be what the Prime Recognition system does, with voting on a character-by-character basis among as few as three or as many as six SR systems. A slightly more sophisticated approach is to test in advance for the possibility that the SR systems are of varying accuracy, and, if so, to weight the votes to reflect that.
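To make the voting idea concrete, here is a minimal sketch (our illustration, not Prime Recognition's actual algorithm) of unweighted and weighted voting on a single aligned symbol position; the function name and the example weights are assumptions.

```python
from collections import Counter

def vote(symbol_candidates, weights=None):
    """Pick the symbol that gets the most (optionally weighted) votes.

    symbol_candidates: one recognized symbol per SR system, e.g. ['e', 'e', 'c'].
    weights: optional per-system weights reflecting measured accuracy.
    """
    if weights is None:
        weights = [1.0] * len(symbol_candidates)
    tally = Counter()
    for sym, w in zip(symbol_candidates, weights):
        tally[sym] += w
    return tally.most_common(1)[0][0]

# Three recognizers disagree on one character; the simple majority wins.
print(vote(['e', 'e', 'c']))                   # -> 'e'
# With weights reflecting past accuracy, the more trusted system can prevail.
print(vote(['e', 'e', 'c'], [0.8, 0.7, 2.0]))  # -> 'c'
```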


But music is so much more complex than text that such simple approaches appear doomed to failure. To clarify the point, consider an extreme example. Imagine that system A always recognizes notehead shapes and flags (in U.K. parlance, “tails”) on notes correctly; system B always recognizes beams correctly; and system C always recognizes augmentation dots correctly. Also say that each does a poor job of identifying the symbols the others do well on, and hence a poor job of finding note durations. Even so, a MROMR system built on top of them and smart enough to know which does well on which symbols would get every duration right! System A might find a certain note—in reality, a dotted-16th note that is part of a beamed group—to have a solid notehead with no flags, beams, or augmentation dots; B, two beams connected (unreasonably) to a half-note head with two dots; C, an “x” head with two flags and one augmentation dot. Taking A’s notehead shape and (absence of) flags, B’s beams, and C’s augmentation dots gives the correct duration. See Figure 1.

Figure 1.
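The following is a minimal sketch of the attribute-level merge described in the extreme example above, under the assumption that each program's output has already been parsed into notehead shape, flag count, beam count, and augmentation dots; the field names and the duration arithmetic are ours, for illustration only.

```python
from fractions import Fraction

def duration(notehead, flags, beams, aug_dots):
    """Duration as a fraction of a whole note, derived from basic symbols."""
    base = Fraction(1, 2) if notehead == "half" else Fraction(1, 4)   # solid head = quarter
    base /= 2 ** (flags + beams)                        # each flag or beam halves the value
    return base * (Fraction(2) - Fraction(1, 2 ** aug_dots))          # each dot extends it

# What systems A, B, and C report for the dotted-16th note of Figure 1:
A = {"notehead": "solid", "flags": 0, "beams": 0, "aug_dots": 0}  # heads/flags reliable
B = {"notehead": "half",  "flags": 0, "beams": 2, "aug_dots": 2}  # beams reliable
C = {"notehead": "x",     "flags": 2, "beams": 0, "aug_dots": 1}  # dots reliable

# Take each attribute from the system known to get it right.
merged = duration(A["notehead"], A["flags"], B["beams"], C["aug_dots"])
print(merged)  # 3/32 of a whole note, i.e. a dotted 16th
```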

For music, then, it seems clear that one should begin by studying the pre-existing systems in depth, not just measuring their overall accuracy, and looking for specific rules describing their relative strengths and weaknesses that an MROMR system can exploit.

It should also be noted that few high-quality SR systems exist for music, so it is important to get as much information as possible from each.

1.1 Alignment

Compared to text, music presents a difficulty of an entirely different sort. Before you can even think of comparing the symbolic output of several systems, regardless of the type of material, clearly you must know which symbols output by each system correspond, i.e., you must align the systems’ output. (Note that we use the verb “to align” in the computer scientist’s symbol-matching sense, not the usual geometric sense.) Aligning two versions of the same text that differ only in the limited ways to be expected of OCR is very straightforward. But with music, even monophonic music, the plethora of symbols and of relationships among them (articulation marks and other symbols “belong” to notes; slurs, beams, etc., group notes horizontally, and chords group them vertically) makes it much harder. And, of course, most music of interest to most potential users is not monophonic. A fair amount of research has been done on aligning music in what might be called “semi-symbolic” form, where the only symbols are notes, and the only relationships are notes’ membership in logical voices plus temporal ordering within each voice; see for example Kilian & Hoos (2004) or Pardo & Sanghi.


But little or no work has been done on alignment of music in fully-symbolic form. As a minimum, we need barlines as well as the structure just mentioned, for reasons discussed below. Our own work is reported in Knopke & Byrd (2007) and herein.

1.2. Music Notation Semantics and OMR Evaluation

Obviously, the only way to really demonstrate the value of an MROMR system would be to implement one, test it, and obtain results showing its superiority. However, the evaluation of OMR systems in general has proven to be an extremely difficult problem. Some fourteen years have passed since the groundbreaking study by Nick Carter and others at the CCARH (Selfridge-Field, Carter, et al, 1994); remarkably little progress has been made since then. Any discussion of OMR evaluation must begin with a phenomenon that is discussed more or less directly by Reed (1995), Bainbridge & Bell (2001), Droettboom & Fujinaga (2004), and Bellini et al (2007): document-recognition systems can be described at different levels, and they may behave very differently at each level. OMR systems in particular can be described at the pixel level, but the more interesting levels begin with what can perhaps best be called (in the terminology of Bellini et al 2007) basic symbols (graphic elements: noteheads, flags, the letter "p", etc.) and composite or complete symbols (things with semantics: eighth notes with flags, beamed eighth notes, chords, dynamic marks like "p", "pp", and "mp", etc.). There are only a relative handful of basic symbols, but a huge number of composite symbols. Droettboom & Fujinaga comment:

In many classification problems the evaluation metric is fairly straightforward. For example at the character level of OCR, it is simply a matter of finding the ratio between correctly identified characters and the total number of characters. In other classification domains, this is not so simple, for example document segmentation, recognition of maps, mathematical equations, graphical drawings, and music scores. In these domains, there are often multiple correct output representations, which makes the problem of comparing a given output to high-level groundtruth very difficult. In fact, it could be argued that a complete and robust system to evaluate OMR output would be almost as complex and error-prone as an OMR system itself. Symbol-level analysis may not be directly suitable for comparing commercial OMR products, because such systems are usually “black boxes” that take an image as input and produce a score-level representation as output.

Note particularly the last statement; their “symbols” are probably Bellini et al’s “basic symbols”, but their “score-level representation” is a level above Bellini et al’s “composite-symbol-level representation”. In fact, these problems of OMR directly reflect the intricate semantics of music notation. The first note in Figure 1 is a composite symbol consisting of notehead, augmentation dot, stem, and beams (the latter shared with the next note); its duration of a dotted-16th note is clear from the composite symbol. However, to see that its pitch is (in ISO notation) G5 requires taking into account the clef, which is an independent symbol; its pitch could also be affected by an octave sign, a key signature, and accidentals on preceding notes in its measure.
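As an illustration of that context dependency (a sketch of the notation semantics, not code from any of the systems discussed), the following computes a pitch from a staff position together with the clef, the key signature, and accidentals earlier in the measure; octave signs are omitted and the clef table is simplified.

```python
STEP_NAMES = ["C", "D", "E", "F", "G", "A", "B"]
CLEF_BOTTOM_LINE = {"treble": ("E", 4), "bass": ("G", 2), "alto": ("F", 3)}

def pitch(staff_pos, clef, key_sig_alters, measure_accidentals):
    """staff_pos: 0 = bottom staff line, counting up in steps (lines and spaces)."""
    name, octave = CLEF_BOTTOM_LINE[clef]
    idx = STEP_NAMES.index(name) + staff_pos
    step = STEP_NAMES[idx % 7]
    octave += idx // 7
    # An accidental earlier in the measure overrides the key signature.
    alter = measure_accidentals.get(step, key_sig_alters.get(step, 0))
    return step, alter, octave

# The same notehead position (top line of the staff) read three ways:
print(pitch(8, "treble", {}, {}))               # ('F', 0, 5)
print(pitch(8, "alto",   {}, {}))               # ('G', 0, 4)
print(pitch(8, "treble", {"F": 1}, {"F": -1}))  # key signature has F-sharp, but this measure has a flat
```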


As another example of the context dependency of music-notation symbols, a dot following a notehead means one thing; a dot above or below a notehead means something else. In the third measure of Figure 2, certain programs misinterpret each of the staccato dots above the notes as an augmentation dot following the preceding note: this is serious because it gives the notes the wrong duration.

Figure 2.

Bainbridge & Bell (2001) give a well-thought-out and well-written discussion of the problems. They comment that “Reed demonstrates the problem [of calculating accuracy] with an incisive example in which he presents, from the same set of data, two different calculations that yield accuracy levels of 11% and 95% respectively.“ Reed’s example (Reed 1995, p. 73) is Figure 3: the only problem in the reconstructed score is a single missing beam. The low accuracy rate is based on counting composite (high-level) symbols: notes and slurs (1 of 9 is correct); the high rate is from counting basic (low-level) symbols, namely stems, beams, noteheads, and slurs (19 of 20 correct). This example shows how a mistake in one basic symbol can cause numerous errors in composite symbols, and therefore (in this case) in note durations. But context dependence in music notation is not just a matter of basic vs. composite symbols. Along the same lines, mistaking one clef for another is likely to cause errors in note pitches for many following measures: easily dozens, and quite possibly hundreds of notes.

Figure 3.
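For concreteness, the two calculations behind Reed's example, using the counts just given, are:

```latex
\frac{1\ \text{correct composite symbol}}{9\ \text{composite symbols}} \approx 11\%
\qquad\qquad
\frac{19\ \text{correct basic symbols}}{20\ \text{basic symbols}} = 95\%
```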

To sum up, evaluating OMR systems presents at least three major problems.

1. At what level should errors be counted? Droettboom & Fujinaga (2004) point out that “a true evaluation of an OMR system requires a high-level analysis, the automation of which is a largely unsolved problem.” This is particularly true for the “black box” commercial SROMR programs, offering no access to their internal workings, against which we would need to evaluate the MROMR. And the manual techniques available without automation are, as always, costly and error-prone. So neither high-level nor lower-level evaluation is completely satisfactory.

2. Number of errors vs. effort to correct. It is not clear whether an evaluation should consider the number of errors or the amount of work necessary to correct them. The latter is more relevant for many purposes, but it is very dependent on the tools available, e.g., for correcting the pitches of notes resulting from a wrong clef. As Ichiro Fujinaga has pointed out (personal communication, March 2007), it also depends greatly on the distribution and details of the errors: it is far easier to correct 100 consecutive eighth notes that should all be 16ths than to correct 100 eighth notes whose proper durations vary, sprinkled throughout a score. Finally (and closely related), should “secondary errors” clearly resulting from an error earlier in the OMR process be counted, or only primary ones? For example, programs sometimes misrecognize a clef, and as a result get the wrong pitch for every note on its staff. Of course, if one is counting errors at Bellini et al’s “basic symbol” level, this is a non-issue. Bainbridge & Bell (2001) discuss counting errors in terms of operations for an imaginary music editor designed with OMR in mind; in an absolute sense, this might be the best method, but until such an editor exists, it may not be very helpful.

3. Relative importance of symbols. With media like text, it is reasonable to assume that all symbols and all mistakes in identifying them are equally important. With music, that is not even remotely the case. It seems clear that note durations and pitches are the most important things, but after that, nothing is obvious. How important are redundant or cautionary accidentals? Fingerings? Mistaking note stems for barlines and vice-versa both occur; are they equally serious?

Two interesting recent attempts to make OMR evaluation more systematic that should be better-known are Ng et al (2005) and especially Bellini et al (2007). Bellini et al attempt to reach meaningful conclusions by distinguishing carefully between basic symbols and composite or complete symbols, as defined above. They then use the results of a survey of experts to assign weights to different errors. Ng et al describe methodologies for OMR evaluation in different situations, based on Bellini et al’s conclusions.
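A sketch of what weighted error scoring in this spirit might look like in practice; the error categories echo those used elsewhere in this report, but the weights are invented for illustration and are not the values Bellini et al derived from their expert survey.

```python
# Hypothetical weights; Bellini et al derive theirs from a survey of experts.
ERROR_WEIGHTS = {
    "wrong_pitch": 5.0,
    "wrong_duration": 5.0,
    "missing_note": 4.0,
    "missing_other_symbol": 1.0,
    "extra_symbol": 1.0,
    "misinterpretation": 2.0,
    "gross_misinterpretation": 10.0,
}

def weighted_error_score(error_counts):
    """error_counts: dict mapping error type -> number of occurrences on a page."""
    return sum(ERROR_WEIGHTS.get(kind, 1.0) * n for kind, n in error_counts.items())

# Three wrong pitches weigh far more than seven missing non-note symbols.
print(weighted_error_score({"wrong_pitch": 3, "missing_other_symbol": 7}))  # 22.0
```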

2. Our Research

2.0.1 Related Work

Since we began the current project, one OMR system has appeared that actually claims to use MROMR: the commercial program PhotoScore Ultimate. The publisher says that it incorporates the recognition engines of SharpEye (which they had acquired rights to) and the previous version of PhotoScore. However, almost no details of its operation are available; it is a “black box” program, like nearly all commercial programs. In any case, their situation is very different from and a great deal simpler than ours. Only two SR systems are involved, and the developers of PhotoScore Ultimate undoubtedly had access to the source code for both, so for them, there were no black boxes.

We know of no previous work that is closely related to ours. Classifier-combination systems have been studied in other domains, but recognition of music notation involves major issues of segmentation as well as classification, so our problem is substantially different. However, a detailed analysis of the errors made by the SharpEye program on a page of a Mozart piano sonata is available on the World Wide Web (Sapp 2005); it is particularly interesting because scans of the original page, which is in the public domain (at least in the U.S.), are available from the same web site, and we used the same page—in fact, the same scans—ourselves. But this is still far from giving us real guidance on how to evaluate ordinary SR, much less MR, OMR.

2.1 Methodology

2.1.1 Test Data and Procedures

While (in the words of Droettboom & Fujinaga 2004) “a true evaluation of an OMR system” would be important for a working MROMR system, determining rules for designing one is rather different from the standard evaluation situation. The idea here is not to say how well a system performs by some kind of absolute measures, but—as we have pointed out—to say in as much detail as possible how the SROMR systems compare to each other.

What are the appropriate metrics for such a comparison? We considered three questions, two of them already mentioned in our discussion of evaluation. (a) Do we care more about minimizing number of errors, or about minimizing time to correct? Also (and closely related), (b) should we count “secondary errors” or only primary ones? Finally, (c) how detailed a breakdown of symbols and error types do we want? In order to come up with a good multiple-recognizer algorithm, we need the best possible resolution (in an abstract sense, not a graphical one) in describing errors. Therefore the answer to (a) is to minimize the number of errors, and to (b) that, to the greatest extent possible, we should not count secondary errors. For item (c), consider the “extreme example” of three programs attempting to find note durations we discussed before, where all the information needed to reconstruct the original notes is available, but distributed among the programs. So, we want as detailed a breakdown as possible in terms of both symbols and error types.

We adopted a strategy of using as many approaches as possible to gathering the data. Following criteria laid out in our working document “Guidelines for MeTAMuSE MROMR Test Pages (and factors affecting OMR accuracy)”, for a first study (2005), we assembled a test collection of about 5 full pages of “artificial” examples, including the fairly well-known “OMR Quick-Test” (Ng & Jones, 2003), and 20 pages of real music from published editions (Table 1). The collection has versions of all pages at 300 and 600 dpi, with 8 bits of grayscale; we chose these parameters based on information from Fujinaga and Riley (2002, plus personal communication from both, July 2005). In most cases, we scanned the pages ourselves; in a few, we used page image files produced directly via notation programs or sent to us. With this collection, we planned to compare fully-automatic and therefore objective (though necessarily very limited) measures with semi-objective hand error counts, plus a completely subjective “feel” evaluation.

In 2008, we did a second study, using the heterogeneous collection just described plus a much more homogeneous collection: music for solo recorder from Der fluyten lust-hof, by the Baroque composer Van Eyck (Table 2).

We also considered documentation for the programs and statements by expert users. With respect to the latter, we asked experts on each program for the optimal settings of their parameters, among other things, intending to rely on their advice for our experiments. Their opinions appear in our working document “Tips from OMR Program Experts”. However, it took much longer than expected to locate experts on each program and to make clear what we needed; as a result, some of the settings we used probably were not ideal, at least for our repertoire. On the other hand, as time went on, it became more and more evident that, in any case, there’s no systematic way to choose optimal settings; see the discussion of black and white vs. grayscale settings below.

The procedures we used in 2005 are described in detail in our working document “Procedures and Conventions for the Music-Encoding SFStudy”. Our 2008 procedures were not too different; they’re described in “Procedures and Conventions for MeTAMuSE OMR Studies”.

2.1.2 The MROMR Process and the Strategy for Finding Rules

The MROMR process depends on a set of rules for merging the results of the SROMR programs involved into a single version of the music more accurate than any of the individual programs’. The diagram below shows how the process would work in a production environment.


The question is then how to find useful rules, and the strategy we developed is as follows.

1. Perform OMR on a collection of scanned pages, using several programs on each. The general criteria we used for selecting pages are described in our working paper “Guidelines for MeTAMuSE MROMR Test Pages (and factors affecting OMR accuracy)”.

2. Do a rough analysis of the OMR results (by counting errors of various types, etc.) to see if there are obvious recurring classes of errors, or patterns in the behavior of the individual programs (whether strengths or weaknesses) that might lead to identifying useful rules.

3. Come up with a hypothesis for the cause of the error. As far as possible, isolate the situation causing the error. For instance, this could involve stripping away most of the scanned image, leaving just the measure in question, and confirming that the error persists.

4. Where appropriate, duplicate the scenario with an artificial sample of music and an image generated by a notation program, and see if it still generates the error.

5. Manipulate the artificial sample to determine the exact mix of features that will cause the error to occur. In this situation, a large number of automatically-generated groundtruth and image files could be useful.

6. If possible, see if there is a way to glean the existence of the appropriate error-causing conditions solely from XML (see the sketch below).
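A hypothetical sketch of that last step: pulling a simple symbol stream (pitches, durations, barlines) out of a MusicXML file exported by an OMR program, so that error-causing conditions can be looked for in the XML alone. The element names follow the MusicXML standard; the file name is a placeholder.

```python
import xml.etree.ElementTree as ET

def symbol_stream(path):
    """Return a flat list of (kind, ...) tokens from a MusicXML file."""
    tokens = []
    for measure in ET.parse(path).getroot().iter("measure"):
        for note in measure.iter("note"):
            p = note.find("pitch")
            if p is not None:                       # skip rests
                step, octave = p.findtext("step"), p.findtext("octave")
                dur = note.findtext("duration")     # in MusicXML "divisions"
                tokens.append(("note", f"{step}{octave}", dur))
        tokens.append(("barline", "|"))             # measure boundary
    return tokens

# tokens = symbol_stream("testpage_photoscore.xml")
```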

2.1.3 Programs Tested

A table of OMR systems is given by Byrd (2007). We did two studies, one with three of the leading commercial programs available in spring 2005, and one with four of the leading programs (including updated versions of two of the 2005 ones) and a larger amount of music in spring and summer 2008. For the first study, we used PhotoScore 3.10, SmartScore 3.3 Pro, and SharpEye 2.63. For the second, we worked with Capella-Scan 6.1, PhotoScore Ultimate 5.5, SmartScore 5.3 Pro, and SharpEye 2.67. All are distributed more-or-less as conventional “shrink wrap” programs, effectively “black boxes” as the term is defined above. In 2005 we also considered Gamut/Gamera, one of the leading academic-research systems (MacMillan, Droettboom, & Fujinaga, 2002); for the second study, we tested Audiveris, a new open-source program. Aside from its excellent reputation, Gamut/Gamera offered major advantages in that its source code was available to us, along with deep expertise on it, since Ichiro Fujinaga and Michael Droettboom, its creators, were our consultants. However, Gamut/Gamera is fully user-trainable for the “typography” of any desired corpus of music. Of course, this flexibility would be invaluable in many situations, but, at the time, training data was available only for 19th-century sheet music; 20th-century typography for classical music is different enough that the results were too inaccurate to be useful, and it was felt that re-training it would have taken “many weeks” (Michael Droettboom, personal communication, June 2005). And we simply encountered too many problems with Audiveris to be able to use it: not too surprising for a new open-source program of such complexity.

2.1.4 Automatic Comparison and Its Challenges

Visualization is an issue for any research involving music in notation form: it invariably takes a lot of time, but being able to see what you’re doing is tremendously helpful. We chose to invest the time; using Lilypond to generate snippets of notation, we spent much effort on visualization aids (Knopke & Byrd 2007; Knopke 2008).

We attempted fully-automatic comparison of MusicXML files generated from the OMR programs via approximate string matching. For the main study, we used an alignment method based on the Needleman-Wunsch generalization of Levenshtein distance. We first tried aligning corresponding measures output by all programs, one measure at a time. This didn’t work well because the programs sometimes mistook barlines for note stems and vice-versa, each occurrence resulting in complete misalignment of all following measures. So we implemented global alignment with a large penalty for aligning notes and barlines (Knopke & Byrd, 2007). This technique gave much better results, though still not as good as we would have liked. However, there are obvious ways to fine-tune the algorithm that we didn't have time to test, so there’s reason for optimism.
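For reference, here is a minimal global-alignment sketch in the Needleman-Wunsch style over symbol streams from two programs, with a large penalty for aligning a note with a barline so that one mistaken barline cannot shift every later measure. The token format and the cost values are assumptions, not the actual MeTAMuSE implementation.

```python
def align_cost(a, b, gap=1, mismatch=1, note_vs_barline=10):
    """Edit-distance-style global alignment cost between two token sequences."""
    def cost(x, y):
        if x == y:
            return 0
        if "barline" in (x[0], y[0]) and x[0] != y[0]:
            return note_vs_barline        # strongly discourage note/barline matches
        return mismatch
    n, m = len(a), len(b)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * gap
    for j in range(1, m + 1):
        d[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j] + gap,                           # deletion
                          d[i][j - 1] + gap,                           # insertion
                          d[i - 1][j - 1] + cost(a[i - 1], b[j - 1]))  # alignment
    return d[n][m]

# Tokens are (kind, value) pairs; program B mistook a barline for a note.
prog_a = [("note", "G5"), ("barline", "|"), ("note", "A5")]
prog_b = [("note", "G5"), ("note", "B3"), ("note", "A5")]
print(align_cost(prog_a, prog_b))  # 2: cheaper to treat it as a deletion plus an insertion
```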

With this automation, we had expected to be able to test a large amount of music, following the strategy described above; but by the time it was working, it was too late to do so.

2.1.5 Hand Error Count, 2005

The first hand count of errors, in three pages of artificial examples and eight of published music (pages indicated with an “H” in the next-to-last column of Table 1), was relatively coarse-grained in terms of error types. We distinguished only seven types of errors, namely:

1. Wrong pitch of note (even if due to extra or missing accidentals)
2. Wrong duration of note (even if due to extra or missing augmentation dots)
3. Misinterpretation (symbols for notes, notes for symbols, misspelled text, slurs beginning/ending on wrong notes, etc.)
4. Missing note (not rest or grace note)
5. Missing symbol other than notes (and accidentals and augmentation dots)
6. Extra symbol (other than accidentals and augmentation dots)
7. Gross misinterpretation (e.g., missing staff)

For consistency, we (Byrd and Schindele) evolved a fairly complex set of guidelines for counting errors; these are listed in the “Procedures and Conventions” working document mentioned above. All the hand counting was done by us, and nearly all by a single person (Schindele), so we are confident our results have a reasonably high degree of consistency.

2.1.6 “Feel” Evaluation, 2005

We also did a completely subjective “feel” evaluation of a subset of the music used in the hand error count, partly as a so-called reality check on the other results, and partly in the hope that some unexpected insight would arise that way. The eight subjects were music librarians and graduate-student performers affiliated with a large university music school. We gave them six pairs of pages of music—an original, and a version printed from the output of each OMR program of a 300-dpi scan—to compare. There were two originals, Test Page 8 in Table 1 (Bach, “Level 1” complexity) and Test Page 21 (Mozart, “Level 2”), with the three OMR versions of each making a total of six pairs. The Mozart page is the same one used by Sapp (2005); in fact, we used his scans of the page.

Based on results of the above tests, we also created and tested another page of artificial examples, “Questionable Symbols”, intended to highlight differences between the programs: we will say more about this later.

2.1.7 Hand Error Count, 2008

The second hand count of errors involved considerably more music, totaling 23 pages: three pages of artificial examples and 20 of published music (pages indicated with an “H” in the last column of Table 1 and everything in Table 2). Like the first count, it was relatively coarse-grained in terms of error types: we distinguished the same types of errors, but we were much more careful to avoid counting secondary errors.

We started with the error-counting guidelines of the first study, but they evolved somewhat, as listed in the 2008 “Procedures and Conventions” working document mentioned above. The counting was done by two people (Abby Shupe and John Reef); then, for consistency, the work of one was reviewed by the other.


Table 1. “SFStudy” Collection

| Test page no. | Cmplx. Level | Title | Catalog or other no. | Publ. date | Display page no. | Edition or Source | Eval. Status (2005) | Eval. Status (2008) |
|---|---|---|---|---|---|---|---|---|
| 1 | 1x | OMR Quick-Test | | Cr. 2003 | 1 | IMN Web site | H | H |
| 2 | 1x | OMR Quick-Test | | Cr. 2003 | 2 | IMN Web site | H | H |
| 3 | 1x | OMR Quick-Test | | Cr. 2003 | 3 | IMN Web site | H | H |
| 4 | 1x | Level1OMRTest1 | | Cr. 2005 | 1 | DAB using Ngale | – | – |
| 5 | 1x | AltoClefAndTie | | Cr. 2005 | 1 | MS using Finale | – | – |
| 6 | 1x | CourtesyAccidentalsAndKSCancels | | Cr. 2005 | 1 | MS using Finale | – | – |
| 7 | 1 | Bach: Cello Suite no.1 in G, Prelude | BWV 1007 | 1950 | 4 | Barenreiter/Wenzinger | C | H |
| 8 | 1 | Bach: Cello Suite no.1 in G, Prelude | BWV 1007 | 1967 | 2 | Breitkopf/Klengel | HF | H |
| 9 | 1 | Bach: Cello Suite no.3 in C, Prelude | BWV 1009 | 1950 | 16 | Barenreiter/Wenzinger | – | H |
| 10 | 1 | Bach: Cello Suite no.3 in C, Prelude | BWV 1009 | 1967 | 14 | Breitkopf/Klengel | – | H |
| 11 | 1 | Bach: Violin Partita no. 2 in d, Gigue | BWV 1004 | 1981 | 53 | Schott/Szeryng | – | H |
| 12 | 1 | Telemann: Flute Fantasia no. 7 in D, Alla francese | | 1969 | 14 | Schirmer/Moyse | H | H |
| 13 | 1 | Haydn: Qtet Op. 71 #3, Menuet, viola part | H.III:71 | 1978 | 7 | Doblinger | H | H |
| 14 | 1 | Haydn: Qtet Op. 76 #5, I, cello part | H.III:79 | 1984 | 2 | Doblinger | – | H |
| 15 | 1 | Beethoven: Trio, I, cello part | Op. 3 #1 | 1950-65 | 3 | Peters/Herrmann | H | H |
| 16 | 1 | Schumann: Fantasiestucke, clarinet part | Op. 73 | 1986 | 3 | Henle | H | |
| 17 | 1 | Mozart: Quartet for Flute & Strings in D, I, flute | K. 285 | 1954 | 9 | Peters | – | H |
| 18 | 1 | Mozart: Quartet for Flute & Strings in A, I, cello | K. 298 | 1954 | 10 | Peters | – | H |
| 19 | 1 | Bach: Cello Suite no.1 in G, Prelude | BWV 1007 | 1879 | 59/pt | Bach Gesellschaft | H | H |
| 20 | 1 | Bach: Cello Suite no.3 in C, Prelude | BWV 1009 | 1879 | 68/pt | Bach Gesellschaft | – | H |
| 21 | 2 | Mozart: Piano Sonata no. 13 in Bb, I | K. 333 | 1915 | 177 | Durand/Saint-Saens | HF | H |
| 22 | 3 | Ravel: Sonatine for Piano, I | | 1905 | 1 | D. & F. | – | – |
| 23 | 3 | Ravel: Sonatine for Piano, I | | 1905 | 2 | D. & F. | – | – |
| 24 | 3x | QuestionableSymbols | | Cr. 2005 | 1 | MS using Finale | – | – |

Publication date: Cr. = creation date for unpublished items.
Display page no.: the printed page number in the original. “/pt” means only part of the page.
Hand/feel eval. status: H = hand error count done, F = “feel” evaluation done.
Definition of Level 1 music: monophonic with one staff per system; simple rhythm (no tuplets) and pitch notation (no 8vas); no lyrics, grace notes, cues, or clef changes. 1x means artificial (“composed” for testing); 1 is the same, but “real” music, with various complications.
Definition of Level 2 music: one voice per staff; not-too-complex rhythm (no tuplets) and pitch notation (no 8vas); no lyrics, cues, or clef changes; no cross-system anything (chords, beams, slurs, etc.). (Haydn's Symphony no. 1, used in the CCARH 1994 study, exceeds these limits with numerous triplets, both marked and unmarked, and with two voices on a staff in many places.)
Definition of Level 3 music: anything else. 3x means artificial.


Table 2. Van Eyck Collection (in part)

| Test page no. | Complx. Level | Title | Publ. date | Display page no. | Edition or Source |
|---|---|---|---|---|---|
| 2 | 2 | 2. Onse Vader in Hemelryck, modo 3 | 1986 | 2 | New Vellekoop Edition |
| 2 | 2 | 2. Onse Vader in Hemelryck, modo 4 | 1986 | 2 | New Vellekoop Edition |
| 3 | 2 | 2. Onse Vader in Hemelryck, modo 5 | 1986 | 3 | New Vellekoop Edition |
| 7 | 2 | 4. Psalm 118, modo 5 | 1986 | 8 | New Vellekoop Edition |
| 12 | 2 | 10. Sarabande | 1986 | 15 | New Vellekoop Edition |
| 12 | 2 | 10. Sarabande, modo 2 | 1986 | 15 | New Vellekoop Edition |
| 13 | 2 | 11. Rosemont | 1986 | 16 | New Vellekoop Edition |
| 13 | 2 | 11. Rosemont, modo 2 | 1986 | 16 | New Vellekoop Edition |
| 13 | 2 | 11. Rosemont, modo 3 | 1986 | 16 | New Vellekoop Edition |
| 14 | 2 | 12. Courant, of Ach treurt myn bedroefde | 1986 | 17 | New Vellekoop Edition |
| 14 | 2 | 12. Courant, of Ach treurt myn bedroefde, modo 2 | 1986 | 17 | New Vellekoop Edition |
| 14 | 2 | 13. d'Lof-zangh Marie | 1986 | 17 | New Vellekoop Edition |
| 20 | 2 | 22. Psalm 103, modo 3 | 1986 | 29 | New Vellekoop Edition |
| 21 | 2 | 22. Psalm 103, modo 4 | 1986 | 30 | New Vellekoop Edition |

2.2 Intellectual Property Rights and Publication Dates

An important consideration for any serious music-encoding project, at least for academic purposes, is the intellectual-property status of the publications to be encoded. Obviously the safest approach, short of potentially spending large amounts of time and money to clear the rights, is to use only editions that are clearly in the public domain. In general, to cover both Europe and the U.S., this means that the composer must have been dead for more than 70 years, and the edition must have been published before 1923. This restriction is somewhat problematic because music-engraving practice has changed gradually over the years, and perhaps less gradually since the advent of computer-set music, and the commercial OMR programs are undoubtedly optimized for relatively-recent editions.

An intellectual-property question of an entirely different sort is whether a multiple-classifier OMR system would infringe anyone's intellectual property rights. We asked a well-regarded intellectual-property law firm, Origin Ltd., to investigate both questions under both U.S. and European law, and overall, their answers were reassuring; see working document “Advice to Professor Geraint Wiggins of Goldsmiths College on IPR Issues”.

For our research, as Table 1 shows, we used a mixture of editions from the public-domain and non-public-domain periods, plus some very recent computer-set examples.

3. Results

Thanks to our difficulties with the fully-automatic system, we ended up relying almost entirely on the hand error counts, plus expert opinions and documentation. The results that follow attempt to reveal the strengths and weaknesses of each program, with an eye toward integrating that data into a future MROMR system.

3.1 Hand Error Count Results, 2005

Results of the hand error counts are tabulated in our working document “OMR Page Hand Error Count Report for SFStudy”. The tables and graphs below demonstrate some of the main points:

With our test pages, the higher resolution didn’t really help. This was as expected, given that none of the test pages had very small staves. In fact, it didn’t have much effect at all on SharpEye and SmartScore. PhotoScore was much less predictable: sometimes it did better at the higher resolution, sometimes worse.

The programs generally had more trouble with more complex and more crowded pages. This was also as expected.

Despite its generally good accuracy and its high rating in the “feel” evaluation (see below), SharpEye made many errors on note durations. Most of these are because of its problems recognizing beams: see Conclusions below for details.

By the guidelines of Byrd (2004), which seem reasonable for high-quality encodings, all three programs were well above acceptable limits in terms of note pitch and duration for multiple pages. For example, SharpEye had error rates for note duration over 1% on several of our test pages, and over 4% on one (see Figure 7). Byrd’s guidelines say 0.5% is the highest acceptable rate. For note pitch, SharpEye had an error rate over 1% for one test page and nearly 1% for another (Figure 6), while Byrd’s limit for pitch error is only 0.2%.

We also did a more detailed count of errors in a few specific symbols other than notes: C clefs, text, hairpin dynamics, pedal-down marks, and 1st and 2nd endings. See working document “OMR Detailed Symbol Rule Report for SFStudy”.


Figure 4. Total Errors at 300 dpi (old programs & rules), by program and test page

| Program | TP 1 | TP 2 | TP 3 | TP 7 | TP 8 | TP 12 | TP 13 | TP 15 | TP 16 | TP 19 | TP 21 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| PhotoScore | 31 | 22 | 27 | 22 | 25 | 68 | 40 | 93 | 24 | 36 | 81 |
| SharpEye | 4 | 2 | 17 | 7 | 8 | 55 | 22 | 55 | 17 | 42 | 25 |
| SmartScore | 9 | 5 | 17 | 14 | 20 | 48 | 13 | 76 | 7 | 31 | 114 |

Figure 5. Total Errors at 600 dpi (old programs & rules), by program and test page

| Program | TP 1 | TP 2 | TP 3 | TP 7 | TP 8 | TP 12 | TP 13 | TP 15 | TP 16 | TP 19 | TP 21 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| PhotoScore | 22 | 5 | 19 | 29 | 63 | 64 | 28 | 123 | 27 | 42 | 106 |
| SharpEye | 4 | 2 | 18 | 9 | 9 | 45 | 13 | 66 | 14 | 40 | 29 |
| SmartScore | 9 | 4 | 18 | 39 | 30 | 53 | 19 | 96 | 13 | 30 | 119 |


Figure 6. Note Pitch Errors at 300 dpi (old programs & rules): percent of notes with wrong pitch, by program and test page

| Program | TP 1 | TP 2 | TP 3 | TP 7 | TP 8 | TP 12 | TP 13 | TP 15 | TP 16 | TP 19 | TP 21 | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PhotoScore | 20.6 | 14.4 | 5.6 | 0.6 | 0.3 | 0.6 | 0.4 | 3.3 | 0.0 | 0.6 | 1.2 | 4.35 |
| SharpEye | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.9 | 1.2 | 0.3 | 0.5 | 0.26 |
| SmartScore | 1.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.2 | 0.0 | 0.3 | 1.2 | 0.30 |

Figure 7. Note Duration Errors at 300 dpi (old programs & rules): percent of notes with wrong duration, by program and test page

| Program | TP 1 | TP 2 | TP 3 | TP 7 | TP 8 | TP 12 | TP 13 | TP 15 | TP 16 | TP 19 | TP 21 | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PhotoScore | 0.0 | 0.0 | 0.0 | 1.3 | 1.6 | 3.4 | 3.1 | 1.6 | 0.0 | 2.5 | 0.5 | 1.27 |
| SharpEye | 0.0 | 0.0 | 0.0 | 1.3 | 0.3 | 4.0 | 2.7 | 3.8 | 0.0 | 3.8 | 1.0 | 1.53 |
| SmartScore | 0.0 | 0.0 | 0.0 | 1.6 | 0.3 | 0.3 | 0.0 | 2.4 | 0.0 | 1.6 | 2.0 | 0.74 |

3.2 “Feel” Evaluation Results, 2005

As a “reality check”, this confirmed the hand error count results showing that all of the programs did worse on Level 2 music than on Level 1, and, in general, that SharpEye was the most accurate of the three programs. One surprise was that the subjects considered SmartScore the least accurate, but not by much: its rating and PhotoScore’s were very close. The gap from SharpEye to PhotoScore was not great either, but it was considerably larger.


The rating scale of 1 (very poor) to 7 (excellent) was presented to the subjects as follows:

| very poor | poor | fair | fairly good | good | very good | excellent |
| 1 | 2 | 3 | 4 | 5 | 6 | 7 |

For OMR accuracy on this scale, SharpEye averaged 3.41 (about halfway from “fair” to “fairly good”) across both pages; SmartScore, 2.88 (a bit below “fair”), and PhotoScore, 3.03 (barely above “fair”); see Table 3. Several of the subjects offered interesting comments, though no great insights. For the full results, see working document “‘Feel’ Evaluation Summary”.

Table 3. Subjective OMR Accuracy

| Program | Test Page 8 (Bach) | Test Page 21 (Mozart) | Average |
|---|---|---|---|
| PhotoScore | 3.94 | 2.13 | 3.03 |
| SharpEye | 4.44 | 2.38 | 3.41 |
| SmartScore | 3.88 | 1.88 | 2.88 |

3.3 Integrating Information from All Sources, 2005

We created and tested another page of artificial examples, “Questionable Symbols”, intended to demonstrate significant differences among the programs (Figure 8). The complete report on it, working document “Recognition of Questionable Symbols in OMR Programs”, tabulates results of our testing and other information sources on the programs’ capabilities: their manuals, statements on the boxes they came in, and comments by experts. Of course claims in the manuals and especially on the boxes should be taken with several grains of salt, and we did that.


Figure 8.

3.4 Hand Error Count Results, 2008

As already mentioned, hand error counts in this second study involved newer versions of two of the three original programs plus one new program. In theory, the error-counting rules were only slightly different. In practice, however, we discovered that we had been considerably too lenient in applying the “don’t count secondary errors” principle in 2005. This time, we were much stricter; as a result, the numbers are not directly comparable. In particular, the dramatic reduction in errors from the earlier study is—unfortunately—more appearance than reality.


Figure 9. Total Errors at 300 dpi (new programs & rules), by program and test page

| Program | TP 1 | TP 2 | TP 3 | TP 7 | TP 8 | TP 12 | TP 13 | TP 15 | TP 17 | TP 19 | TP 21 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| PhotoScore | 4 | 8 | 10 | 24 | 16 | 22 | 9 | 72 | 43 | 44 | 38 |
| SharpEye | 7 | 1 | 19 | 9 | 21 | 30 | 9 | 53 | 43 | 75 | 21 |
| SmartScore | 15 | 3 | 18 | 25 | 19 | 36 | 15 | 93 | 102 | 57 | 84 |
| Capella | 10 | 4 | 24 | 17 | 8 | 24 | 10 | 40 | 34 | 52 | 112 |

Figure 10. Note Pitch Errors at 300 dpi (new programs & rules): percent of notes with wrong pitch, by program and test page

| Program | TP 1 | TP 2 | TP 3 | TP 7 | TP 8 | TP 12 | TP 13 | TP 15 | TP 17 | TP 19 | TP 21 | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PhotoScore | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.2 | 0.4 | 0.3 | 0.0 | 0.09 |
| SharpEye | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.7 | 0.0 | 0.0 | 0.0 | 0.06 |
| SmartScore | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.4 | 0.4 | 0.3 | 0.2 | 0.13 |
| Capella | 2.9 | 0.0 | 0.0 | 0.0 | 0.3 | 0.0 | 0.4 | 0.0 | 0.2 | 1.9 | 5.0 | 0.98 |


Figure 11. Note Duration Errors at 300 dpi (new programs & rules): percent of notes with wrong duration, by program and test page

| Program | TP 1 | TP 2 | TP 3 | TP 7 | TP 8 | TP 12 | TP 13 | TP 15 | TP 16 | TP 19 | TP 21 | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PhotoScore | 0.0 | 0.0 | 0.0 | 0.3 | 0.0 | 0.0 | 0.0 | 1.8 | 0.7 | 2.8 | 0.5 | 0.55 |
| SharpEye | 0.0 | 0.0 | 0.0 | 1.3 | 0.0 | 1.9 | 0.0 | 1.1 | 0.9 | 11.9 | 1.0 | 1.63 |
| SmartScore | 1.3 | 0.3 | 0.6 | 0.0 | 0.0 | 0.0 | 3.6 | 5.8 | 9.1 | 7.8 | 2.7 | 2.83 |
| Capella | 0.0 | 0.0 | 0.0 | 1.6 | 0.3 | 0.0 | 0.4 | 1.1 | 0.0 | 2.8 | 0.5 | 0.61 |

3.5 Black & White vs. Grayscale Images, 2008

Surprisingly, black and white vs. grayscale images turned out to be a major issue with two of our four programs, PhotoScore Ultimate and SmartScore.

The publisher of PhotoScore Ultimate recommends grayscale. We ran four test pages in both black and white and grayscale to see what the difference would be. Contrary to their recommendation, with two, grayscale gave somewhat worse results; with the other two, it gave much worse results. The program's Help explains why grayscale is preferred, namely that “If you are scanning in gray, then the page does not need to be completely straight - PhotoScore will automatically make the page level without loss of detail. It will not be rotated if scanning in black & white, as this would result in loss of detail, thus giving less accurate recognition results.” However, the publisher's Support web page is more ambivalent on the question (see http://www.neuratron.com/faq/UnexpectedBehaviourScanning.htm#Poor_results):

Why do I get poor results from a clear piece of music?

The following may help to improve the accuracy: ... 6) In general, ensure you are scanning in 256 shades of gray... However, with some scanners you may achieve better results by scanning in 2 colors - also called 'b/w drawing', or '1-bit gray' - and manually adjusting the brightness setting so that there are no broken lines or smudged objects.

With previous versions of SmartScore, the publisher recommended black and white; with SmartScore 5, they recommend grayscale. But again, the recommendation is questionable: in our tests, grayscale seemed to give worse results. From their web page "SmartScore X Scanning Notes and Scanner Compatibility" (http://www.musitek.com/scanningnotes.html):

What is "dithering" and why it's a pain...

If you are scanning in "Black and White" in your scanner's software and SmartScore's recognition results are particularly poor (e.g. 50-75% accuracy), most likely the image created by your scanner was "dithered". Many modern scanners and "All-in-one" scanner/copier/printers are optimized for photos, not for traditional OCR (optical character recognition). When "Black and White" is selected, most All-In-One scanners actually create a gray image then convert it to black-and-white by means of "dithering" rather than by "thresholding". Dithering is fine for photographs, but it plays havoc with images intended for OCR. This is why we recommend you ALWAYS SCAN IN GRAY.

All of this suggests that even the publishers of commercial programs know no systematic way to decide even something as simple as whether black and white or grayscale images will give better results. This would seem to be another indication of the need for improved evaluation methods.
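As a small illustration of the distinction the quotation draws (a sketch under the assumption that Pillow is available; the file name and cutoff are placeholders), thresholding maps each gray pixel to pure black or white by a fixed cutoff, preserving solid symbol shapes, which is what OMR and OCR engines want, whereas dithering scatters the gray into speckle.

```python
from PIL import Image

def threshold_to_bw(path, cutoff=128):
    """Convert a grayscale scan to 1-bit black and white by simple thresholding."""
    gray = Image.open(path).convert("L")  # 8-bit grayscale
    return gray.point(lambda p: 255 if p >= cutoff else 0, mode="1")

# bw = threshold_to_bw("testpage_300dpi.tif")
# bw.save("testpage_300dpi_bw.tif")
```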

4. Conclusions: Prospects for Building a Useful System

Based on our 2005 research, especially the “Questionable Symbols” table, we developed a set of possible “rules” for an MROMR system; for the full set of 17, see working paper “Proposed ‘Rules’ For Multiple-Classifier OMR”. Here are four, stripped of justification:

1. In general, SharpEye is the most accurate.

2. PhotoScore often misses C clefs, at least in one-staff systems.

3. For text (excluding lyrics and standard dynamic-level abbreviations like “pp”, “mf”, etc.), PhotoScore is the most accurate.

4. SharpEye is the worst at recognizing beams, i.e., the most likely to miss them completely.

These rules are, of course, far too vague to be used as they stand; they might be better described as “rule concepts”. In addition, they apply to only a very limited subset of situations encountered in music notation; 50 or even 100 rules would be a more reasonable number. Both the vagueness and the limited situations covered are direct results of the fact that we inferred the rule concepts by inspection of just a few pages of music and OMR versions of those pages, and from manually-generated statistics for the OMR versions. Finding a sufficient number of truly useful rules is not likely without examining a much larger amount of music—at a minimum, say, ten times the 15 or so pages in our collection that we gave more than a passing glance. By far the best way of doing this is with automatic processes.

This is especially true because building MROMR on top of independent SROMR systems involves a moving target. As the rule concepts above suggest, SharpEye was more accurate than the other programs on most of the features we tested, so, with these systems, the question reduces to one of whether it is practical to improve on SharpEye’s results by much. One way in which it clearly is practical is for note durations. A more precise statement, in procedural form, of our rule concept 4 (above) is “when SharpEye finds no beams but the other programs agree on a nonzero number of beams, get the notes’ duration from the other programs”. But this still is not precise enough to implement as a program. What we need is something like “when SharpEye finds a series of notes of quarter-note or longer duration but the other programs agree on durations shorter than a quarter, get the notes’ duration from the other programs”. This rule would cut SharpEye’s note-duration errors by a factor of 3, i.e., to a mean of 0.51%, improving its average performance for this type of error from the worst of the three programs to by far the best. It would also reduce its worst-case error rate from 4.0% to 1.1%.
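A hedged sketch of that more precise rule in executable form, under the assumption that the programs' outputs have already been aligned note by note and that durations are expressed as fractions of a whole note; the function and variable names are illustrative.

```python
from fractions import Fraction

QUARTER = Fraction(1, 4)

def merge_duration(sharpeye_dur, other_durs):
    """If SharpEye reads a quarter or longer where the other programs agree on
    something shorter (typically a missed beam), trust the others' duration."""
    if (sharpeye_dur >= QUARTER
            and other_durs
            and all(d < QUARTER for d in other_durs)
            and len(set(other_durs)) == 1):
        return other_durs[0]
    return sharpeye_dur

# SharpEye missed the beams and reports a quarter; the others agree on an eighth.
print(merge_duration(Fraction(1, 4), [Fraction(1, 8), Fraction(1, 8)]))  # -> 1/8
```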

However, after the conclusion of that study, major upgrades to both SmartScore and PhotoScore were released; then PhotoScore Ultimate was released, which—as previously mentioned—is claimed by its publisher to actually incorporate the recognition engine of SharpEye (which they had acquired rights to) as well as the previous version of PhotoScore. We used both PhotoScore Ultimate and the new SmartScore in the 2008 study. The new PhotoScore is indeed much more accurate than the previous version, and that might make the rules for an MROMR system more balanced. On the other hand, it is quite possible that before such a system could be completed, SmartScore would be upgraded enough that it would be better on almost everything! And, of course, a new program—say, Audiveris—might do well enough to be useful. In any case, the moving-target aspect should be considered in the design process. In particular, devoting more effort to almost any aspect of evaluation is probably worthwhile, as is a convenient knowledge representation, i.e., way of expressing whatever rules one wants to implement. Apropos of this, Michael Droettboom (personal communication, November 2005) has said:

In many classifier combination schemes, the “rules” are automatically inferred by the system. That is, given a number of inputs and a desired output, the system itself can learn the best way to combine the inputs (given context) to produce the desired output. This reduces the moving target problem, because as the input systems change, the combination layer can be automatically adapted. I believe that such a system would be incredibly complex for music (both in terms of writing it and runtime), but there [are precedents] in other domains.


4.1 Chances for Significant Gains and a Different Kind of MROMR

Leaving aside the vagueness and limitations of our rules, and the question of alignment before they could be applied, more work will be needed to make it clear how much rules like these could actually help. But it is clear that help is needed, even with notes: as we have said, in our tests, all of the programs repeatedly went well beyond reasonable limits for error rates for note pitch and note duration.

This suggests a very different multiple-recognizer method, despite the fact that it could help only with note pitches and durations: using MIDI files as a source of information to correct output from an OMR program. This is particularly appealing because it should be relatively easy to implement such a system, and because there are already collections like the Classical Archives (www.classicalarchives.com), with its tens of thousands of files potentially available at very low cost. However, the word “potentially” is a big one; these files are available for “personal use” for almost nothing, but the price for our application might be very different. Besides, many MIDI files are really records of performances, not scores, so the number that would be useful for this purpose is smaller—probably much smaller—than might appear.

4.2 Fixing the Inevitable Mistakes

A major issue is that of how users are to fix the inevitable mistakes of any reasonably foreseeable OMR system. All of the commercial systems have built-in “split-screen” editors to let users do this: these editors show the OMR results aligned with the original scanned-in image. As general-purpose music-notation editors, it is doubtful they can compete with programs like Sibelius and Finale, but for this purpose, being able to make corrections in a view that makes it easy to compare the original image and the OMR results is a huge advantage. An MROMR system could probably handle this by feeding “enhanced” output in some OMR program’s internal format back into it. One candidate is SharpEye: its format appears to be simple and well-documented, so the problem should not be too difficult.

But the ideal solution would almost certainly be a new split-screen music editor with features designed specifically for comparing multiple OMR results and correcting problems. That would be a major undertaking, but it could be extremely useful for other purposes as well, e.g., in an interactive program for comparing versions of the same music. Such a program, incorporating the “musicdiff” engine MROMR requires but with an appropriate user interface, would surely be as valuable for music as the comparison and merging features (such as those in MS Word) that are now widely available for text. To our knowledge, no such program exists, and no one else has even proposed to develop one, so this would be a groundbreaking system. As a step in that direction, we have begun development of a “musicdiffGUI” tool. For MROMR, of course, the musicdiff functions without a user interface, as part of the system’s “engine”; but a musicdiff that shows a user exactly where several versions of a piece differ, and lets them pick and choose which version to use where, has many potential applications, one of which is visualization for researchers working on many music-informatics problems, including OMR. We have now written and demonstrated a prototype musicdiffGUI of this type; a demonstration version is currently available at http://www.informatics.indiana.edu/donbyrd/MeTAMuSE/MusicDiffGUI/ .
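The sketch below illustrates, in greatly simplified form, the kind of engine such a tool needs: find the regions where two versions differ, then let the caller (ultimately a GUI, and behind it a user) choose which side to keep in each differing region. The flat list of (pitch, duration) events and the function names are illustrative assumptions only; as argued earlier in this report, comparing real CMN is far harder.

# Minimal sketch of a "musicdiff" engine with pick-and-choose merging, under
# the strong simplifying assumption that each version is already reduced to a
# flat sequence of (pitch, duration) events. Names are hypothetical.
from difflib import SequenceMatcher

def diff_regions(version_a, version_b):
    """Return all regions in order, each tagged 'equal', 'replace', 'insert',
    or 'delete', with the corresponding events from each version."""
    matcher = SequenceMatcher(None, version_a, version_b, autojunk=False)
    return [(tag, version_a[i1:i2], version_b[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()]

def merge(regions, choices):
    """choices: 'a' or 'b' for each differing region; shared regions pass through."""
    merged, k = [], 0
    for tag, from_a, from_b in regions:
        if tag == "equal":
            merged.extend(from_a)
        else:
            merged.extend(from_a if choices[k] == "a" else from_b)
            k += 1
    return merged

a = [("C4", 1.0), ("D4", 0.5), ("E4", 0.5)]
b = [("C4", 1.0), ("D4", 1.0), ("E4", 0.5)]
regions = diff_regions(a, b)
print(merge(regions, choices=["b"]))   # keep version b's reading of the middle note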

We have also begun design work for a “universal” multimedia visualization system, called the “General Temporal Workbench”, and obtained funding for initial development. Building on recent work by Kurth et al. (2007) and others, this system would support coordinated display of any combination of visualizations of music and other media; for example, it could simultaneously show images of a music-notation score both as rendered from MusicXML and in scanned form, along with waveforms and spectrograms of a performance of the same music, and play back the performance, with a cursor or other indicator in each view moving as it plays. It should be an ideal tool for future work on the music database envisioned by the Goldsmiths research on MeTAMuSE and AMuSE.
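The core piece of machinery such coordinated views need is a map from score time to audio time. The sketch below shows one simple way to do this, assuming a hypothetical piecewise-linear “sync table” of (beat, second) anchor pairs such as score-to-audio synchronization in the spirit of Kurth et al. (2007) could produce; everything else about the Workbench is out of scope here.

# Minimal sketch: map score time (beats, as in a notation view) to audio time
# (seconds, as in a waveform or spectrogram view) by interpolating between
# anchor pairs in a hypothetical sync table.
from bisect import bisect_right

def beat_to_seconds(beat, sync_table):
    """Linearly interpolate between the nearest anchors around `beat`."""
    beats = [b for b, _ in sync_table]
    i = max(1, min(bisect_right(beats, beat), len(sync_table) - 1))
    (b0, s0), (b1, s1) = sync_table[i - 1], sync_table[i]
    return s0 + (beat - b0) * (s1 - s0) / (b1 - b0)

# Anchors: beat 0 at 0.0 s, beat 4 at 2.0 s (fast), beat 8 at 6.0 s (slower).
table = [(0, 0.0), (4, 2.0), (8, 6.0)]
print(beat_to_seconds(6, table))   # -> 4.0: place every view's cursor here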

4.3 Advancing the State of the Art

In the long run, developing a practical MROMR system seems reasonable, but much needs to be done first. It is clear that the state of OMR evaluation is dismal, and one reason, we have argued, is that it is inherently very difficult. A simple way to at least allow some meaningful comparisons of OMR programs is via standard test methods, i.e., evaluation metrics and music to apply them to. But, to our knowledge, there are absolutely no standards of any kind in general use. For example, the guidelines we developed for this project include a list of factors that seem likely to affect OMR accuracy (see Appendix: Factors Affecting Accuracy); to our knowledge, this is the only such list in existence. If program A is tested with one collection of music and program B is tested with a different collection, and you don’t even have a basis for saying which music is more challenging, how can you possibly compare the results? What if one system is evaluated at a low level and the other at a high level, which might or might not lead to a much higher error rate? As things stand, there’s no way to know if any OMR system, conventional or multiple-recognizer, really is more accurate than any other for any given collection of music, much less for music in general.
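Purely as an illustration of how simple a shared low-level metric could be, and not as a proposed standard, here is a sketch of a symbol error rate in the spirit of the word error rate used in OCR and speech recognition, computed against a groundtruth symbol sequence with Python's standard difflib. The symbol labels are made up for the example.

# Minimal sketch of a low-level symbol error rate against a groundtruth
# symbol sequence. Symbol labels here are invented for illustration.
from difflib import SequenceMatcher

def symbol_error_rate(recognized, groundtruth):
    """Edit-distance-style error rate over symbol labels, normalized by the
    groundtruth length. SequenceMatcher's alignment is not guaranteed to be
    minimal, so treat the result as an approximation."""
    matcher = SequenceMatcher(None, recognized, groundtruth, autojunk=False)
    errors = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # For each differing block, count substitutions plus whatever
            # insertions or deletions remain: the larger of the two lengths.
            errors += max(i2 - i1, j2 - j1)
    return errors / max(len(groundtruth), 1)

print(symbol_error_rate(["treble", "C4q", "D4q", "barline"],
                        ["treble", "C4q", "E4q", "F4q", "barline"]))  # -> 0.4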

We plan to make our test methods, collections (the files of scanned images, groundtruth files, and OMR-program output files), and findings available to the community. The music informatics world now has an annual forum for comparing research in various areas, namely MIREX (the “Music Information Retrieval Evaluation eXchange”). It is not hard to imagine a MIREX track for OMR employing the comparison technology, guidelines for choosing test pages, and music collections we have developed, perhaps with the evaluation methods of Bellini et al. (2007). We have also discussed with Stephen Downie of the University of Illinois, director of MIREX, making some or all of the music collections we assembled for our OMR research publicly available, and we have begun discussing an OMR track for a future MIREX.


Second, it should be noted that, for OMR, open-source programs form a category distinct from both research systems (which are not practical for most real-world use) and closed-source commercial programs (which are extremely difficult either to evaluate for accuracy or to use in MROMR). We therefore have high hopes for future versions of Audiveris and similar programs.

Finally, we believe that the idea of a musicdiff, both as a comparison engine and with a user interface, has great potential for a wide range of applications, and we intend to pursue it as part of the General Temporal Workbench project.

5. Acknowledgements

It is a pleasure to acknowledge the assistance David Bainbridge, Michael Droettboom, and Ichiro Fujinaga gave us; their deep understanding of OMR technology was invaluable. In addition, Michael spent some time training Gamut/Gamera with our music samples. Bruce Bransby, Bill Clemmons, Kelly Demoline, Andy Glick, and Craig Sapp offered their advice as experienced users of the programs we tested. Graham Jones, developer of SharpEye, pointed out an important conceptual error in our early thinking, as well as some misleading statements in an early draft of this paper. Don Anthony, David Bainbridge, Andy Glick, Craig Sapp, and Eleanor Selfridge-Field all helped us obtain test data. Tim Crawford and Geraint Wiggins helped plan the project from the beginning and offered numerous helpful comments along the way. Geoff Chirgwin wrote the musicdiff GUI, among other things. Anastasiya Chagrova, John Reef, and Abby Shupe helped in several ways, including continuing the hand error counting first done by one of us (Schindele). Finally, Don Anthony, Mary Wallace Davidson, Phil Ponella, Jenn Riley, and Ryan Scherle provided valuable advice and/or support of various kinds.

Ian Knopke’s work was supported by Indiana University internal funds. The project as a whole was made possible by the support of the Andrew W. Mellon Foundation; we are particularly grateful to Suzanne Lodato and Donald Waters of the Foundation for their interest in our work.

References

Bainbridge, David, & Bell, Tim (2001). The Challenge of Optical Music Recognition. Computers and the Humanities 35(2), pp. 95–121.

Bellini, Pierfrancesco, Bruno, Ivan, & Nesi, Paolo (2007). Assessing Optical Music Recognition Tools. Computer Music Journal 31(1), pp. 68–93.

Byrd, Donald (2004). Variations2 Guidelines For Encoded Score Quality. Available at http://variations2.indiana.edu/system_design.html

Byrd, Donald (2007). OMR (Optical Music Recognition) Systems. Available at http://www.informatics.indiana.edu/donbyrd/OMRSystemsTable.html

Droettboom, Michael, & Fujinaga, Ichiro (2004). Micro-level groundtruthing environment for OMR. In Proceedings of the 5th International Conference on Music Information Retrieval (ISMIR 2004), Barcelona, Spain, pp. 497–500.

Fujinaga, Ichiro, & Riley, Jenn (2002). Digital Image Capture of Musical Scores. In Proceedings of the 3rd International Symposium on Music Information Retrieval (ISMIR 2002), pp. 261–262.

Kilian, Jürgen, & Hoos, Holger (2004). MusicBLAST—Gapped Sequence Alignment for MIR. In Proceedings of the 5th International Conference on Music Information Retrieval (ISMIR 2004), pp. 38–41.

Knopke, Ian (2008). The Perlhumdrum and Perllilypond Toolkits for Symbolic Music Information Retrieval. In Proceedings of the 9th International Conference on Music Information Retrieval (ISMIR 2008), Philadelphia, pp. 147–152.

Knopke, Ian, & Byrd, Donald (2007). Towards MusicDiff: A Foundation for Improved Optical Music Recognition using Multiple Recognizers. In Proceedings of the 8th International Conference on Music Information Retrieval (ISMIR 2007), Vienna, Austria, pp. 123–124.

Kurth, Frank, Müller, Meinard, Fremerey, Christian, Chang, Yoon-ha, & Clausen, Michael (2007). Automated Synchronization of Scanned Sheet Music with Audio Recordings. In Proceedings of the 8th International Conference on Music Information Retrieval (ISMIR 2007), Vienna, Austria, pp. 261–266.

MacMillan, Karl, Droettboom, Michael, & Fujinaga, Ichiro (2002). Gamera: Optical music recognition in a new shell. In Proceedings of the International Computer Music Conference, pp. 482–485.

Ng, Kia C., & Jones, A. (2003). A Quick-Test for Optical Music Recognition Systems. 2nd MUSICNETWORK Open Workshop, Workshop on Optical Music Recognition System, Leeds, September 2003.

Ng, Kia, et al. (2005). CIMS: Coding Images of Music Sheets, version 3.4. Interactive MusicNetwork working paper. Available at www.interactivemusicnetwork.org/documenti/view_document.php?file_id=1194

Pardo, Bryan, & Sanghi, Manan (2005). Polyphonic Musical Sequence Alignment for Database Search. In Proceedings of the 6th International Conference on Music Information Retrieval (ISMIR 2005), London, England, pp. 215–222.

Prime Recognition (2005). http://www.primerecognition.com

Reed, K. Todd (1995). Optical Music Recognition. M.Sc. thesis, Dept. of Computer Science, University of Calgary.

Sapp, Craig (2005). SharpEye Example: Mozart Piano Sonata No. 13 in B-flat major, K 333. http://craig.sapp.org/omr/sharpeye/mozson13/

Selfridge-Field, Eleanor, Carter, Nicholas, et al. (1994). Optical Recognition: A Survey of Current Work; An Interactive System; Recognition Problems; The Issue of Practicality. In Hewlett, W., & Selfridge-Field, E. (Eds.), Computing in Musicology, vol. 9, pp. 107–166. Menlo Park, California: Center for Computer-Assisted Research in the Humanities (CCARH).


Working Documents

The following supporting documents (as well as this one) are available on the World Wide Web at http://www.informatics.indiana.edu/donbyrd/MROMRPap ??NOT YET! -- VERSIONS THERE NOW ARE FROM 2006:

Analyses of error patterns: OMR Error Patterns in the Van Eyck Collection; Observations on Error Count Comparisons (2005 and 2008); Van Eyck Error Observations.
“Feel” Evaluation Summary (FeelEvalSummary.doc), 30 Nov. 2005.
Guidelines for MeTAMuSE MROMR Test Pages (and factors affecting OMR accuracy), 30 Sept. 2008.
Music-Notation Complexity Levels For MeTAMuSE OMR Tests, 6 Oct. 2008.
OMR Detailed Symbol Rule Report for SFStudy, 30 Nov. 2005.
OMR Page Hand Error Count Report for SFStudy, 30 Nov. 2005.
OMR Page Test Status for SFStudy, 20 Nov. 2005.
Procedures And Conventions For The Music-Encoding SFStudy, 23 Oct. 2005.
Procedures and Conventions for MeTAMuSE OMR Studies, 6 Oct. 2008.
Proposed ‘Rules’ For Multiple-Classifier OMR, 9 Nov. 2005.
Recognition of Questionable Symbols in OMR Programs, 9 Nov. 2005.
Tips from OMR Program Experts, 18 July 2005.

Appendix: Factors Affecting Accuracy

(This is an excerpt from our working document “Guidelines for MeTAMuSE MROMR Test Pages (and factors affecting OMR accuracy)”.)

We know or suspect the factors below are significant in terms of the recognition accuracy of OMR programs; they should therefore be considered in choosing test pages. For MeTAMuSE tests, we generally want to minimize problems resulting from typographic style and image quality.

1. CMN complexity

In general, more complex notation is likely to get worse results, as one would expect; but this topic is complex enough to deserve a document of its own. See OMRNotationComplexityDefs.txt.

2. Typographic style

"Engraving" quality. Music of "engraved" quality is likely to give the best results. Music of lower quality, but still with fixed-shape symbols (noteheads, rests, flags, accidentals, clefs, etc., as opposed to beams, slurs, etc.) having consistent shapes, is next best. Getting decent results with manually-produced ("manuscript" quality) images is probably hopeless.

Use of movable type or not. Music set from movable type -- with staff lines that have numerous tiny breaks -- is likely to be much harder for programs to handle.

Conventions for shapes and positions of symbols. The standard "modern" symbol shapes and positions are likely to give much better results. Some problem cases:

* Early 18th-century editions, e.g., LeCene's of Vivaldi, with clefs significantly different from modern ones, sharps rotated 45 degrees, downstemmed half-noteheads on the wrong side of the stem, etc., are not likely to work at all well, though it'd be interesting to try (LeCene for David Lasocki). (Is LeCene's style typical of early 18th-cent. editions? Probably so. Cf. e.g. Rastall's book.)
* Bach Gesellschaft editions, which use half-note heads as whole notes: probably OK, and too important to give up without a fight.
* Novello editions not set from movable type, with backwards bass clefs, backwards eighth rests used for quarter rests, etc.
* Looseness of spacing. Music with very tight spacing, either horizontally or vertically (e.g., our Beethoven Trio test page), is likely to give worse results.

3. Image quality

Print quality. Problems are likely to result from:

* Inferior-quality reproduction, with breaks in stems, little holes in solid noteheads, etc.
* Poor contrast
* Bleedthrough (from the other side of the page)
* Markings written on the copy, typically by students and teachers

Geometry of image. Curvature or other distortion of the image may lead to worse results. This seems most likely to be a problem at the left and right edges, but the only occurrence in our collections we know of is at the _top_ of p. 9 of Haydn Symphony no. 1. (But NB: this is more likely to be an artifact of the scanning process than of the printed page!)
