
The State-of-the-Art of Music Information Retrieval

Scott McDermott

Center for Advanced Computer Studies University of Louisiana at Lafayette

Lafayette, LA 70504 USA +1 337 482 6284

[email protected]

1. ABSTRACT Music information retrieval systems seem to have reached an impasse. A significant number of attempts at creating efficient and effective systems to search through large databases of music have produced almost as many distinct solutions or approaches. Some of these have genuinely valuable aspects, but to date no one solution appears diverse, powerful, and flexible enough to gain general acceptance in the music community. In this paper I critically examine the problem, analyze the current research, and suggest a clear direction for the field.

2. KEYWORDS Music information retrieval. MIR. Theme extraction. Melody extraction.

3. INTRODUCTION Over the past decade, and especially since the turn of the millennium, the digital music domain has swelled exponentially in both the volume of available titles and the extent of general use. Especially since the development, standardization, and acceptance of the MPEG Layer 3 audio format, or MP3, for audio storage, music over the Internet has become a huge phenomenon, possibly surpassing CD consumption. According to their own statistics [8], “MP3.com features streaming and downloadable music from more than 200,000 Artists and over 1.2 million songs.” However, locating specific items within these vast datasets can be cumbersome and problematic. Even more sophisticated commercial databases such as MP3.com only allow the user to search for songs via keyword searches (i.e. titles, composers, artists, or sometimes lyrics) of metadata associated with each performance. If, say, a user wishes to locate a song of which he recollects only a short section of the tune, he is currently out of luck. At the same time, the field of information retrieval has also seen rapid development.

Today we have available to us some exceptionally sophisticated systems that help us find the information we desire. Most of the time, we use these IR systems without knowing the level of detail and complexity they employ (and sometimes without even knowing that we are using an information retrieval system at all). This transparency, especially in browsing the Web, has reached a level of maturity that now allows even the most untrained user to easily find the resources he desires. Strikingly, the disparity between the elegance and refinement of basic IR systems and the simplistic, unadorned approaches used in music information retrieval (MIR) has often been ignored by the research community. Perhaps this is still largely due to the inherently interdisciplinary nature of MIR, but the general need for more sophisticated MIR systems can no longer be ignored. Already, a number of attempts at creating effective music information retrieval systems have utilized many of the basic concepts employed in the retrieval of text and other data; however, much research is still needed in this direction.

4. DEFINING MIR In order to describe the challenge of designing effective music information retrieval systems, we must first have a clear understanding of the medium involved. Storage of sound and music (there is a significant difference between the two) falls into two basic categories [7]. Continuous, or raw, formats are easily the most popular. These include MP3, WAV, and AIFF files and typically represent the actual performance of a piece. Almost exclusively, these files are sampled recordings stored at various qualities and sample rates. The other, less popular, but possibly more prolific category comprises the discrete file formats such as MIDI, GUIDO, and music notation software formats (such as Encore and Finale files). These formats more or less digitally describe the composer’s intents and/or instructions. Though the continuous formats almost always contain far more data and information, conversion between the formats is for the most part only available from discrete to continuous. Some researchers are pursuing the conversion in the other direction, but this is a very challenging problem that will take some time to resolve. It is largely for this reason that much of the research in MIR has shifted toward using only discrete formats (see the summary for clarification).

With the increased demand for and usage of digitized audio, it is not surprising that different user needs will emerge. For instance, given a large set, or database, of MP3s, one user might want to find a song that sounds like something he heard earlier, while another might wish to determine whether any song in the database violates the copyright of some specific song (i.e. has strikingly similar rhythm, melody, and/or lyrics). Most of the current research reviewed here at the very least acknowledges this issue if it does not directly address it.

Broadly, we can divide MIR into three basic parts. Since we will mostly be concerned with the melody, rhythm, and/or lyrics, a very important aspect of MIR is the extraction of these elements from a discrete format (such as MIDI). Once we have determined the basic elements for the various items in the dataset, we need to be able to perform queries on the database using melody matching techniques or other information retrieval algorithms. The last important aspect of a MIR system is the query presentation, or the interface the user manipulates to query the database and view the results. Many systems developed to date simply allow the user to hum, sing, or play (i.e. on a keyboard) the melody that they wish to search for in the database. A good MIR system will take all three of these issues into account to create an effective, seamless experience.

The next section of this paper examines a number of recent publications of MIR research, specifically those dealing with the core aspects of the problem. I divide the section into two categories. The first group of reviews involves theme or melody extraction from both discrete and continuous formats. One could consider this a subcategory of the second group, as knowledge of the theme lends itself to a better ability to search a database of songs. The second group analyzes approaches to the problem of music information retrieval as a whole. Papers reviewed there generally deal with more than simply determining the theme of a song, delving into pattern matching techniques, database organization, and other fundamental concepts.

5. PREVIOUS WORKS

5.1 Melody Extraction (theme indexing) The purpose of these systems is to automatically determine the main theme of a musical score. Though in and of itself this may not be entirely useful, knowledge of the themes could drastically improve the capabilities of a MIR system. Conceptually, many users of a prototypical music information retrieval system would have some melody, or theme, in mind to search for. Though defining the actual main theme of a composition might seem a bit abstract, numerous researchers have already described and detailed the themes of many important musical pieces [2] and have even reached some consensus. A great deal of research [3] has also gone into the psychological effects of music and the importance of these themes, or contours, to our perceptions and our ability to recollect songs. Therefore, knowledge of these themes and the ability to search a database for them should significantly improve the user’s experience.

5.1.1 HarmAn [4] Even though this project does not explicitly determine the theme of a piece of music, its results could certainly be integrated into a MIR system for added flexibility, or added to a theme-finding algorithm to enhance performance. HarmAn [4], developed by the same research group as MME (below) and preceding it, harmonically analyzes musical scores and attempts to generate the chord progression. To do this, it divides the composition into partitions (see Figure 1) and compares the set of notes within each partition to “templates” of chords to find a possible label for the chord. It then moves through the piece and, using a rather straightforward weighting algorithm, decides which chord partition labels most accurately depict the harmonic structure.
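To make the template idea concrete, the following Python sketch illustrates the matching step in its simplest form. The chord dictionary and the present-minus-absent scoring rule are my own simplifications for illustration; HarmAn's actual template set, weighting algorithm, and tie-breaking rules are more elaborate.

```python
# Minimal sketch of HarmAn-style template matching (simplified; not the
# authors' exact weighting scheme). Pitches are reduced to pitch classes 0-11.

CHORD_TEMPLATES = {
    "C maj": {0, 4, 7},   # C E G
    "G maj": {7, 11, 2},  # G B D
    "A min": {9, 0, 4},   # A C E
    # ... a real system would enumerate all roots and chord qualities
}

def label_partition(partition_pitches):
    """Score each chord template against the pitch classes sounding in
    one partition and return the best-matching label."""
    pcs = {p % 12 for p in partition_pitches}
    best_label, best_score = None, float("-inf")
    for label, template in CHORD_TEMPLATES.items():
        # Reward template tones that are present, penalize non-chord tones.
        score = len(pcs & template) - len(pcs - template)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Example: a partition containing MIDI pitches 60 (C), 64 (E), 67 (G)
print(label_partition([60, 64, 67]))  # -> "C maj"
```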

Though analyzing the harmonic structure of a musical composition may seem abstract and even daunting for a computer to perform, the simple and practical approach the authors implement works notably well. The authors even mention that the system should be flexible enough to support genres other than western classical music by “introducing a template for the structure in question, and establishing a rule for resolving ties between the new template and existing ones.” Unfortunately, the paper neither explores this adaptability nor explicitly details the performance of the system. The algorithm, however, does seem to run rather efficiently, and knowledge of the harmonic structure could certainly be used to more accurately pick out the theme of a composition (though the same research group does not use this tool in the system reviewed next).

5.1.2 Melodic Motive Extractor [2] The Melodic Motive Extractor (MME), developed at the University of Michigan, attempts to extract themes from polyphonic scores by finding pattern repetition within the piece. The authors have developed [5] a number of complex algorithms that (almost in a statistical sense) numerically group various types of patterns within each voice of a score. The system then combines, sorts, and parses these pattern groups and returns the matching sections of the piece that it judges to represent the main themes. Surprisingly, there is really no harmonic analysis or fundamental musical knowledge built into the engine, so the algorithms should work just as well on other genres, including non-western music (though the authors do not claim this). Though they took a few liberties, such as ignoring the lower voices within a MIDI channel or instrument, not truly accounting for the extreme variations composers work upon their themes, and even padding (widening) the results to better encompass the themes, the end results are difficult to argue with. For almost all test cases, the system returned the correct primary theme and a number of less important themes. Notably, however, the system does produce a significant amount of redundant (i.e. obvious variations of the same theme) and extraneous output, and seems to have been tested on only a limited number of “classical” pieces. To be sure, the authors do mention that the system seemed to perform well on more contemporary (“from the Beatles to Nirvana”) scores, yet they did not provide concrete results due to the lack of consensus on these themes.
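The core intuition, stripped of MME's weighting and combination steps, can be sketched as follows: slide a fixed-length window over a single voice, key each window by its interval sequence (so transposed repeats still group together), and surface the most frequently repeated patterns as theme candidates. The window length and frequency threshold here are illustrative assumptions, not the authors' parameters.

```python
from collections import defaultdict

def repeated_interval_patterns(pitches, length=3, min_count=2):
    """Sketch of the MME idea: group every fixed-length window of a voice
    by its pitch-interval content so repeats surface as large groups.
    (The real system uses several pattern types and weighting steps.)"""
    groups = defaultdict(list)
    for i in range(len(pitches) - length):
        # An interval sequence makes the pattern transposition-invariant.
        key = tuple(pitches[i + k + 1] - pitches[i + k] for k in range(length))
        groups[key].append(i)
    # Candidate themes: patterns occurring at least min_count times,
    # most frequent first.
    return sorted(((k, v) for k, v in groups.items() if len(v) >= min_count),
                  key=lambda kv: -len(kv[1]))

# Toy voice: a four-note motive (C D E C) stated twice
voice = [60, 62, 64, 60, 67, 60, 62, 64, 60]
for pattern, positions in repeated_interval_patterns(voice):
    print(pattern, positions)   # -> (2, 2, -4) [0, 5]
```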

5.1.3 Hidden Markov Models [5] Most current research on theme or melody extraction involves analyzing discrete formats of music, such as GUIDO or MIDI files, which contain specific information about each note event. The assumption that these formats are available significantly simplifies the problem. However, the most common forms of digital music are continuous formats. These are recorded, or sampled, performances (often live) that include a great deal more “information” than the discrete formats. Melody extraction from these sources can be a daunting challenge at best due to the rich variety of complex disparities involved, including different musical styles, artists, genres, recording equipment, recording environments, and instrumentations. Therefore most researchers choose to analyze the discrete format instead and assume that the dataset will always have it available. The people at the Georgia Institute of Technology [5], however, decided to take the path less traveled and attempt to extract melodies from the continuous WAV format. The secret weapon of their HMM Melody Recognition System is the use of “wordspotting techniques from automatic speech recognition”. This simple, yet inspired, approach allows the developers to use hidden Markov models to match patterns within a song. Using HMMs, their system can statistically analyze the database and be trained to parse through it. In theory, this has a great deal of potential. In actuality, however, the system is flawed. First, and most importantly, it is limited to an extremely simple dataset: for this paper, only ten short monophonic, synthetically generated songs were used. Obviously, this comes nowhere close to the wide diversity available in continuous digital formats, and the fact that the system can only handle monophonic data severely limits its application. Additionally, the authors use correlated MIDI (discrete) files to train the system, and though they mention that the system does not require them (“but [is] rather a convenience”), this claim seems dubious at best. Finally, the retrieval system does not seem to take rhythm or even transposition into account when matching a query. As the authors mention in the conclusion, this is a “proof-of-concept”, and as such it does have some practical application possibilities. However, the more important contribution these authors might make to the field of MIR is the conversion from continuous to discrete formats, for which they have preliminarily developed a couple of techniques. This paper almost inadvertently proves that MIR systems should continue to develop retrieval techniques for discrete formats and assume that somebody will create methods for converting raw, continuous data into a workable static form.
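A minimal illustration of the word-spotting machinery is sketched below: a left-to-right HMM with one state per melody note, scored against a pitch-observation sequence with the Viterbi recursion. The Gaussian emissions and the uniform stay/advance probabilities are my own placeholder assumptions; the actual system's models, trained on audio features, are far richer.

```python
import math

def make_gaussian_emitter(mean, sigma=1.0):
    """Log-density of a pitch observation for a state centered on one melody note."""
    def log_emit(x):
        return -0.5 * ((x - mean) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))
    return log_emit

def viterbi_score(obs, log_emits, stay=math.log(0.5), move=math.log(0.5)):
    """Best-path log-likelihood through a left-to-right HMM (one state per
    melody note; each step either self-loops or advances). A wordspotter
    would compare this score against a background model or threshold to
    decide whether the melody occurs in the observations."""
    n = len(log_emits)
    prev = [log_emits[0](obs[0])] + [float("-inf")] * (n - 1)
    for x in obs[1:]:
        cur = [prev[0] + stay + log_emits[0](x)]
        for j in range(1, n):
            cur.append(max(prev[j] + stay, prev[j - 1] + move) + log_emits[j](x))
        prev = cur
    return prev[-1]  # require the path to end in the final melody state

# Query melody C-E-G, "sung" slightly off pitch: still scores well.
melody_states = [make_gaussian_emitter(p) for p in (60, 64, 67)]
print(viterbi_score([60.2, 60.1, 64.3, 67.0], melody_states))
```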

5.2 General Music Search Engines (full indexing) The following group of research projects attempts to tackle the problem of information retrieval with respect to music. They utilize a wide variety of techniques, and yet no single approach seems to have a solution for every aspect of the puzzle. Most likely, as described in the summary below, a final and generally accepted solution for MIR will develop as a hybrid, or assimilation, of the more appealing aspects of each method.

5.2.1 The Johns Hopkins University [1] The approach implemented by the researchers at The Johns Hopkins University takes a practical and somewhat effective form. They start with an extremely large collection (around 30,000 selections) of 19th and 20th century American music. Though this dataset largely ignores other important genres, like early European or non-western music, the approach they use for MIR seems mostly independent of the style of the music set. The researchers took a set of these scores and converted them from JPEG (i.e. scanned-in images) to the GUIDO format (a discretized musical notation language similar to MIDI).

The main advantage of their approach lies in the fact that the core search engine is basically independent of the database’s information content. More accurately, they use standard information retrieval techniques on pre-compiled, or generated, data from each of the various scores. When a musical piece is added to the dataset, it is “ingested” by creating and storing indexes and partitions. This ingestion process can simply be modified to fit various genres or musical structures, such as non-western music. Indexes, which they refer to as “secondary indexes”, are stored as inverted lists of the various elements of a musical work. For instance, for each pitch name (i.e. a, b, c, d, e, f, or g), the locations of the occurrences of that pitch in the score are stored (see Figure 2). The system can also combine some of these indexes to form more complex data structures (i.e. combining the pitch names and accidentals creates a single index for all the chromatic notes in the scale). In addition to these inverted lists, each score is also “partitioned” into larger sub-sections of the piece. Again, these partitions are determined when the selection is added, or ingested, to the dataset. Partitions include clef, key, time signatures, titles, authors, and other musical concepts that can uniquely identify large portions of a composition. Finally, to search through the database of musical pieces, the authors employ standard information retrieval techniques that parse through the inverted lists. The real beauty of this approach is that the system can have multiple interfaces to perform a query. The authors specify that for different applications, or different users, various interfaces can sit on top of the core engine as appropriate: “The purpose of these interfaces is to translate a set of user-friendly commands or interactions into a query string accepted by the search engine.” Finally, though this system seems to head in the correct direction, it is still unclear how effectively it will work on polyphonic scores, pieces of non-western origin, and discretized sampled music (i.e. MP3s). The paper does mention that partitions extend the application to handle polyphonic music, yet the authors do not detail the effectiveness. Another downside, though not explicitly mentioned, is that the large amount of additional storage needed for the generated data will most likely cause a significant performance loss.
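The inverted-list idea itself is simple enough to sketch. In the toy Python below, ingesting a score records the positions of each pitch name, and a query such as "an f immediately followed by a g" reduces to intersecting positional lists. The function names and the single pitch-name index are illustrative assumptions; the actual engine maintains many more index types alongside the partitions.

```python
from collections import defaultdict

def ingest(score_id, pitch_names, index):
    """Sketch of JHU-style 'ingestion': record, for each pitch name, every
    position at which it occurs in the score (an inverted list)."""
    for pos, pitch in enumerate(pitch_names):
        index[pitch].append((score_id, pos))

def adjacent(index, first, second):
    """Standard IR over the lists: scores where `first` is immediately
    followed by `second`, found by intersecting shifted position lists."""
    positions = set(index[second])
    return {sid for (sid, pos) in index[first] if (sid, pos + 1) in positions}

index = defaultdict(list)
ingest("frere-jacques", list("cdecc"), index)
ingest("ode-to-joy", list("eefggfed"), index)
print(adjacent(index, "f", "g"))  # -> {'ode-to-joy'}
```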

5.2.2 Polyphonic n-grams [6] Though a number of the specific details in this paper [6] are basically flawed, the approach of the researchers at Imperial College in London has some definite merit. The system they have developed derives from successful techniques for pattern matching in monophonic compositions through the use of n-grams. These n-grams are simply overlapping windows that segment the piece into “melodic strings” on which algorithms can search for patterns. The system then encodes these patterns, in this case all possible pitch change combinations as well as time difference ratios, into text characters. Once this encoding is complete, common text search engines can parse through the strings to find desired patterns.

One major problem, however, is that the authors base the encoding on the statistical nature of a selection of about 3000 classical compositions. This limits the system to that genre and would require significant adaptation to work with anything other than western classical music. Additionally, quite a few of their approaches to specific parts of the problem tend to ignore any sort of practical solution. To introduce errors into queries, they insert Gaussian functions, yet they never mention simply having a statistically significant number of users make their own queries (with inherent errors). Looking beyond these problems, the paper offers some significant potential. The approach to interval recognition and the encoding of polyphonic sources has great possibilities for future use. More substantially, the inclusion of the rhythmic dimension in the pattern search, and the development of the rhythmic ratio to do this, is probably one of this system’s better features. This ratio is determined using “the ratios of time difference between adjacent pairs of onset times [to] form a rhythmic ratio sequence.”
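A rough sketch of this encoding is shown below: successive notes become characters combining a clamped pitch interval with a coarse rhythmic-ratio class, so that an ordinary substring search finds a query regardless of key or tempo. The interval clamp, the three ratio bins, and the character alphabet are illustrative choices of mine, not the paper's exact, corpus-derived encoding.

```python
def encode(pitches, onsets):
    """Sketch of the n-gram encoding idea: each step through the melody
    yields one character combining the (clamped) pitch interval with a
    coarse rhythmic-ratio class, producing a searchable text string."""
    chars = []
    for i in range(1, len(pitches) - 1):
        interval = max(-12, min(12, pitches[i] - pitches[i - 1]))  # clamp to one octave
        ratio = (onsets[i + 1] - onsets[i]) / (onsets[i] - onsets[i - 1])
        r = 0 if ratio < 0.8 else (1 if ratio <= 1.25 else 2)  # shorter / equal / longer
        chars.append(chr(ord("0") + (interval + 12) * 3 + r))  # 75-symbol alphabet
    return "".join(chars)

# A query melody is encoded the same way, then any substring / n-gram text
# search finds candidate matches regardless of key or tempo.
song = encode([60, 62, 64, 65, 67], [0, 1, 2, 3, 4])
query = encode([62, 64, 66, 67], [0, 0.5, 1.0, 1.5])  # transposed, twice as fast
print(query in song)  # -> True
```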

5.2.3 OMRAS [7] With probably one of the best developed works in this field, the people working on the OMRAS (Online Music Retrieval And Searching) project [7] seem miles ahead of the rest. They have obviously taken a great deal of time and effort to develop a core engine that is both efficient and effective. To be sure, the front end to the database is awkward and even cumbersome, yet that can be fixed as the system develops. For the time being, they have created a powerful search engine that, on the surface, appears to have a great deal of potential. Possibly the most alluring aspect of the system is its intuitively visual nature. Each musical score is translated into a proprietary data structure akin to the “piano roll” (see Figure 3) that old player pianos used for music storage and retrieval. It is amusing that such an archaic form of music storage translates into such an effective digital format. More important to the retrieval system, however, are the well designed matrix algorithms used for parsing and searching for pattern matches. The author begins with a simple base algorithm to find strict, note-for-note matches and then expands it to add features and flexibility. The end result allows a user to “search a musical text given a polyphonic query and a number of degrees of freedom.” Though not specified by the author, this system ostensibly should handle more than the standard western-style classical music test cases: it seems that no harmonic theory or fundamental musical knowledge was integrated into the system, so as long as a composition can be translated into the piano roll representation, it should work for retrieval. However, herein lies the one major flaw of this design. The system requires the translation and storage of already accepted data files (i.e. MIDI) into its own proprietary data structure. This increased overhead may result in more efficient and effective retrieval of entries, but it also somewhat limits the flexibility of the application.
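The piano-roll representation and the strict, note-for-note matching step can be sketched quite directly, as below: the roll is modeled as a set of (time, pitch) cells, and matching slides the query roll over every time offset and transposition. This is only the base matching idea in its simplest form; OMRAS's actual matrix algorithms add the degrees of freedom the author describes.

```python
def piano_roll(notes):
    """Sketch of an OMRAS-like 'piano roll': the set of (time, pitch) cells
    in which notes sound, given (onset, duration, pitch) triples."""
    return {(t, p) for (onset, dur, p) in notes for t in range(onset, onset + dur)}

def strict_matches(score_roll, query_roll, max_time, transpose_range=12):
    """Strict, note-for-note polyphonic matching: slide the query roll
    across all time offsets and transpositions, reporting placements
    fully contained in the score roll."""
    hits = []
    for dt in range(max_time):
        for dp in range(-transpose_range, transpose_range + 1):
            if all((t + dt, p + dp) in score_roll for (t, p) in query_roll):
                hits.append((dt, dp))
    return hits

score = piano_roll([(0, 2, 60), (0, 2, 64), (2, 2, 62), (2, 2, 65)])
query = piano_roll([(0, 2, 62), (0, 2, 66)])       # the first dyad, up a tone
print(strict_matches(score, query, max_time=4))    # -> [(0, -2)]
```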

6. SUMMARY As mentioned previously, none of the papers reviewed here perfectly solves the problems inherent in music information retrieval. In fact, many of the authors point this out about their own work. Regardless, each of these papers contributes to the field of MIR in some manner or another. As a whole, however, one thing they do make clear is the truly complex nature of the problem. Almost everybody who approaches any aspect of MIR ends up dealing with one form or another of the discrete group of file formats, while those who work with continuous formats inevitably demonstrate the many intrinsic problems with that approach. Attempting to manipulate continuous formats while performing music information retrieval at the same time boils down to juggling two major problems simultaneously; the HMM Melody Recognition System [5] is a case in point. Therefore the most obvious solution seems to be to continue pursuing theme extraction and other MIR aspects on discrete formats and assume that eventually we will have the ability to convert raw music into some discrete layout. Admittedly, this will cause some definite overhead when researchers eventually apply these algorithms to databases of continuous music; however, this might be acceptable. Data storage is relatively cheap, and having a MIDI (or some other type of) file associated with each data entry should not dramatically increase the database size, since discrete forms are often only a minimal fraction of the size of continuous files. Additionally, algorithms can parse through these formats much more efficiently, and as long as we use good conversion algorithms, the results depend only on the search algorithms.

As for determining some overall general method, I recommend assimilating key points of the various approaches into one system. The OMRAS [7] approach seems to have the most flexible and adaptable core engine to begin with: the extendable matrix-based algorithms are very efficient and effective, though a bit unwieldy. Building a multifaceted user interface on top of the core engine, such as the researchers at The Johns Hopkins University [1] employ, should further enhance the system. Different types of users will use the system in a number of ways, and this will cause various forms of queries to be sent to the search engine. Obviously, one possible query will take the form of melody searching; therefore knowledge, and even pre-indexing, of the themes (using techniques employed in the MME [2]) for each entry in the database could dramatically improve the efficiency and accuracy of the application. Even allowing the option of searching for certain chord progressions, as HarmAn [4] does, could prove useful: a user might want to search for pieces with standard blues progressions, or this capability could be used to detect copyright violations. Ostensibly this ideal MIR system could also use the inverted-lists concept for indexing search terms; however, this added flexibility might just make the system too slow or simply ineffective. Finally, since the core search engine of OMRAS does not really include rhythmic search capabilities, we might employ the rhythmic ratio technique of the n-grams [6] approach. Combining all of these details into one music information retrieval system will require noteworthy effort, and care must certainly be taken to maintain the usability of the project.

7. REFERENCES

1. Droettboom, Michael, Ichiro Fujinaga, Karl MacMillan, Mark Patton, James Warner, G. Sayeed Choudhury, and Tim DiLauro. Expressive and efficient retrieval of symbolic musical data. In Music IR 2001, October 15-17, Bloomington, Indiana.

2. Meek, Colin, and William P. Birmingham. Thematic Extractor. In Music IR 2001, October 15-17, Bloomington, Indiana.

3. Uitdenbogerd, Alexandra L., Abhijit Chattaraj, and Justin Zobel. Music IR: Past, Present and Future. In Music IR 2000, October 23-25, Plymouth, Massachusetts.

4. Pardo, Bryan and William P. Birmingham. Automated Partitioning of Tonal Music. In FLAIRS 2000 (AAAI), May 22-24, Orlando, Florida, pages 23-27.

5. Durey, Adriane Swalm and Mark A. Clements. Melody Spotting Using Hidden Markov Models. In Music IR 2001, October 15-17, Bloomington, Indiana.

6. Doraisamy, Shyamala, and Stefan M. Rüger. An Approach Towards A Polyphonic Music Retrieval System. In Music IR 2001, October 15-17, Bloomington, Indiana.

7. Dovey, Matthew J. A technique for “regular expression” style searching in polyphonic music. In Music IR 2001, October 15-17, Bloomington, Indiana.

8. MP3.COM. http://www.mp3.com/aboutus.html