
Improvisation in Interactive Music Systems

© Toby Gifford

A thesis submitted in partial fulfillment of the degree of

Doctor of Philosophy

Faculty of Creative Industries
Queensland University of Technology

March 2011

Brisbane, Queensland

Key Words

interactive music systems; generative; algorithmic composition; beat tracking; metre induction; onset detection; polyphonic; pitch tracking; machine listening; improvisation

Abstract

This project investigates machine listening and improvisation in interactive music systems with the goal of improvising musically appropriate accompaniment to an audio stream in real-time. The input audio may be from a live musical ensemble, or playback of a recording for use by a DJ. I present a collection of robust techniques for machine listening in the context of Western popular dance music genres, and strategies of improvisation to allow for intuitive and musically salient interaction in live performance.

The findings are embodied in a computational agent – the Jambot – capable of real-time musical improvisation in an ensemble setting. Conceptually the agent’s functionality is split into three domains: reception, analysis and generation. The project has resulted in novel techniques for addressing a range of issues in each of these domains.

In the reception domain I present a novel suite of onset detection algorithms for real-time detection and classification of percussive onsets. This suite achieves reasonable discrimination between the kick, snare and hi-hat attacks of a standard drum-kit, with sufficiently low latency to allow perceptually simultaneous triggering of accompaniment notes. The onset detection algorithms are designed to operate in the context of complex polyphonic audio.

In the analysis domain I present novel beat-tracking and metre-induction algorithms that operate in real-time and are responsive to change in a live setting. I also present a novel analytic model of rhythm, based on musically salient features. This model informs the generation process, affording intuitive parametric control and allowing for the creation of a broad range of interesting rhythms.

In the generation domain I present a novel improvisatory architecture drawing on theories of music perception, which provides a mechanism for the real-time generation of complementary accompaniment in an ensemble setting.

All of these innovations have been combined into a computational agent – the Jambot – which is capable of producing improvised percussive musical accompaniment to an audio stream in real-time. I situate the architectural philosophy of the Jambot within contemporary debate regarding the nature of cognition and artificial intelligence, and argue for an approach to algorithmic improvisation that privileges the minimisation of cognitive dissonance in human-computer interaction.

This thesis contains extensive written discussions of the Jambot and its component algorithms, along with some comparative analyses of aspects of its operation and aesthetic evaluations of its output. The accompanying CD contains the Jambot software, along with video documentation of experiments and performances conducted during the project.

Contents

Keywords
Abstract
List of Figures
Supplementary CD
Statement of Original Authorship
Acknowledgements

1 Introduction
  1.1 Project Summary
  1.2 Knowledge Claims
  1.3 Guide to the Accompanying CD
  1.4 Associated Publications

2 Background
  2.1 Introduction
  2.2 Interactive Music Systems
  2.3 Approaches to Improvisation
  2.4 Musical Pulse
  2.5 Musical Metre
  2.6 Utilising Metrical Ambiguity

3 Methodology
  3.1 Introduction
  3.2 Research Framework
  3.3 Epistemology
  3.4 Theoretical Perspective
  3.5 Methodology
  3.6 Methods
  3.7 Summary

4 Architecture
  4.1 Introduction
  4.2 Jambot Architecture
  4.3 Architectures for Perception and Action
  4.4 Summary

5 Reception
  5.1 Introduction
  5.2 Stochastic Onset Detection
  5.3 Polyphonic Pitch Tracking
  5.4 Summary

6 Analysis of Metre
  6.1 Introduction
  6.2 Beat Tracking
  6.3 The Substratum
  6.4 Estimating the Substratum
  6.5 Estimating the Substratum Phase
  6.6 Mirex Database
  6.7 Attentional Modulation
  6.8 Finding the Bar Period
  6.9 Summary

7 Analysis of Rhythm
  7.1 Introduction
  7.2 Representing Onsets as Notes
  7.3 Rhythmic Analyses
  7.4 Summary

8 Generation
  8.1 Introduction
  8.2 Reactive Generation Techniques
  8.3 Proactive Generation Techniques
  8.4 Interactive
  8.5 Summary

9 Conclusion
  9.1 Reception
  9.2 Analysis
  9.3 Generation
  9.4 Architecture
  9.5 Further Research
  9.6 Final Thoughts

A Derivation of Phase Estimate

Bibliography

List of Figures

2.1 Coherence metre
3.1 Theoretical Framework – adapted from (Crotty 1998:4)
3.2 Scaling up problem (Kitano 1993)
4.1 Reception, Analysis and Generation
5.1 Attacks can be masked in multipart signals
5.2 Splitting out the RCC, Amplitude vs Samples
5.3 Comparison of Detection Functions
6.1 Substratum pulse is quavers
6.2 SOD stream salience of a segment of the Amen Break
6.3 ACF for the SOD stream salience
6.4 Clumped ACF for the SOD stream salience
6.5 MIREX training set example 5
8.1 Rozin’s Wooden Mirror
8.2 Children interacting with the Continuator
8.3 Edward Ihnatowicz’s Sound Activated Module
8.4 Edward Ihnatowicz’s Senster
8.5 Simon Penny’s Petit Mal
8.6 Offbeat snare hits with anticipatory timing
8.7 Offbeat snare hits without anticipatory timing
8.8 Dancehall beat with anticipatory timing
8.9 Dancehall beat without anticipatory timing
8.10 A depiction of the Chimæra on an ancient Greek plate
8.11 Data from three updates of the inferred metric contexts
8.12 Simple example of ambiguity strategies

Supplementary CD

Root Folder
  thesis.pdf

1 Introduction
  MainDemo.mov
  RobobongoAllstars.mov
  RobobongoAllstars2.mov

2 Reception/Polyphonic Pitch Tracking
  4 ton mantis.mp3
  4 ton mantis jam.mp3
  mas que nada.mp3
  mas que nada jam.mp3
  sweet.mp3
  sweet jam.mp3

2 Reception/Stochastic Onset Detection
  JungleBoogie bq.mp3
  JungleBoogie hf.mp3
  JungleBoogie nd.mp3
  JungleBoogie.mp3
  train* bq.mp3
  train* hf.mp3
  train* nd.mp3

3 Analysis of Metre
  MirexPracticeData.zip

3 Analysis of Metre/MirexJambot
  track*.mp3

3 Analysis of Metre/MirexBeatRoot
  BeatRootTrack*.mp3

4 Generation/Reactive
  BernardLubat.mp4
  Children.mp4
  petit mal.mp4
  SAM.mpg
  senster.mpg

4 Generation/Proactive
  chimaera.mov
  original.mp3
  disambiguate.mp3
  ambiguate.mp3
  follow.mp3

4 Generation/Proactive/Anticipatory Timing
  AlternatingSnareAndHatAnticipation.mp3
  AlternatingSnareAndHatNoAnticipation.mp3
  DanceHallAnticipation.mp3
  DanceHallNoAnticipation.mp3
  EnsembleAnticipation.mp3
  EnsembleNoAnticipation.mp3
  AmenBreak.mp3

5 Discussion
  BeatTracking.mov
  AttentionalModulation.mov

A Software
  impromptu.app
  jambot-demo.scm
  Ybot.component
  AmenBreak.wav
  README.rtf

Statement of Original Authorship

The work contained in this thesis has not been previously submitted to meet requirements for an award at this or any other higher education institution. To the best of my knowledge and belief, the thesis contains no material previously published or written by another person except where due reference is made.

Signature:

Date:

Acknowledgements

First and foremost thanks must go to my supervisors, Andrew Brown and Steve Dillon, whose patience, energy and commitment are nothing short of remarkable. Many thanks also to Cori Stewart and Siall Waterbright for their invaluable guidance in the intricacies of the English language. To Andrew Sorensen for creating Impromptu, which has played a large role in my work. To the QUT postgrad community who made my experience what it was. And lastly to all my friends and family who have patiently waited for me to emerge from the cocoon that is a PhD.

Chapter 1

Introduction

The difference between a tool and a machine is not capable of very precise distinction.

Babbage (1833)

1.1 Project Summary

This project aimed to develop a computational musical agent capable of improvising in an ensemble setting. The research was practice-led, drawing on my practice as a computational instrument builder and live acoustic/electronic musician. Consequently there was a design requirement for the computational agent – dubbed the Jambot – to be a complete operational system of sufficient quality for use in live performance. Furthermore, the goal was to be able to process complex audio, such as a full band, or to augment a recording for use by a DJ.

Viewed from another perspective, this project was an investigation into algorithms for machine listening and algorithmic improvisation. Conceptually the constituent algorithms are divided into three domains: reception, analysis and generation. Reception refers to the conversion of raw audio into timestamped notes. Analysis refers to beat-tracking and metre induction, as well as various rhythmic analyses. Generation refers to processes of algorithmic improvisation, aimed at producing complementary percussive accompaniment to a musical input signal. This project aimed to develop new techniques in all of these domains, subject to the constraint of needing to work together in a functioning, robust, concert-ready system.
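
As a purely illustrative sketch (the class and method names below are hypothetical, not the Jambot’s actual API), the three domains can be pictured as a chain from audio to notes to musical context to actions; the Jambot’s real architecture, described in Chapter 4, also adds feedback between the stages:

```python
# Hypothetical sketch of the reception -> analysis -> generation chain.
# Names are illustrative only; the Jambot additionally feeds information
# back from analysis to reception (see Chapter 4 and section 6.7).

class Reception:
    def process(self, audio_frame):
        """Convert raw audio into timestamped note events."""
        return []  # e.g. [(time_s, 'kick'), (time_s, 'hat'), ...]

class Analysis:
    def process(self, notes):
        """Infer musical context: beat, metre and rhythmic features."""
        return {'beat_phase': 0.0, 'bar_length': 4}

class Generation:
    def process(self, context):
        """Choose improvised percussive actions for the current moment."""
        return []  # e.g. [('snare', time_s), ...]

def tick(reception, analysis, generation, audio_frame):
    notes = reception.process(audio_frame)
    context = analysis.process(notes)
    return generation.process(context)
```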

From another perspective again, this project aimed to provide commentary on theories of music perception and music analysis, by encoding them into computational implementations and assessing their veracity in the context of an interactive music system. In particular, this project aimed to investigate appropriate architectures for music representation, machine listening and algorithmic improvisation, with a view to providing commentary on plausible architectures for human music perception and cognition.

1.1.1 Personal Motivations

I am an acoustic musician, playing primarily the clarinet, in a variety of styles including jazz, funk and rock, with a strong interest in folk dance genres. I am also a live electronic musician, performing both as a DJ and as a hybrid acoustic/electronic musician. My interest is primarily popular dance music, in a wide range of genres from disco through to Balkan gypsy.

As both an acoustic and electronic musician I have a natural interest in combining these two modes of performance. In my experience the two modes operate in somewhat incompatible paradigms; certain aspects of performance that are natural in one are quite unnatural in the other. The idea of this project has been to create a technological bridge between the modes of performance that transforms this incompatibility into a complementarity.

An attribute of electronic performance that often differentiates it from traditional acoustic performance is that technology can release the performer from the ‘one-gesture-one-sound’ paradigm (Wessel 1991), nigh on ubiquitous for acoustic instruments. In playing a traditional acoustic instrument each gesture (such as moving a finger) is generally responsible for an individual sonic event. However, with technology such as samplers, sequencers and complex synthesiser patches a single gesture can set in motion a long series of sonic events. In the case of a DJ this enables a solo performer to produce a full and engaging sound in a manner that would be difficult to match as a solo acoustic instrumentalist, at least in typical DJ performance contexts such as dance clubs.

On the other hand, live acoustic performance involves a number of attributes often lost in electronic performance that I feel are intrinsic to the experience of live music. In particular, I believe the communicative dynamics of an ensemble, particularly when playing improvised music, play a large role in my enjoyment of live performances. One aspect of the ensemble dynamics notably missing from electronic performance, at least when operating outside of the one-gesture-one-sound paradigm, is human tempo variation.

For example, the use of loops and sequencers in combination with live acoustic performance is quite widespread, particularly in genres such as hip-hop (Reynolds 2006). However, to my mind, as soon as a loop is added to the mix something of the live feel is lost, as the performers are locked into the tempo of the loop.

I suggest that the fundamental problem is that the technology is not ‘listening’, and so is not responsive to what the rest of the ensemble does. The idea of the Jambot is to address this issue. One potential application of the technologies developed in this project is tempo tracking for controlling sample playback, although I haven’t concentrated on this application. Rather, I have been interested in creating algorithms that both listen and respond.

Mixing acoustic with electronic performance is not the only motivation (or application) of this research. Another primary goal is to develop tools for use by DJs. The Jambot listens to an audio signal – this might be from a band, or it could be listening to an audio recording. So a DJ could make use of the Jambot by having it listen and jam to a record. In an abstract sense this application is similar to the first, in that I seek for this technology to increase the fluidity, flexibility and dynamism of live performance with electronics.

Through the course of this research my focus changed to lean more heavily towards DJ applications, rather than live ensembles. In the middle stages of the project I conducted a number of performance trials with a live ensemble – the Robobongo Allstars – examples of which can be found on the accompanying CD. As the project progressed I concentrated more on having the Jambot augment playback of recordings. The reason for this was primarily pragmatic; from an experimental perspective, the ease and repeatability of using an audio file input, compared with corralling a live ensemble, recording the improvisation and later analysing the results, meant that progress on the underlying algorithms was substantially quicker with prerecorded input.

Concentration on DJ applications also entailed a shift in focus from an autonomous system to a semi-autonomous system, with the DJ able to manipulate aspects of the output via parametric control. This involved a shift in the phenomenological aims. When developing the Jambot as an autonomous agent I was interested in interaction strategies for creating an impression of agency. In semi-autonomous mode, the focus shifted to providing an intuitive and expressive interface for parametric control of the generated output.

    1.1.2 Creative Practice Research and my Role as Researcher

This project has been conducted in the creative practice paradigm (Hamilton & Jaaniste 2009). In creative practice research it is common for the researcher to utilise subjective measures, such as aesthetic evaluation, in reflecting on their practice (Stock 2010). The design goal for the Jambot was to provide appropriate and complementary musical accompaniment within a limited musical scope. The evaluation of appropriateness and complementarity in this research has been framed by my aesthetic preferences and the intended musical scope. In the sections below I discuss these issues further.

    Creative Practice Research

Creative practice research is research that is “initiated in practice, where questions, problems, challenges are identified and formed by the needs of practice and practitioners ... and the research strategy is carried out through practice” (Gray 1998:3). The Queensland University of Technology’s doctoral research regulations formally recognise creative practice research projects (QUT 2011), requiring that the candidate nominate a percentage split between the theoretical and practical outcomes of the project. This PhD project is a creative practice project with a split of 70% theoretical to 30% practical.

My practice has two mutually informing aspects – computational instrument building, and live performance. The practical outcome of this project is the Jambot itself. In addition to the Jambot, I have provided video and audio documentation of musical performances with the Jambot. These are not intended to be considered as the practical outcomes of this project per se, but rather as another source of documentation of the Jambot.

    Musical Scope

I developed the Jambot in the context of my creative practice in hybrid acoustic/electronic live performance. The musical scope of interest to me is popular dance music in a wide variety of genres, including electronic styles such as breakbeat, drum ‘n’ bass, house and techno, as well as acoustic styles such as funk, rock, latin and gypsy. Some common threads are that the music be intended for dance, strongly pulse driven, and populist rather than experimental. A further practical limitation, for the Jambot in its current form, is that the music contain percussive elements. As such the Jambot is not currently able to completely replace a rhythm section; the ability to do so is a goal for future development.

    Evaluation of Musical Appropriateness

In presenting the research questions and discussing the outcomes of this project I frequently refer to musical appropriateness or complementarity. These terms should be understood in the context of the goal of the Jambot, which is to provide stylistically convincing output within the musical scope outlined above. In assessing how stylistically convincing the Jambot’s output is, I have utilised my own subjective aesthetic evaluation using a set of critical values discussed in §3.5.1 and §3.6.5.

1.1.3 Research Questions

In describing this research I have divided the investigation into three domains: reception, analysis and generation. New knowledge has been created in each of these domains. This delineation is reflected in the architecture of the Jambot (§4.2.1), and also corresponds roughly to the chronological order in which the research was carried out (§3.6.1). Additionally, this research sought to find an effective architecture for combining these three domains.

    Reception

How can salient events be extracted from a raw audio stream? The Jambot is designed for use with an audio input, rather than symbolic input such as MIDI. The first stage of its processing chain is to extract salient events from the raw audio stream. I investigated techniques for detecting the onsets of notes, both percussive and pitched, and techniques for polyphonic pitch tracking.

The extraction of features from complex audio is an active area of research in Computational Auditory Scene Analysis, posing significant engineering challenges. One research focus within this field is automatic transcription of a musical audio signal into notes (Plumbley et al. 2002). This project aimed to provide incremental advances in this area, sufficient for the overall goal of generating appropriate improvised accompaniment. In particular it aimed to provide precise timing information for use in beat-tracking.

    Analysis

How can the notes be analysed into musical features? In order to provide appropriate musical accompaniment, the Jambot performs a number of musical analyses on the surface notes. In particular the Jambot performs beat-tracking and metre induction to enable real-time metrically appropriate responses. The Jambot also performs rhythmic analyses, so as to understand the musically salient rhythmic features of the input signal.

Computational musical analyses such as these are the domain of machine listening. Much research has been conducted in machine listening, particularly in beat-tracking and metre-induction (Dixon 2001). However, a general robust beat-tracking algorithm is yet to be found, and is perhaps unachievable (Goto & Muraoka 1995; Collins 2006b:99). This project aimed to extend work in this field to enable the Jambot to follow mildly varying tempos, and quickly recover from abrupt tempo changes, in percussive dance music.

The Jambot’s collection of rhythmic analyses together form a model of rhythm, which informs the Jambot’s generation process. The aim of the model was to provide a parametrisation of rhythm space into musically salient features, to allow for intuitive parametric control over rhythm generation.

    Generation

How can appropriate musical responses be generated? The purpose of the Jambot is to provide improvised accompaniment in an ensemble setting. Having parsed the raw audio signal into notes, and analysed these notes into musical features, how can the Jambot generate musically appropriate responses?

The field of interactive music systems is concerned with the creation of computational musical agents that improvise in an ensemble. A range of processes have been studied, including transformations of the input signal, and generative techniques from algorithmic composition.

Transformative processes tend to be quite robust, whilst generative processes can allow for more interesting and novel accompaniment. Most interactive music systems use either transformative or generative processes, but not both (§2.2). This project aimed to find ways of combining transformative and generative processes to provide for robust and interesting accompaniment of real-world complex audio signals in a concert setting.

    Architecture

What is an effective architecture for combining reception, analysis and generation? The reception, analysis and generation modules are not independent – they exist in a single coherent architecture. The Jambot’s architecture is a complex hierarchical structure with bi-directional feedback between the layers of the hierarchy.

Many architectures have been studied for interactive music systems, music analysis, and music generation (§4.3). More broadly, many architectures have been applied to problems in Artificial Intelligence (AI) (§4.3.2). Some of the more successful experiments in robotics have criticised traditional AI architectures based on knowledge representations (§4.3.3). Instead they emphasise the importance of feedback, both between the robotic agent and its environment, and between modules within the agent’s perceptual architecture. This project aimed to apply architectural notions from situated robotics to an interactive music system, in the hope of creating a more robust system.

1.2 Knowledge Claims

This project has generated new knowledge in a number of disciplines. As an interdisciplinary project, the findings of this research have relevance to science, engineering and the creative arts.

The knowledge claims for this project are divided into the same three domains as the research questions, namely reception, analysis and generation. Additionally, I present some architectural findings, relating to the manner in which reception, analysis and generation interact in a complete functioning system.

    1.2.1 Reception Knowledge Claims

In the reception domain I present a novel suite of low latency percussive onset detectors for use in complex audio, able to achieve reasonable discrimination between the kick, snare and hi-hat attacks of a standard drum-kit. These detectors all operate with sufficiently low latency to allow real-time imitation (and timbral remapping) of a drum track with imperceptible delay.

The suite consists of three onset detectors: the SOD, MID and LOW detectors. These are tuned to detect the hi-hat, snare and kick attacks respectively. More broadly they can be thought of as detecting high, mid and low frequency percussive onsets, although they are not simply band-pass filtered energy detectors.

Of the three detectors, the SOD algorithm (§5.2) represents the most significant contribution to knowledge. It significantly departs from existing detection schemes by searching directly for growth in the noisiness of the signal. It is more robust to confounding high frequency noise than existing detectors.
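
The SOD algorithm itself is specified in §5.2 rather than here. Purely as a hypothetical sketch of the general idea of detecting growth in noisiness, one might proxy noisiness by spectral flatness and flag frames where it jumps; the frame size, hop and threshold below are illustrative values, not the thesis’s parameters.

```python
import numpy as np

def spectral_flatness(frame):
    # Geometric mean over arithmetic mean of the magnitude spectrum:
    # close to 1 for noise-like frames, close to 0 for tonal frames.
    mag = np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) + 1e-12
    return np.exp(np.mean(np.log(mag))) / np.mean(mag)

def noisiness_growth_onsets(signal, frame_size=256, hop=128, threshold=0.15):
    # Flag an onset whenever frame-to-frame noisiness grows by more than
    # `threshold`; returns the starting sample indices of flagged frames.
    onsets, prev = [], None
    for start in range(0, len(signal) - frame_size, hop):
        flatness = spectral_flatness(signal[start:start + frame_size])
        if prev is not None and flatness - prev > threshold:
            onsets.append(start)
        prev = flatness
    return onsets
```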

The MID detector (§5.2.7) combines a standard band-pass filtered energy detector with a variation of the SOD algorithm. The LOW detector (§5.2.6) is simply a band-pass filtered energy detector. These detectors do not represent as great a contribution to knowledge as the SOD algorithm. However, the implementation details required to achieve sufficient accuracy and discrimination at the desired latency represent a substantial research effort.

I also present a novel technique aimed at improving the accuracy and latency of pitch estimates in polyphonic pitch tracking (§5.3). The resulting algorithm is, however, of insufficient quality for production use. Consequently the Jambot uses only information from the onset detection suite, and produces only rhythmic accompaniment.

    1.2.2 Analysis Knowledge Claims

In the analysis domain I present new findings relating to computational metric analysis, and to computational rhythmic analysis. The metric analysis findings relate to beat-tracking and metre induction. The rhythmic analysis findings present a model of rhythm based on musically salient features.

For the purposes of beat-tracking, I introduce a new music theoretical notion called the substratum (§6.3). The substratum acts as a referent pulse level, distinct from existing notions of referent pulse such as the tactus and tatum. The substratum is a mathematical property of the musical surface, rather than a perceptual property such as the tactus, and so is more amenable to computational implementation.

I present a novel beat-tracking algorithm. This algorithm combines and extends a number of existing approaches, but has significant differences. It achieves reasonable tracking in percussive dance music with mildly varying tempo, and also recovers quickly from abrupt tempo (or phase) changes. The beat-tracking algorithm presents novel techniques for the estimation of both beat period (§6.4) and beat phase (§6.5). The phase estimation is better suited to high precision onset timing input (such as achieved by the Jambot’s onset detection suite) than existing methods, and is also designed to be most accurate at the time of estimation, unlike many existing methods which are most accurate in the middle of a short window of history prior to the time of estimation.

I also present a novel metre-induction algorithm (§6.8), which estimates the number of beats in a bar (but not the position of the downbeat) using the notion of beat-class interval invariance. More complex notions of metre are not estimated.

The Jambot’s rhythmic analyses together form a model of rhythm (§7.3.4). This model informs the generation process. As such, the model is designed to parametrise rhythm space into musically salient features. This affords intuitive parametric control over the Jambot’s rhythm generation.

1.2.3 Generation Knowledge Claims

The generation stage is the third component of the Jambot’s architecture. The task of the generation stage is to produce musically appropriate actions, given the current musical context inferred by the reception and analysis stages. The generation stage utilises three improvisatory strategies: reactive, proactive and interactive.

The reactive strategy uses a novel technique called transformational mimesis (§8.2.6). Transformational mimesis is imitation that has been transformed to obfuscate the connection between the imitation and its source. It is similar in philosophy to the existing notion of reflexivity in interactive music systems (§2.2), but differs in that it utilises synchronous imitation, made possible by the low latency onset detectors. I present a description of an interaction paradigm termed Smoke and Mirrors (§8.2.2) to which transformational mimesis and reflexivity both conform.

The proactive strategy uses a novel approach to rhythm generation (§8.3.7) based on the model of rhythm defined in the analysis stage. The approach sets target values for salient rhythmic features, and continuously searches for musical actions that tend to move the ensemble towards these target values. The target values can be controlled by a human in real-time. This results in appropriate rhythmic accompaniment, with intuitive parametric control over salient aspects of the accompaniment.

In order to allow for intuitive parametric control of this generative process, I have introduced a novel search technique called anticipatory timing (§8.3.6). Anticipatory timing is designed to provide a good trade-off between computational efficiency and optimality. It involves an extension of greedy optimisation that allows for finessing of the timing of actions. It is particularly appropriate when searching for optimal rhythmic actions under real-time constraints.
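
The search itself is specified in §8.3.6 and §8.3.7; the sketch below merely suggests its flavour under assumed names (a predict_features function and a dictionary of feature targets, both hypothetical). Each candidate action is scored by how close the predicted feature values would land to the targets, and the best is taken greedily; anticipatory timing can be pictured as enlarging the candidate set with finely shifted timings of each action.

```python
# Illustrative sketch only: greedy action selection against rhythmic
# feature targets. Feature names and the candidate set are hypothetical.

def distance_to_targets(features, targets):
    # Squared distance between predicted feature values and their targets.
    return sum((features[name] - targets[name]) ** 2 for name in targets)

def choose_action(rhythm, candidate_actions, predict_features, targets):
    """Greedy step: try each candidate action (including doing nothing),
    predict the resulting feature values, and keep the best scorer.
    predict_features(rhythm, action) returns a dict such as
    {'density': 0.6, 'syncopation': 0.3}; action None means no action."""
    best_score = distance_to_targets(predict_features(rhythm, None), targets)
    best_action = None
    for action in candidate_actions:
        score = distance_to_targets(predict_features(rhythm, action), targets)
        if score < best_score:
            best_score, best_action = score, action
    return best_action  # None means remain silent this step
```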

I also present a meta-generative algorithm called the Chimæra (§8.3.10) for automatically controlling some of the target values for the rhythmic features. The Chimæra aims to achieve specifiable levels of metric ambiguity by selectively emphasising more or less plausible metric assumptions gathered from the metric analysis stage.

I present a novel mechanism for combining proactive and reactive strategies called the juxtapositional calculus (§8.4.1). The idea of the juxtapositional calculus is to provide a means of deviating from a baseline of transformational mimesis, so that elements of musical understanding can be acted upon in a way that minimises cognitive dissonance.

Architectural Knowledge Claims

The Jambot’s architecture is a complex hierarchical structure with communicative feedback between layers in the hierarchy. This architecture encompasses all of the stages of reception, analysis and generation. Feedback between the stages means that they are not independent. This research demonstrates that such an architecture is effective for an interactive music system, and that the use of feedback increases robustness.

The most common architectures used for music representation, and for modelling cognitive structures involved in music perception, are trees and Markov chains. I present an argument that these simple architectures are inappropriate for music representation or cognition (§4.3). Rather, a complex hierarchical structure with communicative feedback, such as the Jambot’s architecture, is required to capture both local and global features of music, and to provide for robust perception.

The Jambot’s architecture implements simple attentional mechanisms (§8.4.2). I present an argument that byproducts of attention such as curiosity and distractibility can create an impression of ‘lifelike’ behaviour in an agent (§8.2.5).

Attention facilitates one source of feedback in the Jambot’s architecture through the use of attentional modulation of onset expectation (§6.7), which involves feedback from the analysis stage to the reception stage. The modulation changes the signal/noise ratio required to report an onset, making it easier for an onset to be detected close to a predicted beat. The use of attentional modulation increases the robustness of the beat-tracking algorithm.
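
A minimal sketch of what such modulation might look like, assuming a detector that reports an onset when a detection function exceeds a threshold: the threshold is lowered within a small window around each beat predicted by the analysis stage. The window width, taper and depth here are illustrative assumptions; the actual scheme is described in §6.7.

```python
def modulated_threshold(t, base_threshold, predicted_beats,
                        window=0.05, depth=0.5):
    # Lower the onset-report threshold near beats predicted by the
    # analysis stage; window (seconds) and depth are illustrative values.
    if not predicted_beats:
        return base_threshold
    nearest = min(abs(t - b) for b in predicted_beats)
    if nearest < window:
        # Full reduction exactly on the beat, tapering to none at the edge.
        return base_threshold * (1 - depth * (1 - nearest / window))
    return base_threshold

# An onset would then be reported when the detection function exceeds
# modulated_threshold(t, ...), so less evidence is needed near a beat.
```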

    The Jambot

Finally, the Jambot itself represents a significant contribution to knowledge. All of the knowledge claims listed above are demonstrated by, and embodied in, a complete operational concert-ready system.

Formally, this project has been practice-led, with a 70% theoretical to 30% practical weighting. For the practical component I present the Jambot itself. The accompanying CD contains complete source code documentation, as well as the Jambot program.

A short demonstration of the main features of the Jambot can be viewed in the file MainDemo.mov on the accompanying CD.

1.3 Guide to the Accompanying CD

The data CD that accompanies this thesis contains the Jambot software, documentation for the code, and supplementary audio-visual materials referred to in the thesis. In this section I give a brief description of these materials, organised by folders of the CD.

A number of these files are Quicktime movies. The Quicktime player for Windows and MacOSX can be downloaded from (Apple 2011).

    Root Folder

    thesis.pdf: pdf version of this document.

    1 Introduction

MainDemo.mov: demonstrates the main features of the Jambot, and gives an example of jamming to a prerecorded track.

RobobongoAllstars.mov: documentation of the Jambot jamming with a live ensemble – the Robobongo Allstars. This performance took place at the 2009 Australasian Computer Music Conference at QUT in Brisbane.

RobobongoAllstars2.mov: another performance of the Jambot with the Robobongo Allstars. This performance took place at the 2008 Ignite Conference at QUT in Brisbane.

2 Reception

Polyphonic Pitch Tracking: this folder contains examples of the Jambot augmenting a short segment of a prerecorded track, by mimicking the pitch material that it perceives (§5.3.6). The original track segments, and the corresponding augmented files, are:

mas que nada.mp3 / mas que nada jam.mp3
4 ton mantis.mp3 / 4 ton mantis jam.mp3
sweet.mp3 / sweet jam.mp3

Stochastic Onset Detection:

kick and synth.mp3: an audio sample of a synth note playing over a kick drum sound (§5.2.1). This example demonstrates the difficulty with using amplitude envelope based onset detection.

JungleBoogie bq.mp3 / JungleBoogie hf.mp3 / JungleBoogie nd.mp3: these files are augmentations of a short audio sample, overlaid with a click at the times when an onset is detected, using three different onset detection techniques: the bonk∼ (Puckette et al. 1998), HFC (Masri & Bateman 1996) and SOD techniques respectively (§5.2.5). Also included is the unaugmented sample file JungleBoogie.mp3.

train* bq.mp3 / train* hf.mp3 / train* nd.mp3: these audio files are similar to the JungleBoogie *.mp3 files above, but applied to a subset of the training data from the MIREX (2006) tempo tracking data set (§5.2.5).

    3 Analysis of Metre

MirexPracticeData.zip: This is the MIREX (2006) tempo tracking practice data set. The subset of these audio files containing percussive elements is used for algorithm evaluation in several places in this thesis.

MirexJambot: track*.mp3: these files are audio examples of the Jambot’s beat tracking, applied to a subset of the MIREX (2006) tempo tracking data set (§6.6).

MirexBeatRoot: BeatRootTrack*.mp3: these files are audio examples of the BeatRoot’s beat tracking, applied to a subset of the MIREX (2006) tempo tracking data set (§6.6).

    4 Generation

Reactive: this folder contains video documentation of exemplar artworks supporting the arguments made in the section on Reactive Generation (§8.2).

BernardLubat.mp4: an example of the Continuator (Pachet 2002) interacting with an experienced jazz musician (§8.2.2).
Children.mp4: the Continuator (Pachet 2002) interacting with children (§8.2.2).
petit mal.mp4: video documentation of Simon Penny’s (2011) Petit Mal (§8.2.2).
SAM.mpg: documentation of Edward Ihnatowicz’s (2011) Sound Activated Module (§8.2.2).
senster.mpg: documentation of Edward Ihnatowicz’s (2011) Senster (§8.2.2).

Proactive/Chimaera: this folder contains examples of the Chimæra process for ambiguous rhythm generation (§8.3.10).

chimaera.mov: video demonstration of the Chimæra process for ambiguous rhythm generation.

The following three files are examples of the different modes that the Chimæra employs: disambiguation, ambiguation and following (§8.3.10). These files are recordings of the Jambot augmenting a simple drum loop (example file original.mp3):

disambiguate.mp3 / ambiguate.mp3 / follow.mp3

Proactive/Anticipatory Timing: this folder contains examples of the Jambot’s rhythm generation with, and without, the Anticipatory Timing search technique (§8.3.6). Also included is the input file AmenBreak.mp3. The files are:

AlternatingSnareAndHatAnticipation.mp3 / AlternatingSnareAndHatNoAnticipation.mp3
DanceHallAnticipation.mp3 / DanceHallNoAnticipation.mp3
EnsembleAnticipation.mp3 / EnsembleNoAnticipation.mp3

    5 Discussion

    BeatTracking.mov: demonstrates the Jambot’s beat tracking facility (§6.2.2).

AttentionalModulation.mov: demonstrates the stabilising effect of modulating the onset detection thresholds according to metric expectations (§6.7).

    A Software

The files in this folder work together as a working demonstration of the Jambot software. Refer to the README.rtf file for installation and usage instructions.

README.rtf: Installation and usage instructions for the Jambot demonstration.
impromptu.app: Andrew Sorensen’s free audio programming environment (Sorensen 2005).
Ybot.component: the Jambot itself, implemented as an audio unit plugin.
jambot-demo.scm: Impromptu source code for loading and demonstrating the Jambot.
AmenBreak.wav: example audio file for use in this demonstration.

1.4 Associated Publications

Aspects of this doctoral research have been published in peer-reviewed conference proceedings. The publication references are listed below:

Gifford, T & Brown, A (2011). ‘Beyond Reflexivity: Mediating between imitative and intelligent action in an interactive music system’. In British Computer Society Human-Computer Interaction Conference, July 2011. British Computer Society, Newcastle upon Tyne.

Gifford, T & Brown, A (2010). ‘Anticipatory Timing in Algorithmic Rhythm Generation’. In Opie, T, ed., Australasian Computer Music Conference, June 2010, 21-28. ACMA, Canberra.

Brown, A, Gifford, T, Davidson, R & Narmour, E (2009). ‘Generation in Context’. In Stevens, C, Schubert, E, Kruithof, B, Buckley, K & Fazio, S, eds., 2nd International Conference on Music Communication Science, Dec 2009, 7-10. HCSNet, Sydney.

Gifford, T & Brown, A (2009). ‘Do Androids Dream of Electric Chimera?’. In Sorensen, AC, ed., Australasian Computer Music Conference, July 2009, 53-56. ACMA, Brisbane.

Gifford, T & Brown, A (2008). ‘Stochastic Onset Detection: An approach to detecting percussive onset attacks in complex audio’. In Wilkie, S & Hood, A, eds., Australasian Computer Music Conference, June 2008. ACMA, Sydney.

Gifford, T & Brown, A (2007). ‘Polyphonic Listening: Real-time accompaniment of polyphonic audio’. In Australasian Computer Music Conference, June 2007. ACMA, Canberra.

Gifford, T & Brown, A (2006). ‘The Ambidrum: Ambiguous generative rhythms’. In Australasian Computer Music Conference, July 2006. ACMA, Adelaide.

Chapter 2

Background

2.1 Introduction

The Jambot is an interactive music system – a computer system for live performance with humans. The first section of this chapter gives a brief overview of the field of interactive music systems, highlighting some areas in which relatively little research has taken place, and which this project has explored. One such area is the use of generative algorithms based on salient musical attributes. Another is the combination of transformative techniques with generative techniques.

The second section outlines some approaches to improvisation identified from the literature, and my own practice as an improvising musician. A key improvisatory strategy is identified: striking a balance between novelty and coherence. Some theories of music perception relating to musical expectations and ambiguity are introduced, making the case that manipulation of the level of ambiguity present in the improvisation provides a mechanism for altering the balance between novelty and coherence. The level of ambiguity is one of the salient musical attributes that the Jambot uses in its generative processes.

The remaining sections discuss theories and concepts relating to musical pulse and metre. The tactus, tatum and density referent pulses are described. Some competing theories of metre are introduced, and a general view of metre as an expectational framework is presented. Finally, the manipulation of metric ambiguity as an improvisatory device is suggested.

    2.2 Interactive Music Systems

The Jambot is an interactive music system. Interactive music systems are computer systems for musical performance, in which one or more human performers interact with the system in live performance. The computer system is responsible for part of the sound production, whether by synthesis or by robotic control of a mechanical instrument. The human performers may be playing instruments, manipulating physical controllers, or both. The system’s musical output is affected by the human performer, either directly via manipulation of synthesis or compositional parameters through physical controllers, or indirectly through musical interactions.

There exists a large array of interactive music systems, varying greatly in type, ranging from systems that are best characterised as hyperinstruments to those that are essentially experiments in artificial intelligence. The type of output varies from systems that perform digital signal processing on input from an acoustic instrument, through systems that use techniques of algorithmic improvisation to produce MIDI output, to systems that mechanically control physical instruments. Most systems are designed for a single human performer, and when in a band setting will generally track a single instrument, typically the drummer. The Jambot differs in this respect by analysing the aggregate signal of the whole band.

The following brief discussion of interactive music systems is not intended to be a comprehensive review, but rather to outline a general classification scheme for such systems, with a few key exemplars, so as to situate the Jambot within this field. More detailed surveys of interactive music systems may be found in Rowe (1993; 2001), Dean (2003) and Collins (2006b).

Rowe (1993) describes a multidimensional taxonomy of interactive music systems. One dimension of this taxonomy classifies systems as transformative or generative. Transformative systems transform incoming musical input (generally from the human performer playing an instrument) to produce output. Generative systems utilise techniques of algorithmic composition to generate output. Rowe also discusses a third category of sequencing; however, in this discussion I will consider sequencing as a simple form of generation. This categorisation is somewhat problematic, in that systems may be composed of both transformative and generative elements. Nonetheless it provides a useful launching point for discussion.

Transformative systems have the capacity to be relatively robust to a variety of musical styles. They can benefit from inheriting musicality from the human performer, since many musical features of the input signal may be invariant under the transformations used. A limitation of transformative systems is that they tend to produce output that is either stylistically similar (at one extreme), or musically unrelated (at the other extreme), to the input material.

Generative systems use algorithmic composition techniques to produce output. The appropriateness of the output to the input is achieved through more abstract musical analyses, such as beat tracking and chord classification. Generative systems are able to produce output that has a greater degree of novelty than transformative systems. They are often limited stylistically by the pre-programmed improvisatory approaches, and may not be robust to unexpected musical styles.

Within the class of transformative systems is the subclass of reflexive systems (Pachet 2006). Reflexive systems are transformative in the sense that they manipulate the input music to produce an output. The manipulations that they perform are designed to create a sense of similarity to the input material, but without the similarity being too obvious. Pachet describes reflexive systems as allowing the user to “experience the sensation of interacting with a copy of [themselves]” (ibid:360). Reflexive systems aim to model the style of the input material, for example using Markov models trained on a short history of the input. Reflexive systems enjoy the benefits of transformative systems, namely inheriting musicality from the human input, and so are robust to a variety of input styles. The use of abstracted transformations means that they can produce surprising and novel output whilst maintaining stylistic similarity to the input. Reflexive systems do not, however, perform abstract musical analyses such as beat-tracking.

The Jambot is designed to combine transformative and generative approaches. In this way it hopes to achieve the flexibility and robustness of a transformative system, whilst allowing for aspects of abstract musical analysis to be inserted into the improvisation. The idea is to operate from a baseline of transformed imitation, and to utilise moments of confident understanding to deviate musically from this baseline.

Another limitation of many reflexive and generative systems is that they model music using statistical/mathematical machinery such as Markov chains, neural nets, genetic algorithms and the like. The difficulty with these models is that they do not directly expose salient musical features. This means the parameters of these models do not afford intuitive control over the musical features of their output. The Jambot utilises a representation of musical rhythm that parameterises rhythm space into musically salient features. This way the Jambot’s generative processes may be controlled intuitively.

2.2.1 Examples of systems

In this section I list a few prominent examples of existing interactive music systems. Many more examples can be found in Rowe (1993; 2001), Dean (2003) and Collins (2006b).

M and Jam Factory: Two early interactive music systems, M and Jam Factory, were pioneered by Chadabe & Zicarelli (Zicarelli 1987). These were MIDI based systems that captured input in Markov models based on pitch, duration and loudness. They created real-time output based on the Markov models, with some variability over the generative strategies.

Oscar: Another pioneering system was Peter Beyls’ Oscar, which he describes as “a program with an artificial ear. The program tries to express its own personal character while simultaneously aiming for integration into a larger social whole” (1988:219). The Oscar system takes as input an audio signal and saxophone key data; the analysis is focused on pitch material, and it outputs a MIDI signal.

Band-out-of-a-Box: An interesting example of an improvisational interactive system is provided by Belinda Thom (2000a). Her system, BoB, learns to improvise with another improviser, and attempts to emulate their style. The heart of her system is a variable-sized multinomial mixture model, which gains a knowledge of a given human’s improvisation style by a process of unsupervised training.

B-Keeper: Andrew Robertson & Mark Plumbley (2007) have developed a system called B-Keeper, which is designed as a beat-tracking system for standard drum-kits. It outputs a varying tempo estimate which can be used to control the playback rate of a sample or sequence.

The Continuator: Francois Pachet’s (2002) Continuator is an example of a reflexive system. The Continuator operates by sampling short phrases of input from a MIDI keyboard. It trains a Markov model on this short phrase, and then generates a response from this model. It re-uses the same rhythm as in the original phrase.

Omax: Developed at IRCAM, the Omax system operates in the reflexive paradigm, accompanying a live acoustic musician (Assayag et al. 2006). The Omax system uses the Factor Oracle, a type of statistical learning algorithm, to model the input and produce reflexive output.

2.3 Approaches to Improvisation

A successful performance should balance the predictable with the unpredictable.

Borgo (2002)

The Jambot is an improvisatory agent. Its goal is to provide appropriate improvised accompaniment in an ensemble setting. In order to codify a sense of musical appropriateness (or complementarity) I have drawn upon theories of music perception, discussions of improvisatory approaches by practising musicians, and my own experience as an improvising musician.

The main insight I have utilised is that in improvisation one must strike an appropriate balance between novelty and coherence. In the sections below I elaborate on this idea, and introduce some theories of music perception relating to musical expectations and ambiguity. The gist of the argument I present below is that multiple expectations give rise to ambiguities, and manipulation of the level of ambiguity present in the improvisation provides a mechanism for altering the balance between novelty and coherence.

    2.3.1 Free Improvisation

In examining improvisatory processes in humans I have paid particular attention to free improvisation. Free improvisation refers to the spontaneous production of music with little or no pre-arranged musical structure. Free improvisation has played a prominent role in my practice as a musician, and is of interest due to the importance it places on ensemble interaction. I believe that lessons learnt from free improvisation have value in broader improvisatory contexts.

Thomas Nunn goes as far as to say that free improvisation “is the essence of all forms of improvisation” (1998:7). Whilst this is stating the case more strongly than I would, I nevertheless share the conviction that elements of free improvisation have broad applicability in any improvisatory setting. Moreover, the removal of all structure places these elements in bas-relief and consequently, I suggest, free improvisation provides a valuable arena for study of improvisation in general.

The Jambot improvises freely – it has no prior knowledge of the structure of the music. It operates ‘in the moment’. The temporal scope of the Jambot’s musical analysis is quite short, at most a couple of bars. It has no notion of phrasing, hyper-measure, or form. I do not mean to dismiss the importance of such structures in free improvisation – indeed in my experience they are pivotal to the success of an improvisation. However, at this stage of development the Jambot seeks only to act appropriately in a very localised sense. Higher order temporal structures are a topic for further research.

It may be worth reiterating at this point that the Jambot currently is restricted to percussive output. Whilst pitch-based improvisation is of great interest to me (and was initially part of the design brief for the Jambot), the quality of the polyphonic pitch tracking algorithms developed (described in §5.3) was not sufficient to allow for this. In this sense the Jambot is quite different from many computational improvisatory agents, which have frequently been designed for melodic jazz improvisation (over a known harmonic progression) (Thom 2000b; Biles 2002; Keller et al. 2006).

I have, however, done some experimentation with pitch-based improvisation that did not rely upon pitch-based listening – this was achieved by choosing a key and letting the human performers adapt to the Jambot’s harmonic motions. An example of this can be seen in the accompanying demonstration video RobobongoAllstars.mov.

    2.3.2 Balancing Novelty and Coherence

In my experience of free improvisation there is a constant tension between maintaining a coherent foundational structure and keeping the music interesting. Free jazz saxophonist David Borgo comments:

When a listener (or performer) hears what they expect, there is a low complexity and what is called “coherence” ... and when they hear something unexpected, there is “incoherence” ... this creates a dilemma for improvisers, since they must constantly create new patterns, or patterns of patterns, in order to keep the energy going, while still maintaining the coherence of the piece. Borgo (2004)

Part of the art of improvisation (and composition) is to strike the right balance between coherence and novelty. For the listener to perceive coherent and interesting structure, there must be some element of surprise, but not so much that the listener loses their bearings entirely.

    ‘good’ music ... must cut a path midway between the expected and theunexpected ... If a works musical events are all completely unsurprising... then the music will fulfill all of the listener’s expectations, never be

    20

  • Figure 2.1: Coherence metre

    surprising – in a word, will be boring. On the other hand, if musical eventsare all surprising ... the musical work will be, in effect, unintelligible:chaotic. (Kivy 2002:74)

    The Jambot attempts to strike an appropriate balance between coherence and novelty by maintaining an ongoing measure of the level of coherence in the improvisation. Figure 2.1 shows a whimsical portrayal of a ‘coherence metre’, displaying a real-time measure of coherence. A target level of coherence is set either by a human or by a higher order generative process. The Jambot then takes musical actions to alter the level of coherence to maintain this target.

    In order to model the coherence level of the improvisation, I have utilised notions of ambiguity and expectation. Borgo and Kivy (above) both identify expectations regarding future musical events as a key contributor to the sense of coherence of the improvisation. By creating multiple expectations, a sense of ambiguity can be created, which in turn decreases the coherence level. Conversely by highlighting a single expectation, ambiguity is decreased and coherence increased. In the next sections I discuss some theories from music perception regarding musical expectations, and their relation to notions of musical ambiguity.
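
    By way of illustration, the sketch below models coherence as the inverse of the normalised entropy of the weights assigned to competing analyses, and chooses an action to steer the improvisation towards the target. This is an illustrative sketch only: the weighting scheme and the action names are hypothetical placeholders, not the Jambot’s actual implementation (which is described in later chapters).

```python
import math

# A minimal sketch, not the Jambot's actual code: coherence is modelled
# as the inverse of the normalised entropy of the weights assigned to
# competing analyses. One dominant analysis gives high coherence;
# several equally plausible analyses give high ambiguity, low coherence.

def coherence(weights):
    """Return 1.0 when a single analysis dominates, 0.0 when all are equal."""
    total = sum(weights)
    probs = [w / total for w in weights if w > 0]
    if len(probs) <= 1:
        return 1.0
    entropy = -sum(p * math.log(p) for p in probs)
    return 1.0 - entropy / math.log(len(probs))

def choose_action(weights, target, tolerance=0.1):
    """Choose a (hypothetical) generative action to steer coherence to target."""
    c = coherence(weights)
    if c > target + tolerance:
        return "reinforce several plausible analyses"   # raise ambiguity
    if c < target - tolerance:
        return "strongly reinforce the best analysis"   # lower ambiguity
    return "continue current material"

# Three near-equal metrical hypotheses yield low coherence, so raise it:
print(round(coherence([0.4, 0.35, 0.25]), 2))         # 0.02
print(choose_action([0.4, 0.35, 0.25], target=0.6))   # strongly reinforce ...
```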

    2.3.3 Expectation

    The importance of taking account of the dynamic nature of musical expectations when considering musical experience has been acknowledged in the music theory literature for some time (Lerdahl & Jackendoff 1983; Meyer 1956; Narmour 1990; Bharucha 1993) but has only recently been translated into computational descriptions, and has rarely been the basis for algorithmic music systems. Meyer suggests that affect in music perception can be largely attributed to the formation and subsequent fulfilment or violation of expectations. His exposition is compelling but imprecise as to the exact nature of musical expectations and to the mechanisms of their formation.

    A number of extensions to Meyer’s theory have been proposed, which have in common the postulation of at least two separate types of expectations: structural expectations of the type considered by Meyer, and dynamic expectations. Narmour’s (1990) theory of Implication and Realisation, an extension of Meyer’s work, posits two cognitive modes; one of a schematic type, and one of a more innate expectancy type. Bharucha (1993) also discriminates between schematic expectations (expectations derived from exposure to a musical culture) and veridical expectations (expectations formed on the basis of knowledge of a particular piece).

    Huron (2006) has recently published an extensive and detailed model of musical expectations that builds further on this work. He argues that there are, in fact, a number of different types of expectations involved in music perception, and that indeed the interplay between these expectations is an important aspect of the affective power of the music. Huron extends Bharucha’s categorisation of schematic and veridical expectations, and in particular makes the distinction between schematic and dynamic expectations.

    Dynamic expectations are constantly learned from the local context. Several authors have suggested that these dynamic expectations may be represented as statistical inferences formed from the immediate past (Huron 2006; Pearce & Wiggins 2006). Like Bharucha, Huron argues that the interplay of these expectancies is an integral part of the musical experience.
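
    As a minimal illustration of this statistical view, a first-order Markov model over the immediately preceding events already yields dynamic expectations of this kind. The drum-stroke labels and the model itself are assumptions for illustration; this is neither the Jambot’s listening code nor the more sophisticated models of Huron or Pearce & Wiggins.

```python
from collections import Counter, defaultdict

# An illustrative sketch of a dynamic expectation: a first-order Markov
# model inferred only from the immediately preceding events. The drum
# stroke labels are invented; this is not the Jambot's listening code.

def dynamic_expectation(recent_events, context):
    """Distribution over the next event, inferred from the recent past."""
    transitions = defaultdict(Counter)
    for prev, nxt in zip(recent_events, recent_events[1:]):
        transitions[prev][nxt] += 1
    counts = transitions[context]
    total = sum(counts.values())
    return {event: n / total for event, n in counts.items()} if total else {}

# A repeating kick-hat-snare-hat pattern creates a strong expectation
# of 'hat' after 'kick':
pattern = ['kick', 'hat', 'snare', 'hat'] * 4
print(dynamic_expectation(pattern, 'kick'))   # {'hat': 1.0}
```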

    2.3.4 Ambiguity

    Meyer (1956) identifies ambiguity as a mechanism by which expectations may be exploited for artistic effect. In this context ambiguity refers to musical surfaces that create several disparate expectations. The level of ambiguity in the music creates a cycle of tension and release, which forms an important part of the listening experience in Meyer’s theory. An ambiguous situation creates tension – the resolution of which is part of the art of composition.

    Ambiguity is important because it gives rise to particularly strong tensions and powerful expectations. For the human mind, ever searching for the certainty and control which comes with the ability to envisage and predict, avoids and abhors such doubtful and confused states and expects subsequent clarification. (Meyer 1956:27)

    Temperley notes that ambiguity can arise as the result of multiple plausible analyses of the musical surface:

    Some moments in music are clearly ambiguous, offering two or perhaps several analyses that all seem plausible and perceptually valid. These two aspects of music – diachronic processing and ambiguity – are essential to musical experience (Temperley 2001:205)

    I have been discussing ambiguity as inversely related to coherence. However, the notion of ambiguity has an extra nuance that is worth mentioning. Certainly, an unambiguous (musical) situation should be highly coherent. A high level of ambiguity, however, should not be confused with vagueness; where vagueness implies a lack of any strong suggestion, ambiguity implies a multiplicity of strong suggestions.

    2.3.5 Multiple Parallel Analyses

    The concept of systems of musical analysis that yield several plausible results has been posited by a number of authors as a model of human musical cognition. Notably, Jackendoff (1992:140) proposed the multiple parallel analysis model. This model, which was motivated by models of how humans parse speech, claims that at any one time a human listening to music will keep track of a number of plausible analyses in parallel.

    In a similar vein, Huron (2006) describes the competing concurrent representation theory. He goes further to claim that, more than just a model of music cognition, “Competing concurrent representations may be the norm in mental functioning” (2006:108).

    2.3.6 Ambiguity in Multiple Parallel Representations

    An analysis system that affords multiple interpretations provides a natural mechanism for the generation of ambiguity. In discussing their Generative Theory of Tonal Music (GTTM), Lerdahl & Jackendoff observe that their “rules establish not inflexible decisions about structure, but relative preferences among a number of logically possible analyses” (1983:42), and that this gives rise to ambiguity. In saying this Lerdahl & Jackendoff are not explicitly referencing a cognitive model of multiple parallel analyses; the GTTM predates Jackendoff’s construction of this model, and does not consider real-time cognition processes. Indeed it was considerations of the cognitive constraints involved in resolving the ambiguities of multiple interpretations that led Jackendoff to conclude that the mind must be processing multiple analyses in parallel (Jackendoff 1992).

    Temperley (2001:219) has revisited the preference rule approach to musical analyses in a multiple parallel analyses model:

    The preference rule approach [is] well suited to the description of ambiguity. Informally speaking, an ambiguous situation is one in which, on balance, the preference rules do not express a strong preference for one analysis over another ... At any moment, the system has a set of “best-so-far” analyses, the analysis with the higher score being the preferred one. In some cases, there may be a single analysis whose score is far above all others; in other cases, one or more analyses may be roughly equal in score. The latter situation represents synchronic ambiguity.
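
    The following sketch illustrates Temperley’s ‘best-so-far’ idea: each candidate analysis carries a running preference-rule score, and synchronic ambiguity is flagged when the top scores are roughly equal. The scores and the margin are invented for illustration; this is drawn neither from Temperley’s system nor from the Jambot.

```python
# A minimal sketch of the "best-so-far" idea: each candidate analysis
# carries a running preference-rule score, and synchronic ambiguity is
# flagged when the top scores are roughly equal. Scores and the margin
# are invented for illustration.

def synchronically_ambiguous(scored_analyses, margin=0.1):
    """Return (ambiguous?, preferred analysis) from (name, score) pairs."""
    ranked = sorted(scored_analyses, key=lambda a: a[1], reverse=True)
    (best_name, best_score), (_, runner_up) = ranked[0], ranked[1]
    return (best_score - runner_up) < margin, best_name

analyses = [("4/4, downbeat at t=0", 0.82),
            ("4/4, downbeat at t=2", 0.79),
            ("3/4, downbeat at t=0", 0.41)]
ambiguous, preferred = synchronically_ambiguous(analyses)
print(preferred, ambiguous)   # 4/4, downbeat at t=0 True
```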

    In a similar spirit, Huron (2006:109) argues that multiple parallel analyses (or competing concurrent representations, as he calls them) must all be generating expectations, and consequently must give rise to the kind of expectational ambiguity that was argued above to play a central role in producing musical affect.

    2.4 Musical Pulse

    An important skill in ensemble improvisation is an ability to play in time. Terms relating to musical time, such as beat and tempo, are often explained with reference to a regular pulse. The sections below outline some pulses commonly discussed in the literature: the tactus, the tatum, and the density referent.

    2.4.1 The Tactus

    The notion of a referent pulse level is a common construct in analyses of musical rhythm (whether performed computationally for beat tracking, or manually for musicological analysis), and this referent level is often identified as being the beat, or tactus. The tactus is generally described as being the most salient pulse level (Parncutt 1994), or the ‘toe-tapping’ pulse – the tempo at which a listener would tend to tap their toe, or clap their hands along to (Newman 1995).

    The listener tends to focus primarily on one (or two) intermediate level(s) in which the beats pass by at a moderate rate. This is the level at which the conductor waves his baton, the listener taps his foot, and the dancer completes a shift in weight. Adopting a Renaissance term, we call such level the tactus. (Lerdahl & Jackendoff 1983)

    From a computational standpoint the notion of tactus presents some difficulties, particularly when trying to estimate the tactus in a musical audio stream. The problem stems from the fact that, like metre, the tactus is a perceptual construct, and is not even uniformly perceived between different people. As an example, each year a computational tempo tracking contest is run by the Music Information Retrieval Evaluation eXchange (MIREX) community, and the ground-truth tactus values for the sample audio tracks (which are determined by measuring the rate at which a group of test-subjects tap along to a given piece) are generally given as two values with relative probabilities (corresponding to the top two choices amongst the test-subjects and their relative frequencies). By and large the different choices of tactus are closely related, having tempi in a ratio of 2:1, but nevertheless this highlights the difficulties that can be expected in modelling the tactus when it is defined as a perceptual quality rather than a surface property of the music.

    2.4.2 The Tatum

    An alternative temporal reference pulse dubbed the tatum was suggested by Bilmes (1993) in his study of expressive timing. The tatum, a whimsical contraction of temporal atom in honour of Art Tatum, is the fastest commonly occurring pulse evident in the surface of the music. The term was conceived of in a computational context; Bilmes was modelling microtiming and expressive tempo variations in percussion performances, and needed a quantized unit of time in which to measure the rhythms. This is reminiscent of the step sequencer metaphor prevalent in computer music where the tatum corresponds to the quantization period of the step.

    The tatum is the high frequency pulse or clock that we keep in mind when perceiving or performing music. The tatum is the lowest level of the metrical hierarchy. We use it to judge the placement of all musical events. (ibid:21)

    The tatum is not necessarily constant through the course of a piece of music (ibid:109); the tatum may change to reflect a different predominant level of subdivision within a single piece.
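
    The step-sequencer metaphor admits a short illustrative sketch: onset times are expressed as integer multiples of a tatum period, and the residuals correspond to Bilmes-style microtiming deviations. The onset data and tatum value are invented for illustration; this is code from neither Bilmes nor the Jambot.

```python
# A small sketch of the step-sequencer metaphor: onset times are mapped
# to integer multiples of a tatum period, and the residuals are
# microtiming deviations in the spirit of Bilmes (1993). The onset data
# and tatum value are invented for illustration.

def quantize_to_tatum(onset_times, tatum_period):
    """Map onset times (seconds) to their nearest tatum grid positions."""
    return [round(t / tatum_period) for t in onset_times]

# 120 bpm with a sixteenth-note tatum (0.125 s), slightly expressive timing:
onsets = [0.0, 0.13, 0.25, 0.38, 0.51]
grid = quantize_to_tatum(onsets, 0.125)
microtiming = [round(t - g * 0.125, 3) for t, g in zip(onsets, grid)]
print(grid)         # [0, 1, 2, 3, 4]
print(microtiming)  # [0.0, 0.005, 0.0, 0.005, 0.01]
```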

    2.4.3 Referent Pulse Level

    In their theory of dynamic attention, Jones & Boltz posit a referent pulse level and describe it as an “anchor or referent time level for the perceiver” (1989:470), in terms of which other periodicities are interpreted, either as subdivisions (for faster pulses) or groupings (for slower pulses). The suggestion is that the perception of pulses above and below this reference pulse operates via different cognitive mechanisms:

    [Jones & Boltz’s] temporal perception model relates smaller and larger periods of attention to a middle level that she refers to as the referent time period. The temporal referent anchors our attentional process, and mediates between analytic attending (awareness of local details) and future-oriented attending (awareness of more global processes and goals). (London 2004:18)

    Although Jones’s original exposition does not explicitly mention the tactus, London goes on to identify this referent level as being the tactus:

    In metric terms, the beat or tactus serves as the referent level. Jones’ model thus suggests that beat subdivisions are the product of analytic attending, that is we grasp them as fractions of a larger span. Conversely, larger levels – measures – are yoked to expectations that derive from the referent level such as anticipating that an important event will occur “in two beats” or “every three beats” (London 2004:18)

    Some doubt is cast on this identification by an examination of Western African percussion music, in which the tactus does not function in the same manner as in Western music (Arom 1991:206); if there were a fundamental split in the perceptual processes above and below the tactus, then one might expect the tactus to function universally across musical cultures.

    2.4.4 Pulsation

    Analysts of Western African music have found a more useful concept variously known as the Common Fast Beat (Kauffman 1980:396), density referent (Hood 1971:114) or pulsation (Arom 1991:202) which plays the role of a referent pulse level in terms of which all other musical events are measured.

    Pulsations are an uninterrupted sequence of reference points with respect to which rhythmic flow is organised. All the durations in a piece ... are defined in relationship to the pulsation (Arom 1991:202)

    It seems almost undebatable that most African rhythms can be related to a fast regular pulse. Density Referent seems to be the term that is increasingly used to identify these fast pulses ... Musical scholars are probably more aware of the density referent than are performers of the music, but the absolute accuracy of rhythmic relationships in performance seems to attest to at least an unconscious measuring of fast, evenly paced units. (Kauffman 1980:407)

    The density referent seems more likely to be amenable to computational estimation than the tactus as it is not a perceptual quality, but relates to the surface properties of the music:

    the density referent ... can be used to study and understand temporal elements that would be rendered ambiguous to more subjective concepts of beat. (Kauffman 1980:396)

    The pulsation may differ from the tatum. The idea of the tatum is that it is the fastest pulse which actually occurs in the surface of the music; and that all other pulses (and inter-onset-intervals) in the music are some multiple of the tatum – it is fundamentally the unit of time in which the music is measured. The pulsation similarly sets up a grid in terms of which all other musical events are to be measured. However, unlike the tatum, the pulsation is not necessarily the fastest pulse – subdivisions of it may appear in the surface of the music. Whilst the tatum may change during a piece to reflect a different predominant subdivision (Bilmes 1993:109), the pulsation should remain constant (Arom 1991:202).

    2.5 Musical Metre

    Musical metre is frequently described as the pattern of strong and weak beats in a musical stream (Cooper & Meyer 1960; Lerdahl & Jackendoff 1983; Large & Kolen 1994). From the point of view of music psychology metre is understood as a perceptual construct, in contrast to rhythm which is a phenomenal pattern of accents in the musical surface (Lerdahl & Jackendoff 1983; London 2004).

    Metre is inferred from the surface rhythms, and possesses a kind of perceptual inertia. In other words, once established in the mind, a metrical context tends to persist even when it conflicts with the rhythmic surface, until the conflicts become too great:

    Once a clear metrical pattern has been established, the listener renounces it only in the face of strongly contradicting evidence (Lerdahl & Jackendoff 1983:17).

    The Jambot implements perceptual inertia for its beat tracking (§6.4.6) and metre induction (§6.8.3).
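
    A minimal sketch of perceptual inertia, construed as hysteresis in hypothesis switching, is given below: the established metrical hypothesis is only renounced when a rival’s evidence exceeds it by a substantial margin. The margin value and the scores are assumptions for illustration; the Jambot’s actual beat tracking (§6.4.6) is considerably more involved.

```python
# A minimal sketch of perceptual inertia as hysteresis: the current
# metrical hypothesis is only abandoned when a rival's evidence exceeds
# it by a substantial margin. The margin and scores are illustrative.

def update_hypothesis(current, scores, switch_margin=0.3):
    """Keep the established hypothesis unless a rival strongly outscores it."""
    best = max(scores, key=scores.get)
    if best != current and scores[best] > scores[current] + switch_margin:
        return best    # strongly contradicting evidence: renounce the metre
    return current     # otherwise the established metrical context persists

print(update_hypothesis('4/4', {'4/4': 0.55, '3/4': 0.70}))   # stays '4/4'
print(update_hypothesis('4/4', {'4/4': 0.30, '3/4': 0.70}))   # switches to '3/4'
```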

    Cooper & Meyer define metre as “the measurement of the number of pulses between more or less regularly recurring accents” (1960:4) whilst Yeston describes metre as “an outgrowth of the interaction of two levels — two differently rated strata, the faster of which provides the elements and the slower of which groups them” (1976:66).

    A related but more elaborate description of metre is given by Lerdahl & Jackendoff (1983) in their Generative Theory of Tonal Music (GTTM). They propose a representation of metre which reflects a hierarchy of timescales. In this representation, a beat of any given level is assigned a strength according to the number of levels in the hierarchy that contain this beat.
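
    This hierarchical representation admits a compact illustration: if each metrical level is modelled, as a simplification, by its period in tatum units, then the strength of a beat is simply the number of levels whose grid contains it. The sketch below is illustrative only, and is neither the GTTM formalism nor the Jambot’s internal representation.

```python
# An illustrative sketch: if each metrical level is modelled, as a
# simplification, by its period in tatum units, then a beat's strength
# is the number of levels whose grid contains it.

def beat_strength(position, level_periods):
    """Count the metrical levels containing the beat at this position."""
    return sum(1 for period in level_periods if position % period == 0)

# 4/4 with a sixteenth-note tatum: levels of 1, 2, 4, 8 and 16 tatums.
levels = [1, 2, 4, 8, 16]
print([beat_strength(p, levels) for p in range(16)])
# [5, 1, 2, 1, 3, 1, 2, 1, 4, 1, 2, 1, 3, 1, 2, 1]
```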

    2.5.1 Metre as an Expectational Framework

    Within the music perception field metre is generally considered as an expectational framework against which the phenomenal rhythms of the music are interpreted (London 2004; Huron 2006). Jones (1987), for example, argues that metre should be construed as a cognitive mechanism for predicting when salient musical events are expected to happen. This description of metre has been widely accepted within the music psychology community (Huron 2006; Large 1994; London 2004).

    Metrical structure provides listeners with a temporal framework upon which to build expectations for events (Large & Kolen 1994)

    London adopts this view of metre, and develops it further to a description of metre similar to the GTTM in that it involves a hierarchy of pulse levels, but different in that it takes the tactus pulse as the primary reference level, and posits different cognitive mechanisms for the perception of levels above the tactus (grouping) and levels below the tactus (subdivision) as discussed in §2.4.3.

    London’s model of metre is rooted in Jones & Boltz’s (1989) theory of dynamic attention. This theory contends that a listener’s attention varies through time according to the metre, so that attention is greatest at metrically strong points. The Jambot implements a similar form of attentional modulation to help filter spurious onsets (§6.7).
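
    The sketch below illustrates one plausible form such attentional modulation could take: detections at metrically weak positions must clear a higher energy threshold than those at strong beats. The threshold scheme is an assumption for illustration, and is not the Jambot’s actual filter (§6.7).

```python
# An illustrative sketch of attentional modulation for onset filtering:
# detections at metrically weak positions must clear a higher energy
# threshold. The threshold scheme is invented, not the Jambot's filter.

def accept_onset(energy, position, level_periods, base_threshold=1.0):
    """Accept an onset if it clears a metrically modulated threshold."""
    strength = sum(1 for period in level_periods if position % period == 0)
    attention = strength / len(level_periods)     # highest at strong beats
    return energy >= base_threshold * (1.5 - attention)

levels = [1, 2, 4, 8, 16]
print(accept_onset(0.9, 0, levels))   # strong beat: True  (threshold 0.5)
print(accept_onset(0.9, 3, levels))   # weak position: False (threshold 1.3)
```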

    Huron similarly treats metre as a distribution of expectational strengths across beat-classes, and claims that metric strength correlates with the probability of an onset (2006:179). Huron’s model of music perception is premised on the notion that much (if not all) of our musical understanding is derived from learned expectations due to long exposure to statistical regularities in music that we listen to.

    The description of metre as an interaction between two (or more) isochronous pulses is not suitable for some styles of music. The recognition of non-isochronous pulses has become common in more recent writings in music perception (Huron 2006; London 2004). London insists that the unequal beat periods must be rationally related, and indeed that such pulses arise as uneven groupings of a faster isochronous pulse (2004:100). However, Moelants (1999) observes that Bulgarian dance music contains non-isochronous pulses with irrational ratios of duration, and Huron (2006:188) suggests the same of Viennese Waltzes.

    Another example of possibly irrational subdivisions of a pulse is the case of swing. Although swing is often described as arising from a triplet subdivision of the tactus, actual swing timings occupy a spectrum, and may subdivide the tactus arbitrarily (but consistently) (Lindsay & Nordquist 2007).

    Both Huron and London subscribe to the view of metre as an expectational framework, in which temporal positions of beats need only form some predictable pattern. Isochrony yields a particularly simple predictive mechanism. For London this is the only mechanism at work in metre perception, but for Huron any (even irrational) subdivision of the bar provides a predictable template given sufficient exposure.

    The general view of metre as a predictive template fits a wide variety of musical styles, encompassing both metres that arise out of isochronous pulses, and those that are better described as an irrational (but consistent) subdivision of a single isochronous pulse (Collins 2006a:29).

    The target musical genres for this project are popular contemporary Western dance styles, such as Electronic Dance Music (EDM), jazz, funk and pop. The (less general) description of metre as arising from a referent pulse level, and a periodic patterning of this pulse into strong/weak beats, appears to be broadly applicable in these styles. The Jambot implements this view of metre. However, conceiving of metre as an expectational framework suggests a mechanism for manipulating ambiguity in improvisation.

    2.6 Utilising Metrical Ambiguity

    The view of metre as an expectational framework creates the possibility for manipulating the level of metric ambiguity as an improvisatory device. As discussed in §2.3.2 the Jambot seeks to maintain a target level of ambiguity in the ensemble improvisation. The Jambot does this by simultaneously reinforcing a multiplicity of metrical possibilities when it believes the coherence level to be too high. The multiplicity of expectations creates ambiguity, which decreases the coherence. Conversely if the coherence level is assessed to be too low (i.e. the improvisation has become too chaotic), then the Jambot will strongly reinforce the most plausible metric possibility to lower the ambiguity level.

    The Jambot achieves this via two different mechanisms. One mechanism makes use of different rhythmic variables (timbre, dynamics, duration) to give disparate signals regarding the underlying metre – this technique is discussed in §7.3.7. The second mechanism utilises a generative process that I have dubbed the Chimaera, which will be described in §8.3.10. The Chimaera draws upon the multiple parallel analyses of metre performed by the Jambot, and seeks to take musical actions to manipulate the level of ambiguity in the music by selectively highlighting particular analyses.
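
    A hedged sketch of this analysis-selection idea follows: to raise ambiguity, the next musical action reinforces an analysis sampled in proportion to its plausibility; to lower ambiguity, it consistently reinforces the single most plausible analysis. The sampling scheme is an assumption for illustration, not the Chimaera’s actual mechanism (§8.3.10).

```python
import random

# A hedged sketch of analysis selection: to raise ambiguity, sample an
# analysis to reinforce in proportion to its plausibility; to lower it,
# always reinforce the single most plausible analysis. The sampling
# scheme is an assumption, not the Chimaera's actual mechanism.

def select_analysis(weighted_analyses, raise_ambiguity):
    """Pick which metrical analysis the next musical action reinforces."""
    names = list(weighted_analyses)
    if raise_ambiguity:
        weights = [weighted_analyses[n] for n in names]
        return random.choices(names, weights=weights)[0]
    return max(names, key=weighted_analyses.get)

hypotheses = {'4/4 downbeat t=0': 0.5, '4/4 downbeat t=2': 0.4, '3/4': 0.1}
print(select_analysis(hypotheses, raise_ambiguity=False))  # 4/4 downbeat t=0
```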

    Chapter 3

    Methodology

    3.1 Introduction

    This research project consisted of designing, implementing and evaluating an interactive music system according to a design research methodology (§3.5.1). In this chapter I describe the mechanics of this process, the theoretical perspective and epistemology framing the research, and argue that this framework is a valid approach to investigating the research questions posed.

    The design brief for the project was to create a robust concert-ready interactive music system that could listen to a raw audio stream, and produce real-time improvised rhythmic accompaniment. This research is formally categorised as creative practice research. The practice is computational instrument building, and the creative output is the Jambot itself. In this chapter I argue that:

  • The Jambot, as a complete operational interactive music system, in and of itself represents a substantial contribution to knowledge in the engineering domain.

  • Additionally, the process of developing and evaluating the Jambot has generated knowledge contributions in the science domain and the creative arts domain.

  • The Jambot as a computational artefact is a mechanism for the generation of further knowledge in both science and creative arts.

    This project has been interdisciplinary, straddling science, engineering and creative arts, with new knowledge generated in all three domains. Traditionally these different domains have adopted quite disparate epistemologies. In this chapter I argue that a particular form of pragmatist epistemology (§3.3.2), known as Ecological Epistemology (§3.3.3), provides a view of knowledge compatible with all three of these domains.

    Theories of knowledge are important to this project for several reasons, beyond purely methodological concerns. This research delves into artificial intelligence and cognitive psychology. Both of these fields model structural aspects of knowledge. I have adopted a theoretical perspective on knowledge called Situated Cognition (§3.4.3). This perspective gives a structural description of knowledge compatible with the abstract mandates of Ecological Epistemology. The architecture of the Jambot has been designed according to the principles of Situated Cognition.

    3.2 Research Framework

    I am adopting Michael Crotty’s (1998) framework for discussing the research methods. He delineates a hierarchy of four components: Epistemology, Theoretical Perspective, Methodology and Method. Crotty (1998:3) defines these terms as follows:

  • Epistemology: the theory of knowledge embedded in the theoretical perspective and thereby in the methodology.

  • Theoretical Perspective: the philosophical stance informing the methodology and thus providing a context for the process and grounding its logic and criteria.

  • Methodology: the strategy, plan of action, process or design lying behind the choice and use of particular methods and linking the choice and use of methods to the desired outcomes.

  • Method: the techniques or procedures used to gather and analyse data related to some research question or hypothesis.

    This framework is depicted in Figure 3.1, together with the particular approaches employed in this project. The approaches I have used are:

  • The epistemology that I have adopted is Ecological Epistemology (§3.3.3) – in which knowledge is viewed as an interaction between the knower and the known.

  • The theoretical perspective I employ is the Brunswikian (§3.4.1) perspective of Situated Cognition (§3.4.3), which views perception as an interaction between the perceiver and the environment.

    • The methodology is Design Research (§3.5.1).

  • A number of methods were employed including reflective practice, software design, computational modelling, interaction design and algorithmic composition.

    In the sections below I describe each of these components, which together form the theoretical framework for this research. The most important conceptual thread running through all of these components is an emphasis on the importance of context.

    Figure 3.1: Theoretical Framework – adapted from (Crotty 1998:4). [The figure arranges the four components as a hierarchy, mapping epistemology to Ecological Epistemology; theoretical perspective to Situated Cognition; methodology to Design Research; and methods to Software Design, Reflective Practice, Algorithmic Composition, Computational Modelling and Interaction Design.]

    3.3 Epistemology

    Epistemology is the study of knowledge. Epistemology is pivotal to any research project, as it justifies why the research methods used can lead to credible new knowledge. Epistemology is important for this project in a number of respects. The research goals of this project include finding new insights into both artificial intelligence and human cognition. In both of these fields the study of knowledge is a central concern of the discipline. So beyond methodological concerns, theories of knowledge play a mechanistic and structural role in this work.

    The nature of this research project is interdisciplinary. On the one hand there are contributions to knowledge in a variety of scientific fields – computational auditory scene analysis, artificial intelligence, psychology. On the other hand this project is firmly rooted in the creative arts, with knowledge contributions in interactive music systems, applied music theory, and the phenomenology of human-machine interaction. This interdisciplinary nature has highlighted some epistemological issues since the sciences and the creative arts frequently take differing epistemological stances.

    3.3.1 Episteme, Techne and Praxis

    Aristotle identified several categories of knowledge, including episteme, techne and phronesis. Flyvbjerg (2001:57) summarises these categories as:

  • Episteme: Scientific Knowledge. Universal, invariable, context-independent. Based on general analytic rationality. The original concept is known today from the terms ‘epistemology’ and ‘epistemic’.

  • Techne: Craft/Art. Pragmatic, variable, context-dependent. Oriented towards production. Based on a practical instrumental rationality governed by a conscious goal. The original concept appears today in terms such as ‘technique’, ‘technical’ and ‘technology’.

  • Phronesis: Ethics. Deliberation about values with reference to praxis. Pragmatic, variable, context-dependent. Oriented towards action. Based on practical value-rationality.

    These categories of knowledge were all regarded by Aristotle as “intellectual virtues, of which epistemic science, with its emphasis on theories, analysis, and universals was but one, and not even the most important” (Flyvbjerg 2001:57).

    Corresponding to these three categories of knowledge Aristotle described three basic intellectual activities of mankind as theoria, poiesis and praxis (Hickman 1992:99). Modern usages of these terms have sometimes blurred the distinction between the intellectual activity of praxis and the corresponding intellectual virtue of phronesis (Flyvbjerg 2001; Callaos 2011; Greenwood & Levin 2005:51). I will follow this convention and adopt a trifold categorisation of knowledge into episteme, techne and praxis.

    Science and Episteme

    Mainstream contemporary thought in science holds episteme to be the only legitimate form of knowledge (Goldman 2004). From this perspective the products of technology, whilst vital to the actual practice of science, are not considered themselves to constitute knowledge (McCarthy 2006). Goldman (2004) argues that this perspective is the result of a long history of philosophical prejudice towards Platonic ideals equating rationality with high culture. Dewey suggests that this prejudice is primarily historical:

    If one could get rid of one’s traditional logical theories and set to work afresh to frame a theory of knowledge on the basis of the procedure of the common man, the moralist, and the experimentalist, would it be the forced or the natural procedure to say that the realities which we know, which we are sure of, are precisely those realities that have taken shape in and through the active procedures of knowing? (1973:213)

    Engineering and Techne

    The nascent field of the Philosophy of Engineering rejects the primacy of epistemic knowledge, and argues that techne should be considered as equally legitimate. The knowledge categories of episteme, techne, and praxis are also sometimes described in terms of Ryle’s (1949) distinction between know-that and know-how, and Polanyi’s (1974) notion of tacit knowledge respectively (Callaos 2011). McCarthy suggests that engineering know-how should not be considered as purely a mechanism for enabling scientific know-thats:

    engineering can be seen as delivering knowledge by a much more direct route than by aiding science. There is a useful distinction in philosophy between ‘knowing that’ and ‘knowing how’ ... Engineering is ‘know-how’ ... [and] as a consequence yields highly successful knowledge about how to control materials and processes to bring about desired results. It is a way of getting to the nature of things – a voyage of discovery as much as science is. Hence engineering provides a useful case study for philosophers inquiring about the status of human knowledge. (2006)

    Goldman (2004) characterises the distinction between episteme and techne thus: episteme is fundamentally concerned with the necessary whilst techne is fundamentally concerned with the contingent, and locates engineering knowledge in the realm of techne:

    Engineering problem solving employs a contingency based form of reasoning that stands in sharp contrast to the necessity based model of rationality that has dominated Western philosophy since Plato and that underlies modern science. The concept ‘necessity’ is cognate with the concepts ‘certainty’, ‘universality’, ‘abstractness’ and ‘theory’. Engineering by contrast is characterised by wilfulness, particularity, probability, concreteness and practice. The identification of rationality with necessity has impoverished our ability to apply reason e