
  • CIRCUMFUSION: A COMPOSITION FOR

    REAL-TIME COMPUTER MUSIC

    SPATIALIZATION SYSTEM

    DISSERTATION

    Presented to the Graduate Council of the

    University of North Texas in Partial

    Fulfillment of the Requirements

    For the Degree of

    DOCTOR OF PHILOSOPHY

    By

    Christopher Robert Morgan, BM, MM

    Denton, Texas

    August, 1998

    3 7 f "8U

    Mo, V 70/

  • Morgan, Christopher Robert, Circumfusion: A Composition for Real-time

    Computer Music Spatialization System. Doctor of Philosophy (Composition), August,

    1998, 109 pp., 10 tables, 11 illustrations, 27 examples, references, 45 titles.

    Two of the leading methodologies for spatializing audio over multiple channels

    include non-real-time multi-track tape and variations of real-time systems that often

    involve complex configurations of hardware. Of the latter, composers relying on MIDI

    as a control source have used pairs of sound modules, effects units and automation-

    capable mixers to achieve spatialization over four loudspeakers. These systems typically

    employ intensity panning, Doppler shifts and reverberation.

    The present research details the development of a compact spatialization system

    using a MAX patch controlling a Kurzweil K2500 sampler. This system supports real-

    time diffusion of up to six simultaneous sound files over eight loudspeakers while

    incorporating intensity panning, Doppler shifts, delays and filtering. The MAX patch

    allows composers to choose from several automatic sound spatialization trajectories or to

    use the mouse to draw and store their own trajectories for later playback. The piece,

    Circumfusion, is an eighteen-minute composition of electroacoustic music utilizing this

    spatialization system.


  • TABLE OF CONTENTS

    Page

    LIST OF TABLES

    LIST OF FIGURES

    LIST OF EXAMPLES

    Chapter

    1. INTRODUCTION 1

    Overview
    Terminology
    Background of Music Spatialization
    Electroacoustic Works Employing Spatialization
    Sound Diffusionists
    Summary

    2. PSYCHOPHYSICS OF HUMAN LOCALIZATION 15

    Localization Cues
    The Head Related Transfer Function and Vertical Localization
    Moving Sound Sources
    Distance Cues
    Summary

    3. CURRENT SOUND SPATIALIZATION TECHNIQUES 26

    Multi-loudspeaker and Multitrack Systems
    Diagonal Crossfades
    Four Multichannel Spatialization Systems
    HRTF Spatialization Systems
    Summary

    4. PREVIOUS COMPOSITIONS UTILIZING EIGHT LOUDSPEAKER SYSTEMS 41

    Octophonic Etude I: Circles (1993)
    Octophonic Etude II: Hsu, Waiting (1995)
    K2000 Programs, Layers and Algorithms


  • Octophonic Etude III for flute & Live Computer Music (1996)
    Gallery Sound Installation (1997)
    Automatic Spatialization and Pre-recorded Trajectories

    5. CURRENT RESEARCH AND THE COMPOSITION, CIRCUMFUSION 54

    Research Goals
    Rotating Sound Fields
    Current Research: MAX Patches
    Diffuser Patcher
    HowMove Patcher
    Circle Generator
    Bouncer Patcher
    onOwn Patcher
    MouseTracker Patcher
    RouteMaker Patcher
    AutoSwitcher Patcher
    ctl2radii Patcher
    Doppler Patcher
    Octo Patcher
    Speakers Patcher
    Locations Patcher
    Loudspeaker Patcher
    Vertical Spatialization
    Pitch and Program Selection
    EventRouter Patcher
    MyPlot Patcher
    K2500 Programs
    Circumfusion: The Composition

    APPENDIX A

    APPENDIX B

    APPENDIX C

    REFERENCES


  • LIST OF TABLES

    Table        Page

    1. Four permutations of multitrack and multi-loudspeaker setups 5

    2. Interaural Phase Differences for selected frequencies 17

    3. Ratios for adjacent versus diagonal distances 34

    4. Four multichannel spatialization systems 36

    5. MIDI channel mapping for Octophonic Etude I. 42

    6. MIDI channel mapping for K2000 43

    7. MIDI Pan values and locations 45

    8. K2500 Output panning

    9. Automatic spatialization groups

    10. Soundfiles used in Circumfusion 94

  • LIST OF FIGURES

    Figure        Page

    1. Path length differences from sound event to each ear 16

    2. Cone of confusion

    3. Line drawing of the ear, after Keidel W.D. et al 19

    4. Inverse Square Law 22

    5. Local and global reverberation 24

    6. Intensity panning 27

    7. Four-Channel Intensity panning 28

    8. Four- and Eight-Loudspeaker distances 34

    9. MEIT loudspeaker setup

    10. MIDI pan limitations 44

    11. Virtual loudspeaker coordinates on a computer screen 51


  • LIST OF EXAMPLES

    Example        Page

    1. Six virtual sound sources represented as greyscale dots 53

    2. Diffuser Patcher

    3. Mover Patcher

    4. CircleMaker Patcher 62

    5. Bouncer Patcher 64

    6. onOwn Patcher 65

    7. MouseTracker Patcher 66

    8. RouteMaker Patcher 67

    9. AutoSwitcher Patcher 69

    10. ctl2radii Patcher

    11. 2D table for generating two circles 71

    12. Doppler Patcher

    13. Octo Patcher

    14. Speakers Patcher

    15. Locations Patcher

    16. Loudspeaker (LS) Patcher 77

    17. Vertical Patcher

    18. EventRouter Patcher

    19. NoteMaker Patcher 80

    20. Program Change Patcher

    21. MyPlot Patcher

    22. Plotter Patcher


  • 23. MyLCD Patcher 83

    24. Galaxy+ Editor's picture of K2000/K2500 Program 84

    25. Galaxy+ screen of layer one 87

    26. Function generator in the K2500 89

    27. LFO rate page 90


  • CHAPTER I

    INTRODUCTION

    Overview

    The present dissertation, Circumfusion, is a single-movement computer music

    composition for real-time sound diffusion system. The piece has an eighteen-minute

    duration and is in two sections. The goal of the dissertation is to explore the creation of

    the real-time sound diffusion system. This system is comprised of a software application

    developed in the MAX environment and the Kurzweil K2500 sampler. The software

    allows composers to control the spatial location of soundfiles reproduced by the K2500

    over eight loudspeakers. To increase the effectiveness of the spatialization, several

    aspects of human localization were taken into account. These localization cues were

    implemented whenever the design of the K2500 algorithms permitted the use of

    appropriate delays and digital signal processing filters.

    This introduction provides an explanation of pertinent terminology as well as a

    brief overview of sound spatialization practices in music of the past and present. The

    following chapters furnish information on localization cues, current research, early work

    in eight-channel systems and the development of the software used for the present

    composition, Circumfusion. The final chapter concludes with a description of the timbres

    and compositional choices made during the creation of this piece.

  • Terminology

    Computer music and sound spatialization practitioners often use specialized and

    unfamiliar terms. Many terms occur in pairs of classifications that delineate one set of

    circumstances from another. To clarify their usage in this dissertation, the following

    section contains definitions and explanations of these terms.

    The act of determining the location of a sound is usually performed

    unconsciously. When discussing this process, it is helpful to differentiate between the

    perception and the object of perception. The expression "sound event" refers to an

    acoustic event. The expression "audio event" refers to the human perception of that

    event.1 A sound event might not coincide with the perception of that event because of

    limitations in the perceptual mechanisms and abilities of the observer. Other

    psychoacoustic terms illustrate this point. For example, amplitude is a physical quantity

    and often differs from the human perception of loudness. In the same way, the

    perception of pitch does not always agree with a sound's measured frequency. Research

    into human sound localization is constantly comparing the location of a sound event with

    a human test subject's perception of that sound event's location.

    In addition to the conscious awareness of an audio event and its spatial location,

    test subjects also demonstrate unconscious reactions to a sound event's location. One of

    these, nystagmus, is the subtle movement of the eyes towards the perceived location of a

    sound. By measuring these eye movements, researchers gather objective evidence of

    sound localization and depend less on the ability of the subjects to articulate perceptions

    of audio events. In this respect, nystagmus is an important aspect of the autonomic

    responses to rotating sound fields and is discussed later in this dissertation.

    1Jens Blauert, Spatial Hearing: The Psychophysics of Human Sound Localization (Cambridge, Massachusetts: MIT Press, 1997), 2.

  • "Localization" is the ability to determine the origins of a sound event in the free

    field. This "free field" describes sound events originating from loudspeakers or from

    acoustic sources spatially removed from the listener. In contrast, the term "lateralization"

    specifies localization in a "closed field" setup such as headphones. The term

    "spatialization" refers to the process of adding a spatial component to sound events. This

    involves either providing the listener with localization cues or physically placing the

    source of the sound events in different locations. "Auralization" is the process of

    providing the necessary psychophysical cues to recreate the "sound" of a listening

    environment, e.g., a room, concert hall, stadium, etc.2

    Computer music practitioners use the terms "real-time" and "non real-time" in a

    variety of ways. These uses depend on whether or not a system generates

    results. In most instances, the term real-time indicates the ability to produce a desired

    output immediately upon receiving input. In order to generate this real-time output, it is

    often necessary to adjust parameters in non real-time. Since real-time systems are made

    up of non real-time elements, it is necessary to clarify which qualities of a system are

    real-time and which are not. In this dissertation, real-time refers to the ability to change

    the spatialization of a sound event while it is sounding. This excludes common intensity

    panning between two loudspeakers. This dissertation covers two types of non real-time

    spatialization. The first is the process of rendering multiple soundfiles, recording these

    files to multi-track tape and later playing the tape over multiple loudspeakers. The

    second type is 3-D sound signal processing that is too CPU-intensive for the computer to

    perform while it is receiving the sound as input. This processing must be done offline to

    create a complete soundfile for later playback.

    2Durand R. Begault, 3-D Sound For Virtual Reality and Multimedia (Boston: AP Professional, 1994), 180.

  • Similarly, the term "stereo" often has different meanings as well. While the term

    "stereophonic" still connotes the use of time delays that simulate human binaural hearing,

    the term "stereo" has evolved to include all systems containing two loudspeakers. In the

    latter sense, a monophonic signal reproduced over two loudspeakers would qualify as

    "stereo" even though it provided no spatial depth to the sound events. In this dissertation,

    "stereo" will take on its original meaning and imply the presence of directional cues.

    The final area of terminology is in the various combinations of the terms

    "multitrack", "multichannel" and "multi-loudspeaker". Since "stereo" usually means two

    loudspeakers, multi-loudspeaker designates more than two loudspeakers (commonly

    multiples of two). Similarly, multitrack tape designates more than two tracks. Multitrack

    tape pieces are commonly on four- or eight-track tape. A composer could choose to route

    each track to a different loudspeaker or choose to mix all of the tracks to two

    loudspeakers. Conversely, it is the practice of present-day sound diffusionists to perform

    their two-track stereo compositions over many independently amplified loudspeakers. In

    either case, reproducing sounds from a stereo or multitrack tape3 over a stereo or multi-

    loudspeaker playback system requires a "multichannel" mixing console. On these

    consoles "channels" refer to both inputs and outputs. Almost all mixers have anywhere

    from two to forty input channels and many commonly have two, four or eight output

    channels. These output channels, or "busses", are the means by which performers route

    input signals from tape decks or computers to amplifiers and loudspeakers.

    When discussing sound spatialization over multiple loudspeakers, it is important

    to distinguish between the ability to spatialize multiple unique sound events as a stereo

    group and the ability to spatialize independent sound events separately. One of the goals

    3The term "multitrack tape" includes multiple output computer soundcards that function as "virtual" multitrack tape machines.

  • of this dissertation is to allow independent spatialization of up to six simultaneous sound

    events over eight loudspeakers.

    Four permutations are possible involving two-channel stereo/multitrack (input

    stage) and two-channel stereo/multi-loudspeaker playback (output stage). A mixing

    console illustrates these differences with its fixed number of input and output channels.

    For example, using a mixer, one can duplicate the signals on a two-track tape so that they

    appear at each of the eight independent output channels. Conversely, one can mix a 16-

    track tape down to two output channels.

    Number of Tracks      Number of Loudspeakers       Abbreviation
    2 (stereo)            > 2 (multi-loudspeaker)      SM
    > 2 (multitrack)      > 2 (multi-loudspeaker)      MM
    2 (stereo)            2 (stereo)                   SS
    > 2 (multitrack)      2 (stereo)                   MS

    Table 1: Four permutations of multitrack and multi-loudspeaker setups.

    These permutations do not include differences in the control of the sound spatialization.

    A musician diffusing a tape composition might not have control over the spatialization of

    individual tracks or sound events. For example, even though a composition may contain

    four layers, if the composer "flattens" these four layers by recording them onto two-

    channel tape, then it is no longer possible to spatialize them independently. In most

    instances, composers wait until later in the compositional process before mixing and

    combining sound material. By waiting, they preserve the ability to spatialize the

    individual layers of a composition. This also holds true for spatialization in live

    performance.

  • Background of Music Spatialization

    The following section provides a brief historical overview of spatialization in

    music of the past by citing a selected number of works and discussing their significance.

    In order to trace quickly the evolution of the ideas of sound location and movement, the

    author has chosen only a single example from each century. Therefore, this selection is

    not inclusive of all examples of spatialization in music.

    The beginnings of sound spatialization - as an integral element of a composition -

    appear as early as the 16th century with the music of the cori spezzati. This practice of

    composing for two spatially separated choirs originally developed from psalmody. In

    1550, Adrian Willaert's salmi spezzati for double choir inspired a new period of

    popularity for this genre. These pieces, often written for special occasions, were simpler

    in structure and harmonic design than their non-polychoral counterparts. This was due,

    in part, to the increase in difficulty and the composition's shift of interest from the

    harmonic to the spatial.

    Two works, separated by almost a century, mark important moments in the

    evolution of the polychoral style. In both cases, the number of singers and spatially

    separated choirs is noteworthy. The first is Thomas Tallis' Spem in alium (1578), a 40-

    voice motet, written for eight five-voice choirs. While the exact motivation for

    composing this unprecedented work is not known, there is speculation that it was perhaps

    for the celebration of Queen Elizabeth's 40th birthday.4 The second piece is the 53-voice

    Missa salisburgensis (1682) now attributed to either Heinrich Biber or Andreas Hofer.5

    4Dorothy Regina Holcomb, "Tallis, Thomas," The New Grove Dictionary of Music and Musicians, 20 vols., ed. Stanley Sadie. (London: Macmillan, 1980), XVIII, 544.

    5Elias Dann, "Biber, Heinrich Ignaz Franz von," The New Grove Dictionary of Music and Musicians, 20 vols., ed. Stanley Sadie (London: Macmillan, 1980), II, 681.

    Denis Taylor describes the simple character of this piece in his article on "Cori spezzati"

    in the New Grove Dictionary of Music and Musicians.

    The performers are divided into eight groups, all constituted differently. The chordal structures are extremely simple, and the whole depends strongly on the element of surprise provided by the spatial separation.6

    The fixed location of each group of singers is an important characteristic of the

    Venetian music of composers such as Willaert and Andrea Gabrieli. Spatial separation,

    more so than changes in location or distance, played the primary role. Composers

    employed spatialization as a special effect rather than as a fundamental compositional

    principle. These effects augmented the dialog7 of the text but were not used as musical

    themes or motives intended for independent development. This trend continued past the

    music of the cori spezzati and later examples of spatialization until the development of

    electronic music in the twentieth century.

    A nineteenth-century instrumental example is the use of the on-stage and off-

    stage shepherd horn calls in the third movement of Berlioz's Symphonie Fantastique

    (1830) entitled "In the Country". In this case, the oboe and English horn — representing

    the shepherd horns - are involved in a dialog augmented by their spatial separation. The

    off-stage oboe has a different timbre due to its location Derrière la scène.8 This spectral

    change is symbolic and is not as dramatic as what would occur naturally, just as the

    instruments are not intended to be thought of as "real" shepherd horns. The spatial

    separation of the two soloists, as part of the musical narrative, does not change as a

    function of dramatic developments during the section. The two shepherds remain

    6Denis Taylor, "Cori spezzati," The New Grove Dictionary of Music and Musicians, 20 vols., ed. Stanley Sadie (London: Macmillan, 1980), IV, 776.

    7By "dialog" I refer to a musical echo of ideas, call and response, responsorial, or operatic duet.

    8Hector Berlioz, "Symphonie Fantastique," ed. Paul Henry Lang (New York: W.W. Norton, 1969), 190.

  • separated by the same distance until the end of the section in bar 20, at which point the

    oboist is directed to return to the orchestra. In one sense it is logical for the two

    shepherds to remain separated since a "real" shepherd would not change position while

    conversing with another shepherd some distance away. However, considering the

    fantasy-like nature of this symphony, this argument would be a tautology. An

    opportunity for presenting the formerly separated horn calls from a closer distance arises

    at the end of the movement when the English horn again calls, this time with no answer.

    Thus, even when distance change was possible, Berlioz did not utilize it. This argument

    is not meant to second-guess the composer, but rather to illustrate the possible lack of a

    paradigm for using sound spatialization as a musical motif.

    An example from the twentieth century comes from Charles Ives' The

    Unanswered Question (1906) which utilizes off-stage strings. In this example, the dialog

    is not augmented with spatialization, but rather uses more traditional motivic and

    orchestrational techniques. What is noteworthy, however, is that the questioning motif in

    the muted trumpet and the "answer" attempted by the flutes represents an activity

    occurring in an "environment" created by the spatially separated strings. These strings

    represent "The Silence of the Druids - Who Know, See and Hear Nothing."9 The

    listener's familiarity with a timbre - in this case a stringed instrument - is an important

    factor in a listener's perception of distance. With present-day electroacoustic music,

    many of the timbres are newly constructed and therefore unfamiliar. Therefore, it is more

    difficult for composers to create perceptions of virtual distance.

    In order for distance to take on a thematic role in a composition, it is necessary for

    one of the dialog participants to change location during the course of the duet - such as

    9Charles E. Ives, "The Unanswered Question," (New York: Southern Music

    Publishing, 1953), "Foreword".

  • from far to near. Space would then play a developmental role in the piece. While many

    musical examples exist in which a musical idea is presented from a distance, fewer exist

    in which the distance changes. Exceptions include the staging and blocking of opera

    productions. In this case, performers are instructed to change location during a piece.

    However, while these instances may be suggested by the music, they often are based

    entirely on the control of the opera production's artistic director and not the composer.

    Electroacoustic Works Employing Spatialization

    With the advent of electroacoustic music in the twentieth century, the conditions

    of presenting a spatial composition changed drastically. Formerly, spatialization was

    dependent on the location of performers. With electroacoustic music, the only limitation

    on spatialization is the number of loudspeakers and where they can be positioned. In

    addition, since a composer can choose to project the electronic sounds from one

    loudspeaker or another, space and movement become compositional options. However,

    this new freedom can also be viewed as coming at the expense of the "drama"

    experienced with a live performer.

    In 1956, Karlheinz Stockhausen was the first composer to use multichannel

    playback in his composition Gesang der Jünglinge. In this piece, Stockhausen serially

    distributed sounds over five groups of spatially separated loudspeakers.10 This was the

    first time that location was used as an element of serialization11 and conceivably raised

    the awareness - if not the stature - of spatialization in the minds of many. The following

    year, Edgard Varèse utilized spatialization techniques in his multitrack/multi-loudspeaker

    piece Poème électronique (1957-58) composed for the Brussels World's Fair. Although

    10G.W. Hopkins, "Stockhausen, Karlheinz," The New Grove Dictionary of Music and Musicians, 20 vols., ed. Stanley Sadie (London: Macmillan, 1980), XVIII, 152.

    11Ibid.


    the original format of this piece is lost, it reportedly consisted of three tracks of audio

    tape diffused over more than 425 loudspeakers.

    Many composers continued to make use of multi-loudspeaker systems in the

    1960's for live electronic and tape pieces. The compositions of one composer, Morton

    Subotnick, are significant to this dissertation because of his work with rotating sound

    events. Through hardware that he developed himself, Subotnick was able to quickly

    spatialize sound rotating through four loudspeakers. At certain speeds, the rotation itself

    would introduce new frequencies - presumably through amplitude modulation.

    John Chowning's four-channel composition Turenas (1972) and subsequent paper

    The Simulation of Moving Sound Sources (1977) were hallmarks of more recent efforts to

    spatialize sound. Chowning differed from others in that he incorporated several "cues" to

    sound localization. In addition to the intensity panning Chowning used for moving the

    sound between the four loudspeakers, he also included distance cues (local and global

    reverberation) and velocity cues (Doppler shift). Chowning influenced the aesthetics of

    electroacoustic music by allowing his choices of timbre to be influenced as much by

    psychoacoustics as by musical considerations. For instance, he took advantage of the fact

    that percussive sounds were easier to localize than long sustained sounds.12

    Sound Diffusionists

    The "diffusionist" approach to electroacoustic music spatialization began in the

    1970s and is closely associated with the French and Canadian acousmatique school of

    composition. This school emphasizes musique concrete and the practice of spatially

    "performing" pre-recorded tape compositions during a concert. Performance spaces,

    12Charles Dodge and T.A. Jerse, Computer Music: Synthesis, Composition, and Performance (New York: Schirmer Books, 1985), 256


    such as the Acousmonium,13 often have as many as several dozen loudspeakers. During a

    concert, composers diffuse their own compositions as well as the compositions of other

    composers. The following paragraphs illustrate common aspects of sound diffusion and

    are not applicable to all diffusion-related pieces.

    Diffusing stereo tape pieces over multiple loudspeakers involves two principal

    considerations. The first is the result of assembling a large number of loudspeakers.

    Many of the loudspeakers are often different in design and construction. The type and

    number of elements (woofers, tweeters, crossover circuits, etc.) as well as the crossover

    frequencies between elements greatly affect the response of a loudspeaker. Because each

    manufacturer uses different crossover frequencies for the elements making up a

    loudspeaker, each loudspeaker will possess a unique frequency response. Due to the

    varying frequency responses of the different loudspeakers, "inherent" sound diffusion

    occurs when reproducing tape pieces over multiple loudspeakers. For example, if one

    loudspeaker is more sensitive in a band-limited range of frequencies than another, then

    timbres that contain frequencies in the same range will "pop" out of that loudspeaker

    instead of another. This will cause groups of sound events that differ in spectral makeup

    to sound louder at some loudspeakers than others. The result of this latent intensity

    panning is that differing groups of timbres that are mixed together on a stereo tape sound

    as if they were routed from a multitrack tape to uniquely positioned loudspeakers.

    Therefore, a large assemblage of loudspeakers is more than simply a large "stereo" setup,

    i.e., the whole is greater than the sum of its parts with respect to the number of

    independently positioned sound sources.

    13 The Acousmonium in Paris was developed by the Groupe de Recherches Musicales.


    A second aspect of diffusion concerns understanding how listeners localize

    sounds. Once listeners localize a sound event, they are likely to continue localizing it at

    the same point, even if the intensity panning changes slightly. Diffusionists use intensity

    panning to "place" sound events at a sub-group of the loudspeakers. Next, they gradually

    increase the level of other loudspeakers. Listeners then localize new sound events at the

    recently boosted loudspeakers. In this way, the two-channel tape functions as a

    multichannel source. This technique works especially well with a combination of

    sustained sounds and short impulse-like sounds. Since the sustained timbres are more

    difficult to localize, they have a tendency to remain fixed at the point where a listener

    first localizes them. Shorter and more percussive sounds are easier to localize and,

    consequently, a listener will localize them at the location that reflects the most recent

    intensity panning. Thus, two tracks of tape can yield multiple output channels for

    supplying inputs to loudspeakers.

    These two factors - varying frequency responses of multiple loudspeakers, and a

    familiarity with how the ear localizes sound events - allow acousmatique composers of

    two-track stereo compositions to create live multi-loudspeaker pieces during

    performance. Two performance approaches have arisen from these techniques: blending

    and moving. In the "blending" approach, the performer seeks the optimal balance of the

    loudspeakers allowing the different timbres to appear from different loudspeakers. In the

    "moving" approach, the performer seeks not only to emphasize the left-right panning that

    already exists on the two-channel tape, but also to introduce new velocity effects, such as

    back to front.

    These techniques have also had an aesthetic influence on the compositional

    choices involved with electroacoustic pieces intended for diffusion. Besides the

    aforementioned reliance on the juxtaposition of sustained sounds versus sound events


    with shorter, percussive timbres, diffusionists also frequently use timbres that quickly

    build to a sharp percussive "thump." This effect is similar to the sound of reversing the

    waveform of a heavily reverberated percussive sound. Because diffusionists rely on

    percussive sound events, many diffusionists make use of musique concrete for source

    material. They avoid the long timbres created through time-stretching techniques such as

    the Fourier Transform, Linear Predictive Coding and granular synthesis.

    The sound diffusionists spatialize their compositions by independently controlling

    the level of input to each loudspeaker. In order to route the tape signal to each

    loudspeaker while maintaining independent control of each loudspeaker, it is necessary to

    first split the signal from two outputs (stereo) into multiple outputs with a distribution

    box. These devices are built with resistors and electrical wire. By tying multiple 1 kilo-

    ohm resistors from each output of a tape playback device to each of several additional

    outputs, a single signal can supply the multiple input channels of the mixer with the

    identical signal. It is necessary for the mixer to have either multiple bus outs or,

    preferably, direct outs on each channel. These direct-outs are outputs whose level is

    controlled by the fader for that channel; they change an input into an output. Using these

    direct outs, an eight-bus mixer could, in theory, have 16 or even 32 outputs as long as the

    user simply needed to be able to route a single signal out of each direct-out. For the

    diffusionists, each direct out supplies the input of a single channel of an amplifier which,

    in turn, powers a single loudspeaker. This author has built such a distribution box and

    has successfully used it during diffusion concerts involving sixteen and twenty

    loudspeakers.

    Despite the success of this approach to sound diffusion, there are limitations with

    the amount of control one has when spatializing sounds. For example, it is not possible

    to move a timbre smoothly from one loudspeaker to another independently of the other


    timbres on the tape. Some composers utilize these diffusion techniques with multitrack

    instead of stereo tapes. The multiple tracks permit more control of blending during the

    performance by preserving the separation of sound events on separate tracks. In these

    cases, the sound events are not always panned independently from track to track such as

    in two-channel stereo intensity panning.

    Summary

    A fundamental aesthetic difference exists in the role of spatialization of music

    before and after the development of electronic music. In the latter, space is used

    thematically with respect to the presentation of musical ideas. These spatial motifs

    involve techniques such as stating the sound events at different apparent distances from

    the listener, moving sound events and changing environments (e.g. large halls evolving

    into small rooms, or "moving walls").

    Along with this enhanced ability to control the apparent location of sound events

    comes the dilemma of "who" in the listening audience is able to appreciate it. With

    music of the past, for example, everyone hearing the piece could appreciate the off-stage

    instrumentalist. On the other hand, many computer music spatialization techniques are

    built on assumptions of precise audience location. The result is that the spatialization in

    many "concert" computer music pieces is more easily appreciated with headphones. This

    fact is nowhere more evident than in concerts of electroacoustic music where audience

    members attempt to sit as close as possible to the center of the seating area in order to

    gain a place precisely centered between the left and right loudspeakers.

  • CHAPTER II

    PSYCHOPHYSICS OF HUMAN LOCALIZATION

    Human sound localization has occupied a significant portion of psychoacoustic

    research throughout the twentieth century. These efforts have increased during the last

    two decades with the advent of virtual reality research. To create a virtual environment,

    it is necessary to simulate the acoustic space (auralization) as well as the stationary and

    moving sound sources within those virtual acoustic spaces (spatialization). While much

    of this past research has focused on 3-D audio, many of the insights into the mechanisms

    of localization are relevant to multiple loudspeaker spatialization systems. This chapter

    summarizes the basic principles of human sound localization as they relate to computer

    music spatialization systems.

    Localization Cues

    Sound localization is primarily achieved with the ears.14 Because of their location

    on either side of the head, the two ears sample the acoustic space differently. Differences

    in arrival time, intensity and spectrum help the brain determine the location of a sound by

    providing "cues" to a sound event's location. The brain derives these localization cues

    by comparing the sound received at each ear. This process is called "binaural hearing"

    and is analogous to stereoscopic vision. Thus, in addition to perceiving the content of a

    sound event, a listener also receives information about the location of the sound event.

    14Exceptions include some limited localization by means of the hairs on the back of the neck.



    The following figure illustrates the differences in what each ear perceives when a

    sound event is located off of the midline axis. The path along "a" is shorter than the path

    along "b".

    [Drawing omitted: a sound source located off the midline (0°); path "a" to the near ear is shorter than path "b" to the far ear.]

    Figure 1. Path length differences from sound event to each ear.

    The first result of the path length differences is the Interaural Time Difference

    (ITD). This localization cue contains two parts: interaural onset differences for short

    sounds and interaural phase differences (IPD) for continuous sounds. The two factors

    determining onset differences are the speed of sound (344 m/sec) and the extra distance a

    sound event must travel from one ear to the next. Sounds located 90° from either side of

    the midline axis have the maximum interaural distance. Assuming an average head

    diameter of 17.5 cm15, the maximum interaural time difference can be calculated from the

    following formula.

    time difference = .175 meters ÷ 344 meters/second ≈ .0005 second

    This time difference is an important factor in localizing short, percussive impulse sounds.
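
    The calculation above is easy to verify programmatically. The following Python sketch (added here as an illustration, not part of the original text) computes the maximum interaural onset delay from the assumed 17.5 cm head diameter and 344 m/sec speed of sound:

        # Maximum interaural time difference (ITD) for a source 90 degrees off the midline.
        # Assumed values follow the text: 0.175 m head diameter, 344 m/s speed of sound.

        SPEED_OF_SOUND = 344.0   # meters per second
        HEAD_DIAMETER = 0.175    # meters

        def max_itd(head_diameter=HEAD_DIAMETER, speed_of_sound=SPEED_OF_SOUND):
            """Extra travel time to the far ear when the source is fully lateral."""
            return head_diameter / speed_of_sound

        print(f"maximum ITD = {max_itd() * 1000:.3f} ms")   # about 0.5 ms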

    For sound events with slow attacks - those without strong transient characteristics - the

    15Simon Carlile, Virtual Auditory Space: Generation and Applications, (Austin-R.G. Landes Company, 1996), 35.


    hearing mechanism must compare the phase differences at each ear. The interaural phase

    difference ranges from 0° to 360° based on the diameter of the head and the wavelength

    of the sound. For example, a sound event with a frequency of 344 Hertz has a

    wavelength of one meter. If this sound is located 90° off the midline, it will "lead" the

    opposite ear by a phase difference of 63°. As the frequency increases, the wavelength

    becomes shorter until one cycle is equal to the diameter of the head. At even higher

    frequencies the sound will complete one cycle and begin another before reaching the

    second ear. The following chart lists several frequencies and their IPDs for sound events

    located 90° off the midline.

    Frequency          Wavelength (λ)      (360° × .175) ÷ λ = Δ°
    f = 344 Hz         λ = 1 m             63 ÷ 1 = 63°
    f = 688 Hz         λ = .5 m            63 ÷ .5 = 126°
    f = 1376 Hz        λ = .25 m           63 ÷ .25 = 252°
    f = 2752 Hz        λ = .125 m          63 ÷ .125 = 504°

    Table 2: Interaural Phase Differences (Δ°) for selected frequencies.
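
    The same head-diameter assumption reproduces the values in Table 2. The short Python sketch below (an added illustration, not from the dissertation) computes the interaural phase difference for a source 90° off the midline:

        # Interaural phase difference (IPD) for a fully lateral source.
        # Assumes a 0.175 m interaural path difference, as in the text.

        SPEED_OF_SOUND = 344.0   # m/s
        PATH_DIFFERENCE = 0.175  # m

        def ipd_degrees(frequency):
            """Phase lead of the near ear, in degrees, at the given frequency."""
            wavelength = SPEED_OF_SOUND / frequency
            return 360.0 * PATH_DIFFERENCE / wavelength

        for f in (344, 688, 1376, 2752):
            print(f"{f:5d} Hz -> {ipd_degrees(f):5.0f} degrees")
        # Prints 63, 126, 252 and 504 degrees, matching Table 2.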

    This table demonstrates that the IPD cannot function as a localization cue for

    frequencies above 1.5 kHz. Above this frequency, the phase differences have increased

    beyond the 360° of a single cycle. The brain is incapable of determining which ear is

    leading the other with respect to phase. Therefore, the brain relies on IPDs for sounds

    below 1.5 kHz and relies on ITDs for sounds above 1.5 kHz. This model is called the

    "Duplex Theory" and was the principal paradigm for human localization for many

    years.16

    Despite the general success of the Duplex Theory, it fails to explain sound

    localization in some cases.17 One such instance is the "cone of confusion". This model

    16Carlile, 28. 17Ibid.


    represents a cone extending outward from each ear. Sound events that originate from a

    point on this cone yield identical interaural time differences at the opposite ear.

    Figure 2. Cone of confusion.

    Without a visual aid, listeners sometimes localize the sound as originating from the

    incorrect side of the cone. This misperception is called the "front-back error". Most

    listeners still are able to correctly localize these sound events at a better-than-chance rate.

    This indicates that other cues - besides those related to the Duplex Theory - play a part.

    One such cue is the difference in sound pressure levels between the two ears.

    This Interaural Intensity Difference (IID) is also called the "obstacle" or "baffle effect".18

    High frequencies are more susceptible to this attenuation since their short wavelengths

    are readily absorbed or blocked by the head. This frequency-dependent attenuation

    produces what is called the acoustic "shadow effect". Low frequencies remain

    unaffected since their long wavelengths allow them to refract around the head with little

    attenuation.19 In addition to the attenuation of sound pressure levels at the far ear, there is

    also an increase in intensity at the closer ear. This frequency-dependent boosting is due

    to the reflection of sound off of the head and can amount to as much as 10 dB in some

    18Carlile, 29. 19Ibid.


    instances. Other localization cues include the spectral shaping a sound undergoes as it

    interacts with the convolutions of the outer ear, the pinna. This is the principal factor in

    vertical localization and is discussed in the following section.

    The Head Related Transfer Function and Vertical Localization

    The shape of the outer ear as well as the reflective properties of the shoulders and

    upper torso play an important role in spectrally shaping a sound event that originates

    from above the listener. Unlike ITDs and the resulting cone of confusion, the filtering

    effects of the outer ear convolutions are unique to each position in the three-dimensional

    space around the listener. While this filtering is present in the perception of sound from

    any direction, it is a critical aspect of vertical localization. These "spectral cues" are

    unique to each ear and are not limited to binaural localization. Therefore, persons

    possessing only monaural listening can also localize sound. This ability is degraded,

    however, when the folds of the pinna are filled. Figure 3 depicts the contours of the

    pinna on the following page.

    [Drawing omitted; labeled structures include the Lobule, Helix, Fossa of Helix, Antihelix, Cymba (concha), Crus helicis, Cavum (concha) and Antitragus.]

    Figure 3. Line drawing of the pinna, after Keidel W.D. et al.


    The convolutions of the pinna provide echoes of the direct sound reaching the

    eardrum. These short delays act as a comb filter to the original signal, producing

    alternating peaks and troughs in the frequency response of the ear for each location.

    Slight head movements often result in drastic changes in this filtering. Taken as a whole,

    the effects of the outer ears, shoulders, upper torso and head are collectively called the

    Head Related Transfer Function (HRTF).
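
    The comb filtering produced by a short pinna echo can be illustrated with a single delayed reflection. The Python sketch below is only a simplified stand-in: the 50-microsecond delay and 0.6 reflection gain are arbitrary illustration values, not measured HRTF data.

        # Comb filtering from one short reflection added to the direct sound.
        # Delay and reflection gain are illustrative values only.
        import math

        DELAY = 0.00005   # 50 microseconds
        GAIN = 0.6        # relative level of the reflected path

        def magnitude_response(frequency, delay=DELAY, gain=GAIN):
            """Magnitude of the direct sound plus one delayed copy at a given frequency."""
            phase = 2.0 * math.pi * frequency * delay
            real = 1.0 + gain * math.cos(phase)
            imag = gain * math.sin(phase)
            return math.hypot(real, imag)

        for f in (1000, 5000, 10000, 15000):
            print(f"{f:6d} Hz: {magnitude_response(f):.2f}")
        # The response alternates between peaks and notches (e.g. a notch near 10 kHz
        # for this delay), producing the pattern described above.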

    Since the head and pinna shape the spectrum of sound events based on the

    sound's location, it follows that listeners can better localize those sounds containing more

    spectral energy. Spectrally rich broadband sounds such as noises and impulses are easier

    to localize than harmonically simpler sounds such as pure tones. Some mating and

    warning signals illustrate this point. For instance, some species of bird employ pure

    tones when alerting other birds to danger in order to avoid revealing their own location.

    These same birds use spectrally rich mating calls so they can be easily located.20 The

    HRTF will be discussed further in the following chapter on spatialization systems.

    Moving Sound Sources

    Velocity cues enable the listener to perceive and track a sound event's motion.

    For a moving sound event there are changes in the ITDs and the sound event's spectrum.

    In addition, some frequency changes can also occur. Measuring the Minimum Audible

    Angle (MAA) yields the just-noticeable difference for changes in a sound event's

    location. This is analogous to measurements of the just noticeable differences in

    frequency. For those sound events that the listener cannot see, the MAA in front of the

    listener is within 1°. For sound events to the sides and behind the listener, the MAA rises

    to between 6° and 9°. For sound events off the midline, there are subtle eye movements

    20Douglas Webster, Richard Fay, and Arthur Popper, eds., Evolutionary Biology of Hearing, (New York: Springer-Verlag, 1992), 694.


    toward the direction of the sound source. These uncontrolled eye movements, called

    Nystagmus, aid the listener in quickly determining the source of a sound by directing the

    eyes even before the head moves. Once facing the general direction of the sound event,

    the listener can benefit from the higher resolution localization ability as well as the visual

    receptors. A second measurement, the Minimum Audible Movement Angle (MAMA),

    shows that motion detection is also more accurate in front of the listener.

    There are two types of sound event velocities relative to the listener: angular and

    radial. Angular velocity occurs when a sound event maintains a constant distance by

    moving in a circle around the listener.21 This kind of motion is important to this

    dissertation since the rotating sound fields are good indicators of the effectiveness of the

    spatialization. For instance, effective spatialization of rotating sound events should cause

    a sense of illusory self-rotation in listeners who have closed their eyes or are in a dimly lit

    space.

    The second type of motion, radial velocity, occurs when a sound event changes

    distance relative to the listener. For these sound events there is a shift in frequency called

    the Doppler Effect. For sounds moving closer to the listener, successive wavefronts are

    compressed resulting in an increase in frequency. For sounds moving away from the

    listener, successive wavefronts are spread out resulting in a longer wavelength and a

    lower frequency. The formula for the Doppler shift is given below.

    fo = fs × (v − vo) / (v − vs)

    v is the velocity of sound in the medium (344 meters/second for sound in air), vo is the

    velocity of the observer relative to the medium, vs is the velocity of the source relative to

    21Curtis Roads. The Computer Music Tutorial, (Cambridge, Massachusetts: MIT Press, 1996), 464.


    the medium, and fs is the frequency of the source when stationary.22 This produces a

    difference in pitch of a semitone for every 42 m.p.h. in relative speed. Since the

    frequency shift works with sounds moving nearer and farther away, a sound source

    traveling at 42 m.p.h. toward the listener will drop by one whole-step in perceived pitch

    once the sound source passes the listener.
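
    The figures above can be checked with a short Python calculation (added as an illustration; the variable names follow the formula quoted from Truax). A source moving at 42 m.p.h. shifts by roughly one semitone in each direction:

        # Doppler shift for a moving source and a stationary observer,
        # using fo = fs * (v - vo) / (v - vs) as given in the text.
        import math

        V_SOUND = 344.0        # m/s, speed of sound in air
        MPH_TO_MS = 0.44704    # meters per second per m.p.h.

        def doppler(fs, v_source, v_observer=0.0):
            """Observed frequency; source velocity is positive when approaching."""
            return fs * (V_SOUND - v_observer) / (V_SOUND - v_source)

        def semitones(ratio):
            return 12.0 * math.log2(ratio)

        v = 42.0 * MPH_TO_MS   # about 18.8 m/s
        print(f"approach: {semitones(doppler(440.0, +v) / 440.0):+.2f} semitones")
        print(f"recede:   {semitones(doppler(440.0, -v) / 440.0):+.2f} semitones")
        # Roughly +1 and -1 semitone, i.e. about a whole step drop as the source passes.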

    Distance Cues

    There are four principal cues to a sound event's distance: echoes, reverberation,

    overall attenuation and high-frequency attenuation. Sound event intensity diminishes

    with distance according to the Inverse Square Law. For example, at a distance of two

    feet from a sound source, the intensity is 1/4 of the sound intensity at a distance of one

    foot. At a distance of four feet the intensity is 1/16th the intensity at one foot.

    [Plot omitted: sound intensity falling off with distance.]

    Figure 4. Inverse Square Law.

    22Barry Truax, editor, Handbook for Acoustic Ecology (Vancouver: A.R.C. Publications, 1978), 38.


    Using this principle, spatialization systems attenuate the sound intensity

    exponentially as a sound event is spatialized to appear further away from the listener.

    Research using subjective evaluations of distance has also shown that the inverse "cube"

    is often more effective.23 Higher frequencies are further attenuated because of the

    absorbency of humidity.
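
    A compact way to compare the two attenuation rules mentioned above is to express each in decibels relative to a reference distance of one unit. The Python sketch below is an added illustration, not the dissertation's implementation:

        # Distance attenuation relative to a reference distance of 1 unit.
        # Inverse square: intensity ~ 1/d^2; inverse "cube": intensity ~ 1/d^3.
        import math

        def attenuation_db(distance, exponent=2):
            """Level drop in dB at the given distance (reference distance = 1)."""
            return 10.0 * math.log10(1.0 / (distance ** exponent))

        for d in (1, 2, 4, 8):
            square = attenuation_db(d, exponent=2)
            cube = attenuation_db(d, exponent=3)
            print(f"d = {d}: inverse square {square:6.1f} dB, inverse cube {cube:6.1f} dB")
        # At d = 2 the inverse square law gives 1/4 of the intensity (about -6 dB),
        # and at d = 4 it gives 1/16 (about -12 dB), matching the text.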

    The effects of the listening environment on the sound event provide additional

    distance cues. These effects are divided between the early echoes and the reverberation

    that follows. The early echoes are known as the initial or primary wavefront and play a

    part in localization. The precedence or "Haas" effect states that the ear will localize a

    sound in the direction of the earliest arriving sound event even if the initial echoes

    arriving at the other ear are louder. Larger spaces yield larger echoes.

    After the initial echoes, a dense fusion of thousands of other echoes arrives at the

    listener. This fusion, known as reverberation, is the most important factor in the

    perception of the auditory environment. It has also played an important role in computer

    music research. Computer music composers often simulate abnormally "dry"

    environments through means of anechoic chambers, recording studios, and sounds

    created through software synthesis. The composer adds reverberation to sound events in

    order to make them appear more natural. In this way the composer controls the virtual

    auditory space of the composition. Early techniques included reverberation chambers

    and spring reverberation units. Recent techniques include digital signal processing

    software and outboard devices.

    Composers adjust reverberation parameters to simulate the characteristics of an

    acoustic space. For example, increasing the high frequency damping simulates the

    23F. Richard Moore, Elements of Computer Music, (New Jersey: Prentice Hall, 1990), 370.


    effects of absorbent walls and surfaces. Increasing the initial delay of the reverberation

    simulates a larger room size. A third parameter - the ratio of the original sound level to

    the simulated reverberation level - is one of the principal cues for distance perception.

    The intensity of the primary sound relative to its reverberation is higher for nearer sound

    events. For distant sound events, the ratio of the reverberation to the original sound is

    more equal. Like intensity, research has shown that exaggerated reverberation levels

    often result in more convincing evaluations of distance.24 The following diagram

    illustrates the differences between the two types of reverberation. The pathway labeled

    "near" shows that a sound event originating near the listener will be much louder than the

    global reverberation arriving later. The pathway labeled "far" shows that the distant

    sound event and its local reverberation must travel a similar distance.

    Figure 5. Local and global reverberation.

    Charles Dodge writes of these two reverberations:

    When the sound is located close to the listener, most of the reverberation is global. This models an imaginary environment in which the listener is located in the center of a space with reflection in all directions. If the sound source is distant, most of the reverberation is local and comes from the same direction as the source because, in the imaginary environment, the source is closer to reflectors in that direction.25

    24Ibid.

    25Charles Dodge and T.A. Jerse, Computer Music: Synthesis, Composition, and

    Performance (New York: Schirmer Books, 1985), 243-4.
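
    One way to summarize the local/global reverberation cue is as a direct-to-reverberant ratio that falls with distance. The following sketch is an added illustration with arbitrary constants (the fixed reverberation level is an assumption, not a value from the dissertation):

        # Direct-to-reverberant ratio as a simple distance cue.
        import math

        REVERB_LEVEL = 0.05   # assumed, roughly constant global reverberation level

        def direct_to_reverb_db(distance):
            """Ratio of the direct level (inverse square) to a fixed reverberant level, in dB."""
            direct = 1.0 / (distance ** 2)
            return 10.0 * math.log10(direct / REVERB_LEVEL)

        for d in (1, 2, 4, 8):
            print(f"distance {d}: {direct_to_reverb_db(d):6.1f} dB direct-to-reverb")
        # Nearby sources sit well above the reverberation; distant sources approach
        # or fall below it, which is the distance cue described above.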


    Summary

    This chapter has provided a brief overview of the mechanisms of human sound

    localization. Over the years, psychoacousticians have refined all of the general principles

    presented in this chapter. This overview, however, has concentrated on the most

    common localization, velocity and distance cues as they relate to the spatialization

    system in the present research.

  • CHAPTER III

    CURRENT SOUND SPATIALIZATION TECHNIQUES

    Two categories of spatialization systems are in use today: multi-loudspeaker and

    HRTF systems. These spatialization systems exhibit differences that result in advantages

    and disadvantages that are unique to each system. Furthermore, there are many possible

    multi-loudspeaker configurations and approaches to using multi-loudspeaker systems. In

    order to clearly highlight the primary differences between HRTF and multi-loudspeaker

    systems, this discussion of multi-loudspeaker spatialization will focus on individually

    spatializing multiple sound events over eight loudspeakers. This discussion will also

    show that multi-loudspeaker spatialization is better suited to electroacoustic music

    diffusion in concert settings. Three recent multi-loudspeaker spatialization systems will

    be compared including how they differ from the present research.

    Multi-loudspeaker Spatialization Systems

    Most spatialization systems involve free-field loudspeakers positioned around the

    audience. These loudspeakers usually rest at floor level or on stands slightly above

    audience level. The minimum number of loudspeakers necessary to surround the

    audience is four. Once this field is established, other loudspeaker locations can include

    positions above and below the audience. Some loudspeakers face away from the

    audience and towards a wall. This makes the sound events originating from these

    loudspeakers sound more distant since only reflected sound and reverberation reach the

    audience.



    Composers spatialize their compositions over the multiple loudspeakers with a

    variety of tools, including multitrack tapes, MIDI devices and mixers for live diffusion.

    When using multitrack tapes or multiple output MIDI devices, composers send each

    output to a separate loudspeaker. Crossfading from one output to the next will move the

    sound events from one loudspeaker to another. Diffusionists using stereo tapes similarly

    crossfade channels on the mixer, each of which is connected to a different loudspeaker.

    In each approach, composers use intensity panning to "move" sound events from one

    loudspeaker to another. Intensity panning works by creating illusory sound event

    locations between loudspeakers. For example, if identical sound events are presented

    from different directions (such as different loudspeakers), the brain localizes the source

    of the sound in the direction of higher intensity. If the intensity is equal at each ear, the

    brain localizes the sound event at a virtual location between the two loudspeakers.

    Changing the relative intensities of the sound at the loudspeakers causes the sound event

    to "move" between the loudspeakers.

    [Drawing omitted: crossfading amplitude envelopes for Track 1 and Track 2.]

    Figure 6. Intensity panning.


    Composers "pan" sound events on tape by placing identical sound events on two

    tracks and crossfading them. Figure 6 on the previous page illustrates intensity panning

    between two channels. The computer sends the upper track to the left loudspeaker and

    the lower track to the right loudspeaker. When the computer plays the sound tracks, the

    sound event will appear to move between the two loudspeakers.
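
    Intensity panning of this kind is straightforward to express in code. The function below is an added illustration; the dissertation does not specify a particular pan law, so a common equal-power curve is assumed, with a pan position of 0 for the left track and 1 for the right track:

        # Equal-power intensity panning between two tracks/loudspeakers (assumed pan law).
        import math

        def pan_gains(pan):
            """Return (left_gain, right_gain) for a pan position in [0, 1]."""
            angle = pan * math.pi / 2.0
            return math.cos(angle), math.sin(angle)

        for pan in (0.0, 0.25, 0.5, 0.75, 1.0):
            left, right = pan_gains(pan)
            print(f"pan {pan:.2f}: L {left:.2f}, R {right:.2f}")
        # At pan 0.5 both gains are about 0.71, so the summed power stays constant and
        # the sound event is localized at a virtual point midway between the loudspeakers.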

    Multiple loudspeaker intensity panning is achieved with similar crossfades. The

    following figure shows the pattern of crossfades for a sound moving in a circle through

    four loudspeakers.

    [Drawing omitted: staggered crossfade envelopes for the left/right front and left/right back channels as a sound circles through four loudspeakers.]

    Figure 7. Four-Channel Intensity panning.
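
    The staggered crossfades suggested by Figure 7 can be generated by panning between adjacent loudspeaker pairs as the azimuth of the virtual source changes. The Python sketch below is an added illustration, not a reproduction of the dissertation's MAX patches; it assumes four loudspeakers at 45°, 135°, 225° and 315° and the same equal-power crossfade as the previous example:

        # Gains for a sound circling through four loudspeakers.
        # Assumed layout: RF = 45, RB = 135, LB = 225, LF = 315 degrees of azimuth.
        import math

        ORDER = ["RF", "RB", "LB", "LF"]           # loudspeakers in order around the circle

        def circle_gains(azimuth):
            """Equal-power gains for the two loudspeakers adjacent to azimuth (degrees)."""
            relative = (azimuth - 45.0) % 360.0    # angle measured from the RF loudspeaker
            sector = int(relative // 90.0)         # which adjacent loudspeaker pair
            frac = (relative % 90.0) / 90.0        # position between that pair
            gains = {name: 0.0 for name in ORDER}
            gains[ORDER[sector]] = math.cos(frac * math.pi / 2.0)
            gains[ORDER[(sector + 1) % 4]] = math.sin(frac * math.pi / 2.0)
            return gains

        for az in (0, 45, 90, 180, 270):
            print(az, {name: round(g, 2) for name, g in circle_gains(az).items()})
        # Sweeping the azimuth from 0 to 360 degrees produces one full rotation
        # through the four loudspeakers.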


    Intensity panning does not produce stereo imaging even though the sound is

    reproduced over two loudspeakers. True stereophonic imaging requires slight delays

    between the tracks. These delays simulate the ITDs of binaural hearing. Composers

    usually provide these cues with the assumption that the listener will be in a precise

    location between the two loudspeakers. Since the location of the audience is difficult to

    control in concert settings, composers often do not emphasize binaural cues in these free

    field environments.

    Multichannel tape pieces have existed since the works of Stockhausen and Varese

    in the 1950s. During the 1960s some composers used four-channel tapes played back

    over four loudspeakers surrounding the audience. Eight bus mixers and eight-track tapes

    have superseded these four-track tapes. For example, many studios contain an eight-bus

    mixer with at least one eight-track tape recorder such as the Alesis ADAT or Tascam

    DA-88. These configurations do not represent spatialization systems because these

    studios are designed to produce recordings destined for consumer stereo systems.26 As a

    result, these studios have no means of moving a sound event outside of a two-channel

    stereo field.

    Despite these limitations, composers use these multichannel tools to create multi-

    loudspeaker compositions. To move a sound event from one loudspeaker to the next, a

    composer must have some means of crossfading the sound events from track to track.

    For instance, a composer could split a signal to the eight output busses of a mixer and

    manually crossfade the sound to the eight tracks of the tape. There are several limitations

    to this approach. First, the composer could diffuse only one sound at a time. Second, it

    26 The additional tracks and channels allow multiple passes when recording musicians. A recording engineer either chooses between the different takes or chooses to layer several of the takes for the final stereo mixdown.


    is impossible to edit the crossfades once they are recorded to tape. Each "performance"

    onto tape would need to be perfect.

    An alternative approach involves using the computer to assist in crossfading

    soundfiles. When finished, the composer could play these soundfiles from the computer

    and record them onto the multitrack tape. There are two categories of computer

    programs in which a composer can accomplish this task: programs that generate

    soundfiles and programs that mix soundfiles. An example of the first category is the

    Csound software synthesis environment. This software allows the composer to produce

    automated multitrack panning by generating the multiple soundfiles all at once. Csound

    supports soundfile formats with one, two or four channels.27 While the composer is still

    responsible for determining the crossfades between the tracks, it is possible to include an

    unlimited number of simultaneous sound events as well as localization cues.

    In order to record the eight soundfiles onto eight-track tape, it is necessary to use

    an eight-channel sound card. Most computers come with only stereo sound output ports.

    Without multichannel sound output ports, the composer would be forced to make

    multiple two-channel recordings in order to record all eight tracks. Some computers

    (Silicon Graphics) and sound cards (Digidesign Audiomedia III) obtain four channels of

    output when combining stereo analog and digital outputs. In either case, multiple passes

    of recording, whether from stereo or four-channel sound ports, introduce potential

    synchronization problems.28

    Even with an eight-channel sound card, Csound's four-channel limit prevents a

    composer from easily generating eight-channel spatialization. To create eight channels of

    27 Michael Thompson at the University of North Texas Center for Experimental Music and Intermedia recently released an eight-channel format for Csound on the SGI.

    28 In the case of the ADAT, this synchronization problem can be solved by using an ADAT-to-SMPTE time code converter along with digital audio software capable of locking to SMPTE time code.


    output, the composer would need to manually crossfade between two four-channel

    soundfiles. It is possible to create eight-channel compositions using Csound, provided

    the composer does not use its two- or four-channel soundfile formats. The composer

    would instead create instruments that generate eight separate soundfiles as output.

    Mixing software is required to play the individual soundfiles for recording onto eight-

    track tape.
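    A minimal sketch of this workaround, generating one mono soundfile per loudspeaker rather than relying on Csound's two- or four-channel formats, is given below in Python using the standard wave module. The test tone, gain curve and file naming are assumptions for illustration; they do not reproduce any instrument used in the composition.

```python
import math
import struct
import wave

SAMPLE_RATE = 44100
DURATION = 4.0            # seconds for one trip around the eight loudspeakers
NUM_CHANNELS = 8

def write_channel_files(prefix="spatial"):
    """Write eight mono WAV files, one per loudspeaker output.

    A 440 Hz tone is crossfaded around the ring with an equal-power law,
    so playing the eight files in sync on eight outputs yields a rotating
    sound event (an illustrative stand-in for real composed material).
    """
    num_frames = int(SAMPLE_RATE * DURATION)
    for ch in range(NUM_CHANNELS):
        samples = []
        for n in range(num_frames):
            angle = (n / num_frames) * NUM_CHANNELS        # source position, 0..8
            dist = min(abs(angle - ch), NUM_CHANNELS - abs(angle - ch))
            gain = math.cos(min(dist, 1.0) * math.pi / 2)  # non-zero only near ch
            tone = math.sin(2 * math.pi * 440 * n / SAMPLE_RATE)
            samples.append(int(32767 * gain * tone))
        with wave.open(f"{prefix}_{ch + 1}.wav", "wb") as f:
            f.setnchannels(1)
            f.setsampwidth(2)
            f.setframerate(SAMPLE_RATE)
            f.writeframes(struct.pack(f"<{num_frames}h", *samples))

write_channel_files()
```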

    Current digital audio cards are intended for use as virtual multitrack tape

    machines. Most professional cards provide from four to sixteen outputs. Some

    multichannel cards include Digidesign's ProTools III setup, Emagic's Audiowerk8, Korg

    1212, ADAT interface cards, and others. The Audiowerk8 card is typical in that it

    provides two channels of input and eight channels of output. The user chooses a single

    output pair for each track and pans this track to the left or right output. In order for the

    sound to move between the eight outputs, the user must crossfade between eight tracks -

    each of which is routed to a unique output. With this configuration, each sound could

    potentially occupy eight tracks. Since many compositions contain several layers of

    timbres, the card's playback capacity becomes an issue. Emagic's Logic Audio

    software utilizing the Audiowerk8 card can play up to 24 tracks at one time. Thus, the

    maximum number of simultaneous sound events moving among the eight outputs is

    three.

    The ProTools III system is similar in its assignment of tracks to outputs. In this

    case, however, the software is open-ended and supports plug-ins developed by third-party

    software companies. These plug-ins enhance the capabilities of the core ProTools

    system. One company, APB Tools, has developed a plug-in named Sigma 1 (1997) that

    accepts up to sixteen inputs and routes each to any of sixteen outputs. The only


    drawbacks to this system are its resolution (10 milliseconds) and its omission of spectral,

    velocity, or zenith cues.

    Besides the APB Tools plug-in, most multitrack mixing programs lack a means of

    controlling a sound's movement through the multiple outputs of a digital audio card.

    Like the multitrack tape systems they are designed to replace, these multichannel

    software and hardware systems are intended for creating stereo mixes. Thus, for the

    composer attempting to use multi-loudspeaker spatialization, having access to multiple

    outputs is not sufficient. The composer must also be able to control a sound event's

    intensity at each output simultaneously.

    A concurrent development in the evolution of multichannel sound systems is the

    large number of consumers currently using multichannel home theater systems. One

    system, Dolby Surround Sound Systems (DSSS), uses what it calls 5.1 and 7.1

    configurations. The "5.1" refers to a configuration of five normal loudspeakers with one

    subwoofer. The placement of the loudspeakers must be precise in order to benefit from

    the spatialization. The system routes three channels of audio to the left-, center- and

    right-front loudspeakers and two channels to the left- and right-back loudspeakers. Since

    the subwoofer radiates frequencies that are difficult to localize, manufacturers specify

    that its exact placement is unimportant. DSSS encodes the spatialization information

    along with the analog audio signals and sends them as a single input into the home theater

    receiver/amplifier. Until recently, access to the means of encoding a sound event with

    surround sound information had been restricted because encoding was only accessible

    through proprietary hardware. Post-production companies usually purchased this

    hardware for use in audio suite mixers. In June 1997, Dolby Labs released a ProTools

    plug-in for encoding the surround sound information.


    The foregoing discussion has dealt with spatialization of sound events on the

    azimuthal plane. In order to add distance cues, it is necessary to add reverberation. Since

    reverberation is made up of thousands of echoes, it is impractical to attempt to calculate

    each individual echo. Schroeder and other pioneers in this field developed useful

    reverberation by means of combining comb and allpass filters. Composers can adjust the

    intensity of the original sound event relative to the reverberated version of the sound

    event and, thereby, provide cues to distance.
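    The comb-and-allpass approach mentioned above can be sketched in a few lines of Python. The delay lengths and gains below are illustrative values rather than Schroeder's published settings, and the wet/dry mix stands in for the ratio of the original sound event to its reverberated version that serves as a distance cue.

```python
import numpy as np

def comb(x, delay, feedback):
    """Feedback comb filter: y[n] = x[n] + feedback * y[n - delay]."""
    y = np.zeros(len(x))
    for n in range(len(x)):
        y[n] = x[n] + (feedback * y[n - delay] if n >= delay else 0.0)
    return y

def allpass(x, delay, gain):
    """Schroeder allpass: y[n] = -gain*x[n] + x[n-delay] + gain*y[n-delay]."""
    y = np.zeros(len(x))
    for n in range(len(x)):
        xd = x[n - delay] if n >= delay else 0.0
        yd = y[n - delay] if n >= delay else 0.0
        y[n] = -gain * x[n] + xd + gain * yd
    return y

def reverberate(dry, distance=0.5):
    """Mix parallel combs into series allpasses; raising `distance` (0..1)
    increases the reverberant level relative to the direct sound."""
    wet = sum(comb(dry, d, g) for d, g in [(1687, 0.77), (1601, 0.80),
                                           (2053, 0.73), (2251, 0.70)])
    for d, g in [(347, 0.7), (113, 0.7)]:
        wet = allpass(wet, d, g)
    wet /= np.max(np.abs(wet)) + 1e-12          # normalize the wet signal
    return (1.0 - distance) * dry + distance * wet
```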

    As stated in the previous chapter, local reverberation occurs because of reflective

    surfaces near the origin of the sound event. The listener perceives the local reverberation

    along with the sound event in the direction of the sound event. Listeners perceive the

    second type, global reverberation, equally from all directions. With global reverberation

    and multiple loudspeakers, problems arise when trying to spatialize the reverberated

    versions of a sound event to separate loudspeakers. The number of audio channels could

    easily double if the composer spatialized the original sound and the reverberated sound

    separately. In natural environments, listeners localize sound events from one direction

    while localizing reverberation from many other directions. Therefore, regardless of the location of a sound event in an eight-loudspeaker spatialization system, each sound event requires eight tracks of soundfiles: one loudspeaker might carry the original sound event, but the other loudspeakers should carry reverberated versions of that sound event.

    Diagonal Crossfades

    One difference between four-channel and eight-channel systems is the distance

    between diagonals. In a four-channel system, a unit one distance between adjacent

    loudspeakers would yield a √2 distance between diagonal loudspeakers. This yields an

    adjacent-diagonal ratio equal to approximately 1:1.41. The ideal crossfade formula for


    adjacent loudspeaker intensity panning (such as from left- to right-front) would be close

    to the ideal crossfade formula for diagonal panning (such as right-front to left-back). For

    eight-loudspeakers that are arranged in a circle around an audience, this ratio is 1:2.61.

    Therefore, in eight-loudspeaker spatialization, intensity panning is less effective if one

    uses the same formula for adjacent and diagonal panning. The following table shows

    these ratios in four- and eight-loudspeaker configurations.

    Setup             Adjacent Distance   Diagonal Distance   Ratio
    4 Loudspeakers    1                   √2 ≈ 1.41           1:1.41
    8 Loudspeakers    1                   2.61                1:2.61

    Table 3. Ratios for adjacent and diagonal distances.

    The following diagram illustrates how these values were derived.

    Figure 8. Four- and eight-loudspeaker distances (adjacent distances set to 1).
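    These values follow from elementary geometry: for loudspeakers evenly spaced on a circle, the adjacent distance is the chord between neighboring loudspeakers and the diagonal distance is the diameter. The short check below (an illustration only, not part of the spatialization system) reproduces the 1:1.41 and 1:2.61 ratios of Table 3.

```python
import math

def adjacent_to_diagonal_ratio(num_speakers):
    """Ratio of opposite-loudspeaker distance to adjacent-loudspeaker distance
    for loudspeakers evenly spaced on a circle (adjacent distance = 1)."""
    adjacent = 2 * math.sin(math.pi / num_speakers)   # chord between neighbors
    diagonal = 2.0                                     # diameter of a unit-radius circle
    return diagonal / adjacent

print(round(adjacent_to_diagonal_ratio(4), 2))   # 1.41  (Table 3, four loudspeakers)
print(round(adjacent_to_diagonal_ratio(8), 2))   # 2.61  (Table 3, eight loudspeakers)
```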


    Four Multichannel Spatialization Systems

    John Chowning's system was the first four-channel system that took into account

    localization cues for distance and velocity. In his highly successful piece, Turenas

    (1972), he made use of software that he designed and later discussed in his paper "The

    Simulation of Moving Sound Sources." In this work, four loudspeakers surround the

    audience. Chowning moved sound events by means of control trajectories that he drew

    onto a CRT screen. The computer then generated the four channels of audio. Chowning

    recorded these to four-track tape for later playback. In theory, Chowning could

    simultaneously spatialize an unlimited number of sound events over four loudspeakers

    with this non-real-time processing.

    In 1990 Marina Bosi developed another four-channel system utilizing two

    Yamaha DMP-7 automated mixers controlled by software on the Apple Macintosh

    computer. The system spatialized signals that were split and sent to each mixer. Since

    this system relied on MIDI control of the mixers, it provided real-time control of the

    sound spatialization through trajectories drawn on the computer screen. Distance cues

    included reverberation built into the mixers. Since this system relied on the stereo outputs

    of two identical mixers, its principal limitation was that it could only spatialize sounds as

    a single group - not independently. This limitation is also found in a second real-time

    system called the Quad Pan system (1996) developed at the Laboratorio de Informatica y

    Electronica Musical in Spain. Quad Pan also requires two mixers with complete MIDI

    control, two stereo reverb units, a MIDI clock generator and a Macintosh Computer.

    Unlike the other systems, the one presented in this paper does not presently

    contain reverberation. An optional hardware addition to the Kurzweil K2500, which was


    originally due to be publicly available in 1997, will provide reverb on each of the eight

    outputs and will enable the present system to add reverberation to its distance cues.

    A table summarizing the similarities of the previous three systems as well as the

    system presented in this paper is given below.

    Attribute                            Chowning   Bosi   Quad Pan   Morgan
    Real-time input                      No         Yes    Yes        No
    Real-time output                     No         Yes    Yes        Yes
    Reverberation                        Yes        Yes    Yes        No
    Doppler                              Yes        Yes    No         Yes
    Number of independent sound events   ∞          1      1          6
    Number of outputs                    4          4      4          8

    Table 4. Four multichannel spatialization systems.

    HRTF Spatialization Systems

    HRTF spatialization, the second category of spatialization systems, is the closed-

    field approach. Instead of using multiple loudspeakers, a composer presents soundfiles

    over headphones or two precisely positioned loudspeakers. This approach is also referred

    to as 3-D audio since the sound is made to appear as if it were originating from positions

    above or behind the listener instead of simply from the two loudspeakers.

    To use the HRTF as a spatialization system, it is necessary to first take HRTF

    measurements. In an anechoic environment, researchers play impulses from locations on

    an imaginary grid around the listener or mannequin head. They use an impulse because

    of its wide bandwidth. The researchers simultaneously record the impulse at each ear to

    obtain a binaural impulse response for the sound at that location.29 After compensating

    29 Durand R. Begault, 3-D Sound for Virtual Reality and Multimedia (Boston: AP Professional, 1994), 135.


    for the impulse responses of the loudspeakers and microphones, researchers obtain a

    graph indicating the frequency response of the outer ear for each location.

    After obtaining the measurements, it is a simple matter to spatialize sound events

    using the HRTF information. Sound events that are to be spatialized at a location are

    convolved with the measurement for that location. This convolution yields a stereo

    signal with the same binaural information the sound event would have possessed had it

    come from that location originally. Adding environment and distance information is a

    similar process. Researchers convolve the HRTF for a location with the impulse

    response of an acoustic space. The impulse response of a space will include the

    frequency response of the space's reverberation, echo and decay time. This approach to

    creating virtual auditory environments modeled after a real or imaginary space is called

    auralization.
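    The convolution step is straightforward once a pair of measured impulse responses is in hand. The sketch below assumes hypothetical arrays hrir_left, hrir_right and room_ir holding the two ear measurements for one direction and a room impulse response; it is a minimal illustration and does not reproduce SoundHack's or Csound's implementations.

```python
import numpy as np

def spatialize(mono, hrir_left, hrir_right):
    """Convolve a mono sound event with the left- and right-ear impulse
    responses measured for one direction, yielding a binaural stereo pair."""
    left = np.convolve(mono, hrir_left)
    right = np.convolve(mono, hrir_right)
    return np.stack([left, right], axis=1)

def auralize(mono, hrir_left, hrir_right, room_ir):
    """Add environment cues by also convolving with a room impulse response,
    the "auralization" step described in the text."""
    binaural = spatialize(mono, hrir_left, hrir_right)
    return np.stack([np.convolve(binaural[:, 0], room_ir),
                     np.convolve(binaural[:, 1], room_ir)], axis=1)
```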

    While the HRTF assists in understanding localization where the Duplex Theory is

    inadequate, implementing HRTFs as a sound spatialization tool also introduces several

    new problems. The first is in obtaining HRTF measurements that work well for more

    than one person. One method of obtaining these measurements involves using

    mannequin heads with microphones mounted in place of the tympanic membrane. A

    second approach involves using probe microphones with live test subjects. Since each

    person has a unique anatomy and a unique HRTF, the first problem arises when using one person's set of measurements for another. Averaging multiple HRTFs is not adequate since the

    differences often cancel out from one listener to the next. For this reason, a distinction is

    made between individualized HRTFs made from a single listener and non-individualized

    HRTFs that are the result of averaging. Another successful method of obtaining useful

    HRTF measurements is through mathematically modeling the impulse response of the ears

    by averaging the structural qualities of the outer ear.


    A second problem with spatializing by means of spectral shaping is that it

    presupposes a listener's familiarity with the sound's original spectrum. Localization is

    usually much more accurate for familiar sounds such as the human voice. On the other

    hand, localization is less accurate for unfamiliar sounds such as those created with

    software synthesis.

    A third problem with using the HRTF as a spatialization tool arises when

    reproducing sounds processed with HRTF information. The sound events must be

    reproduced in a closed listening environment (such as headphones or precisely positioned

    loudspeakers) because even slight head movements significantly affect the spectral

    modifications that a sound event undergoes on its way to the ear canal. One disadvantage

    associated with headphones is that users often localize the sound events inside the head.

    The main disadvantage with HRTF spatialization is that these closed field setups are not

    practical for most concerts of electroacoustic music. Despite this problem, HRTFs will

    continue to play an important role in compositions that rely on virtual reality because HRTF processing is a more economical method of reproducing acoustic environments for a single user.

    Since the calculations required for spatializing sound events are extremely CPU-

    intensive, there are presently no general-purpose computers that are capable of creating

    3-D audio spatialization in real-time without additional hardware. There are, however,

    software programs that process the calculations off-line. One example is Thomas Erbe's

    adaptation of HRTF measurements taken from Jens Blauert. In Erbe's program

    SoundHack the user opens a sound file and chooses the binaural filter option. The user

    chooses to place a sound at a location in the 360° field surrounding the listener or chooses

    the moving angle option and draws in a trajectory. The program outputs a stereo file

    with 3-D control of the soundfile's azimuth location. It does not presently deal with

    zenith, distance or reverberation.


    The Csound synthesis environment also provides non real-time HRTF processing.

    The Csound program generates audio files by "playing" a score file with instruments

    built by the composer. Programmers have supplemented the built-in instrument opcodes

    with HRTF unit generators. These opcodes convolve audio signals with HRTF

    measurements based on user-determined sound trajectories.

    Summary

    HRTF spatialization systems rely to a great extent on the location of the listener.

    This approach normally requires a closed-field system (headphones). Sounds are

    processed with filters and reverberation to provide the same localization cues the listener

    would encounter if the sounds had been heard in the free field. While this second

    approach is more processor-intensive, it has the advantage of being able to simulate many

    environments with fewer pieces of hardware, such as amplifiers and loudspeakers.

    Nevertheless, it is not practical for free-field situations in which there are many listeners

    in one location. For computer music presented in these free-field loudspeaker concert

    setups, the prevailing attitude has been that this processing is not only imperceptible to

    listeners in some locations, but is detrimental to listeners in other locations.30

    In its defense, however, dependence on listener location - not to mention head

    position - is not limited to HRTF-based spectral cues alone. Intensity panning is based

    on varying the intensity level of two identical sound sources between two loudspeakers.

    The crossfades are calculated based on an imaginary listener centered between the two

    loudspeakers. Changing the position of the listener relative to the two loudspeakers has a

    profound effect on the localization of a sound event. The issue then becomes the

    resolution of the loudspeaker setup. While Chowning and many others consider four

    30 Marina Bosi, "An Interactive Real-time System for the Control of Sound Localization," The Computer Music Journal XIV, No. 4, 1990, 59.


    loudspeakers the minimum number, Blauert considers six loudspeakers the minimum.31

    If more loudspeakers surround an audience, the resolution of the intensity panning will be

    higher and the audience positioning can be more flexible. With more loudspeakers,

    however, there are problems with diagonal intensity panning and with finding means for

    controlling individual sound event spatialization.

    One can argue that a primary factor keeping computer music in the tradition of

    conventional concert music is the necessity of presenting stereo pieces from two

    loudspeakers located on a stage and directed at an audience. This two-channel setup is

    less effective in a non-traditional listening space such as a gallery or sound installation.

    These problems are mitigated with a higher resolution listening space made up of

    multiple loudspeakers.

    In theory, the multi-loudspeaker and 3-D audio approaches are

    incompatible. Composers usually choose one or the other according to the intended

    audience. If the audience is a multimedia user, the 3-D audio approach works best. If the

    piece is intended for a concert venue, then a multi-loudspeaker setup is more appropriate.

    In practice, however, composers often use the two approaches simultaneously.

    Composers use localization cues to modify sound events that are intended for

    performance in the free field even though these cues may not be perceptible to all of the

    audience.

    31 Blauert, 273.

    CHAPTER IV

    PREVIOUS COMPOSITIONS UTILIZING EIGHT LOUDSPEAKER SYSTEMS

    Octophonic Etude I: Circles (1993)

    The Merrill Ellis Intermedia Theater (MEIT)32 eight-loudspeaker sound system

    has allowed many experiments with sound spatialization. Four stereo amplifiers power

    the eight loudspeakers arranged in a circle around the audience seating area. These

    amplifiers receive their inputs from an eight-bus mixer. The diagram below illustrates

    the loudspeakers' arrangement and mixer bus assignments.

    Figure 9. MEIT loudspeaker setup.

    This configuration permits independently routing a sound to each of the eight

    loudspeakers, even though the loudspeakers are arranged as four stereo pairs. By using

    32 This facility is located in the Center for Experimental Music and Intermedia (CEMI) at the University of North Texas.



    MIDI devices with multiple outputs and MIDI sequencing software, one can control the

    sound to each loudspeaker. For the first Etude, each channel of an EMU Proteus sound

    module supplied input to a separate loudspeaker. Proteus modules have six outputs

    configured as three stereo pairs. Two modules were needed to obtain eight outputs. By

    routing each MIDI channel to a unique output, the timbre on that MIDI channel appeared

    at a single loudspeaker location. For a timbre to sound as if it was moving through the

    eight loudspeakers, the same timbre would need to be assigned to eight MIDI channels.

    Considering the MIDI specification's limit of sixteen channels, this configuration

    permitted the use of two unique timbres. However, each timbre moves through the eight

    loudspeakers independently of the other timbre. The following table shows the Proteus

    configurations for a single timbre.

    MIDI Channel   Proteus   Output      Pan   MEIT Loudspeaker
    1              I         Main        L     7
    2              I         Main        R     2
    3              I         Sub-out 1   L     6
    4              I         Sub-out 1   R     4
    5              II        Main        L     8
    6              II        Main        R     3
    7              II        Sub-out 1   L     5
    8              II        Sub-out 1   R     1

    Table 5. MIDI channel mapping for Octophonic Etude I.

    A program was written to increment the channel/loudspeaker number while

    repeating a single pitch. The resulting text file was converted into a Standard MIDI File

    with David Zicarreli's TabConverter. This file was then imported into a MIDI

    sequencing program. When played with the sequencing software, each successive

    sounding of a pitch originated from a different loudspeaker. With the configuration

    shown in Figure 9, this produced a "rotating" sound field. Adjusting parameters in the


    program produced different rates of rotation. The sequencing software supported

    duplicating, transposing and assembling these circles into the final version of the piece.
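    A minimal sketch of such a channel-incrementing program is given below. It simply emits a list of (start time, MIDI channel, note) events that a converter or sequencer could turn into a Standard MIDI File; the note, timing and event format are illustrative assumptions rather than the data used for the Etude.

```python
def rotating_field(num_events=64, note=60, seconds_per_event=0.25,
                   num_channels=8):
    """Repeat one pitch while incrementing the MIDI channel 1..8, so that
    with a mapping like Table 5 each successive attack sounds from the
    next loudspeaker and the field appears to rotate.

    Slower or faster rotation comes from changing `seconds_per_event`,
    mirroring the different rates of rotation described in the text.
    """
    events = []
    for i in range(num_events):
        channel = (i % num_channels) + 1          # MIDI channels 1-8
        start = i * seconds_per_event
        events.append((start, channel, note))
    return events

# Print one full rotation of eight attacks.
for start, channel, note in rotating_field(num_events=8):
    print(f"{start:5.2f}s  channel {channel}  note {note}")
```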

    The next step involved choosing timbres that were identical on both Proteus sound modules. Even slightly dissimilar sounds (such as a clarinet sampled differently for each

    model) disrupted the continuity of a single sound event moving through the eight

    loudspeakers. Simple waveforms (sine, square, etc.) were the only sounds that were

    identical between the two modules. These sounds proved unmusical and difficult to

    localize because of their pure harmonic content. Even with different sounds, however,

    one could still perceive the rotating sound fields since each circle had its own rate of

    rotation.

    Nevertheless, this spatialization system was revised for the Kurzweil K2000. The

    K2000 has eight independent outputs - also grouped as stereo pairs - making it possible

    to use a single device to supply all eight loudspeakers. This eliminated the problem of

    matching timbres. The revised channel/loudspeaker setup is given below.

    MIDI Channel   K2000 Output   Pan   MEIT Loudspeaker
    1              A              L     7
    2              A              R     2
    3              B              L     6
    4              B              R     4
    5              C              L     8
    6              C              R     3
    7              D              L     5
    8              D              R     1

    Table 6. MIDI channel mapping for the K2000.

    In addition, the K2000 offered digital signal processing (DSP) options such as filters and

    wave shaping. As a result, one could manipulate other sound characteristics besides

    spatialization.


    Octophonic Etude II: Hsu, Waiting (1995)

    In Etude I, each MIDI channel was panned to a single output. By assigning each

    timbre to a single channel, that timbre originated from one of the eight loudspeakers.

    This is the lowest possible resolution of an eight-channel spatialization system. Most

    MIDI devices also support intensity panning between outputs. Using this form of

    panning might have improved spatialization with eight loudspeakers had it not been for

    several critical limitations. First, MIDI Pan messages (controller ten) control the location

    of a sound between a stereo pair of outputs. This controller does not pan between outputs

    that are not stereo pairs. The diagram below illustrates this limitation.

    Figure 10. MIDI Pan limitations: AL to AR but no AR to BL.

    In addition, MIDI Pan messages often do not have the same resolution as other MIDI

    controllers. For example, MIDI Volume (controller seven) has a seven-bit resolution

    resulting in 128 discrete steps. While MIDI panning responds to all 128 values, it

    quantizes these values to a resolution of 15 discrete locations. The following table illustrates this effect.

    Number   MIDI Controller 10 values   Pan location
    1        1-8                         hard-left
    2        9-16
    3        17-24
    4        25-32
    5        33-40
    6        49-56
    7        57-64
    8        65-72                       center
    9        73-80
    10       81-88
    11       89-96
    12       97-104
    13       105-112
    14       113-121
    15       122-128                     hard-right

    Table 7. MIDI Pan values and locations.
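    The quantization in Table 7 can be modeled as a mapping of the 128 controller values onto 15 pan locations. The uniform rounding in the sketch below is an approximation for illustration; actual devices may place the breakpoints slightly differently, as the irregular ranges in the table suggest.

```python
def pan_location(controller_value, num_locations=15):
    """Map a MIDI controller 10 value (0-127) onto one of 15 discrete pan
    locations: 0 = hard-left, 7 = center, 14 = hard-right (approximate
    model of the quantization shown in Table 7)."""
    controller_value = max(0, min(127, controller_value))
    return round(controller_value / 127 * (num_locations - 1))

print(pan_location(0))     # 0  -> hard-left
print(pan_location(64))    # 7  -> center
print(pan_location(127))   # 14 -> hard-right
```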

    There are two additional limitations of MIDI-based intensity panning that make it

    unsuitable for multi-loudspeaker spatialization. First, the intensity panning messages

    only affect the location of notes that sound after the panning messages are received.

    Notes that are already sounding are not affected. Second, MIDI intensity panning does

    not provide additional localization cues such as spectral or arrival time differences.

    The K2000's Panner DSP object is an alternative method of panning sound

    events. The Panner object can pan a sound between two outputs, but does not have the

    limitations of MIDI controller panning. For instance, this object can pan a sound between two outputs that are not from the same stereo pair (e.g., output A hard-left to output

    D hard-right). The Panner object also has a 128-step resolution. Lastly, unlike MIDI Pan

    messages, the K2000 Panner object pans sounding note events. Therefore, using the

    K2000 Panner object proved to be a viable option for improving the resolution and



    effectiveness of the spatialization. However, before proceeding with the details of

    Octophonic Etude II, an overview of the K2000 architecture is necessary.

    K2000 Programs, Layers and Algorithms

    A sound or timbre on the K2000 - called a "program" - is made up of one, two or

    three "layers"33. Each layer of a program contains a keymap of one or more samples and

    an algorithm. The algorithm routes the keymap signal through a configuration of DSP

    objects. Each of the 31 possible algorithms is unique in its signal flow and DSP choices.

    Some algorithms route the signal straight through each stage while others split the signal

    and recombine it at a later stage. Each algorithm contains five stages. The first is always

    the PITCH stage and the last is always some type of AMP stage. According to the

    algorithm, the three middle stages can be occupied by one-, two-, or three-parameter DSP

    objects. The choice of DSP objects varies according to the algorithm and includes many

    types of filters, waveforms, wave transformers, mixers, etc.
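    The program/layer/algorithm hierarchy just described can be summarized as a small data model. The field names and the example stages below are paraphrases for illustration only and do not mirror the instrument's internal data format.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Stage:
    """One of the five algorithm stages (PITCH, up to three DSP stages, AMP);
    each stage exposes control-source inputs for its parameters."""
    name: str                       # e.g. "PITCH", "LOPASS", "SHAPER", "AMP"
    control_sources: List[str] = field(default_factory=list)

@dataclass
class Layer:
    """A keymap of samples routed through one of the 31 algorithms."""
    keymap: str
    stages: List[Stage]             # begins with PITCH, ends with an AMP stage

@dataclass
class Program:
    """A K2000 timbre: one to three layers (drum programs excepted)."""
    name: str
    layers: List[Layer]

# A three-DSP-stage example like the one in the text: lowpass -> shaper -> allpass.
example = Program(
    name="Example timbre",
    layers=[Layer(
        keymap="Example keymap",
        stages=[Stage("PITCH", ["note number"]),
                Stage("LOPASS", ["cutoff: mod wheel"]),
                Stage("SHAPER", ["amount: MIDI controller"]),
                Stage("ALLPASS", ["center frequency: LFO"]),
                Stage("AMP", ["note velocity"])])])
```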

    A control source input can affect each of the five stages. If the algorithm contains

    only a single DSP object, that object will still have three control source inputs. For

    example, a single DSP object such as the parametric EQ function contains control source

    inputs for high frequency, low frequency and bandwidth. Alternatively, a three-stage

    algorithm might use a lowpass filter followed by a shaper, and an allpass filter. The

    lowpass filter has a control source input affecting its cutoff frequency; the shaper has a

    control source input affecting its amount of shaping; the allpass filter has a control source

    input affecting its center frequency. These control source inputs allow each of the five

    stages to be manipulated in a variety of ways. For example, amplitude can be linked to

    note velocity tracking, note number tracking or real-time controllers such as the

    33 An exception is the "drum program" discussed later in this chapter.


    modulation wheel (MIDI controller 1). The K2000 can also receive "unassigned" MIDI

    controllers such as MIDI 22 that originate from a keyboard or a computer. Finally, the

    K2000 also has its own local controller sources such as low frequency oscillat