
Wideband Speech and Audio Coding

Powerful algorithms and standards are now available to enhance services in communication-based and storage-based audio-only and audiovisual applications.

Peter Noll

Although high bit rate channels and networks have become more easily accessible, low bit rate coding of speech and audio signals has retained its importance. The main motivations for low bit rate coding are the need to minimize transmission costs or provide cost-efficient storage, and the demand to transmit over channels of limited capacity such as mobile radio channels. In addition, there may be a need to share capacity for different services such as voice, audio, data, graphics, and images in integrated services networks, and to support variable-rate coding in packet-oriented networks.

Basic requirements in the design of low bit rate coders are: 1) high quality of reconstructed speech or audio signals with robustness to variations in spectra and levels; 2) robustness to random and bursty channel bit errors and packet losses; 3) low complexity and power consumption of the coders, which are highly relevant. For example, the complexity of audio decoders should be low enough to support low-cost solutions because broadcast and playback applications play a dominant role in wideband audio. Additional network-related requirements are low encoder/decoder delays, robust tandeming of codecs, transcodeability, and a graceful degradation of quality with increasing bit error rates in mobile radio and broadcast applications. All these partly conflicting factors have to be carefully considered in selecting a wideband speech or audio coding algorithm for a given application.

Earlier proposals to reduce the PCM rates have followed those for narrowband speech coding. However, differences between audio and speech signals are manifold, since audio coding implies higher values of sampling rate, amplitude resolution, and dynamic range, larger variations in power density spectra, differences in human perception, and higher listener expectations of quality. Speech coding can be so efficient because speech signals have an underlying vocal tract production model, whereas for audio, in general, this is not the case.

There has been rapid progress in coding of speech and audio signals. Linear prediction, subband coding, transform coding, and various forms of vector quantization and entropy coding techniques have been used to design efficient coding algorithms that can achieve substantially more compression than was thought possible only a few years ago. Recent results indicate that good to excellent coding quality can or will soon be obtained with bit rates of 1 b/sample for speech and wideband speech, and 2 b/sample for audio. Rate reductions to 0.5 and 1 b/sample, respectively, can be expected over the next decade. These high reductions are achieved by employing perceptual coding techniques, so that only those details of the signal that are perceptible by ear will be transmitted. Applying knowledge of auditory perception leads to hearing-specific coders that perform remarkably well.

It is important to note that bit rate reduced digital representations of source signals can be much more robust to channel impairments than analog techniques if source and channel coding are implemented efficiently. In addition, with today's data compression and multilevel signaling techniques, one can actually reduce the bandwidth by going digital; bandwidth expansion is no longer the price to be paid for digital coding and transmission.

In this article, we will describe typical parameters of wideband speech and audio signals, including digitized versions of each; potential applications; and available transmission media. We will briefly introduce facts about human auditory perception that are exploited in audio coding and quality measures that play an important role in coder evaluations and designs. Then we will describe techniques of efficient coding of wideband speech and audio signals, with an emphasis on existing standards. The recent ISO/MPEG audio coding standard is covered in some detail, since it will be used in many application areas, including digital storage, transmission, and broadcasting of audio-only signals and audiovisual applications such as video-telephony, video-conferencing, and TV broadcasting. Finally, ongoing research and standardization work will be outlined.

PETER NOLL is a professor of telecommunications at the Technical University of Berlin and he chairs the Audio Subgroup within ISO/MPEG.

34    0163-6804/93/$03.00 © 1993 IEEE    IEEE Communications Magazine • November 1993


Signals and Signal Delivery

Signals

Telephone speech, wideband speech, and wideband audio signals differ not only in bandwidth and dynamic range, but also in listener expectation of offered quality. The conventional digital format is PCM, with typical sampling rates and amplitude resolutions (PCM bits per sample) as shown in Table 1.

Table 1. Typical values of basic parameters of three classes of acoustic signals.

Wideband Speech - Higher bandwidths than that of the 300 to 3400 Hz telephone bandwidth result in major subjective improvements in represented speech quality. A bandwidth of 50 to 7000 Hz not only improves the intelligibility and naturalness of speech, but also adds a feeling of transparent communication and eases speaker recognition. Applications of high relevance are loudspeaker telephony, ISDN conferencing systems, multipoint interactive audiovisual communications, and the use of commentary channels for broadcasting. In 1986, CCITT recommended a 64 kb/s wideband speech coder developed primarily for transmission over the ISDN basic rate (B) channel [1]. This G.722 wideband speech coder will be described later in this article. Current activities in wideband speech coding concentrate on coding at 32 kb/s and below, with the 64 kb/s CCITT standard serving as reference.

Audio - The compact disc (CD) has made digital audio popular; its 16 b PCM format is an accepted audio representation standard. In audio production, resolutions up to 24 b PCM are in use. On a compact disc, signals of 20 kHz bandwidth and 44.1 kHz sampling rate are stored with a resolution of 16 b/sample. Hence the resulting net bit rate is 44.1 x 16 = 705.6 kb/s per monophonic channel. A significant overhead is needed for a line code that maps 8 information bits into 14 bits, for synchronization, and for error correction. Bit error rates for clean disks after error correction are around 10^-11, but handling (fingerprints) may degrade this value. Table 2 lists some parameters of the CD and the digital audio tape (DAT).

Table 2. Formats of CD and DAT.

The Moving Pictures Expert Group within the International Organization of Standardization (ISO/MPEG) has been developing a series of audiovisual standards. Its recent audio coding standard is the first international standard in the field of high quality digital audio compression and is about to become a standard in many other application areas, both for consumer and professional audio [2, 3]. Decoder chips are already available; a first consumer product, Philips' Digital Compact Cassette (DCC), makes use of Layer I of the ISO/MPEG coder.

Typical application areas for digital audio are in the fields of audio production, program distribution and exchange, digital sound broadcasting (DSB)¹, and digital storage (archives, studios, consumer electronics). Digital audio is also useful for interpersonal communications such as video-conferencing and multimedia applications, and for enhanced quality TV systems.

Multichannel Stereophony - As a logical further step in digital audio, a universal loudspeaker reproduction standard must be defined to provide an improved stereophonic image for audio-only applications including teleconferencing and for improved television systems. Loudspeaker arrangements referred to as 3/2-stereo, with a left and a right channel (L and R), an additional center channel C, and two side/rear surround channels (Ls and Rs), improve the presentation performance, both in audio-only applications and in audio-with-picture reproduction, where directional imaging distortions, i.e., angular displacements between visual and auditory images, play a significant role. In particular, the three front loudspeakers ensure a sufficient directional stability and clarity of the frontal sound image (i.e., a stable middle and an enlarged listening area). An additional pair of surround loudspeakers² may be useful for collective viewing situations, e.g., HDTV-cinema, but a 3/2-stereo format is considered to be a good compromise both for production and transmission [4].

Examples of digital multichannel surround systems are the upcoming ISO/MPEG 3/2-stereo coding standard and Dolby's Stereo SR.D system based on its AC-3 audio coding algorithm. Both systems offer an additional optional low frequency effect (subwoofer) channel to reproduce frequencies below around 120 Hz with one or more loudspeakers, which can be positioned freely in the listening room. The overall bit rate for a 3/2-stereo system will possibly fit into the 384 kb/s H0 channel of the ISDN hierarchy (see below).

¹ The European term is Digital Audio Broadcasting (DAB).

² A 3/4-stereo format does not imply two additional signals; one or two surround signals may feed two or four side/rear loudspeakers.

Signal Delivery

Delivery of digital speech and audio signals is possible over terrestrial and satellite-based digital broadcast and transmission systems such as subscriber lines, program exchange links, cellular mobile radio networks, cable-TV networks, etc.

ISDN - With ISDN, customers have physical access to a number of relatively low-cost dial-up digital telecommunications channels. For the transmission of wideband speech and audio signals the basic-rate interface is of interest; it consists of two 64 kb/s B channels and one 16 kb/s D channel (which supports signaling but can also carry user information). The primary-rate interface is either a 23 B + D configuration (North America and Japan) or a 30 B + D configuration (Europe), where the D channels operate at 64 kb/s. Both configurations also support 384 kb/s H0 channels and various combinations. From these numbers it is clear that ISDN offers useful channels for a practical distribution of stereophonic and multichannel audio signals.

Figure 1. Threshold in quiet and masking threshold (after [6]). Acoustical events in the hatched areas will not be audible.

IVDLAN - The Integrated Voice/Data Local Area Network (IVDLAN) (IEEE 802.9) can cope with real time constraints. It provides a high bandwidth packet service (P channel) and a number of full-duplex isochronous digital channels (B, C, and D channels), similar to ISDN channels. There are two 64 kb/s B channels, a 16 kb/s or 64 kb/s packet channel, and an m x 64 kb/s broadband channel, similar to ISDN H channels.

DSB - Satellite-based or terrestrial digital audio broadcasting is a complex task, in particular if listeners use mobile and portable receivers. Multipath interference and selective fading are the main impairments to be expected. In addition, a broadcast network chain can include sections with different quality requirements, ranging from production quality, which supports editing, cutting, and postprocessing, to commentary grade quality, which is capable of delivering speech of excellent quality and musical program material at a reduced level of performance to the listener. The CCIR has made extensive tests to define audio coders to be used for DSB.
In its ongoing work CCIR is preparing a draft recommendation that the ISO/MPEG Layer II audio coding algorithm with an independent coding of the left and right information should be used for digital audio contribution links (which support exchange of programs) at 180 + 12 kb/s (the latter rate is for data), for distribution links (which transmit the sound to the emitters) at 120 + 8 kb/s, and for emission at 128 kb/s. The ISO/MPEG Layer III coder is recommended for commentary links at a rate of 60 + 4 kb/s.

Perception and Quality Measures

Perception

Auditory perception is based on critical band analysis in the inner ear, where a frequency-to-place transformation occurs along the basilar membrane. The power spectra are not represented on a linear frequency scale but on limited frequency bands called critical bands [5]. The auditory system can be described as a bandpass filter bank, consisting of strongly overlapping bandpass filters with bandwidths in the order of 100 Hz for signals below 500 Hz and up to 5000 Hz for signals at high frequencies. Up to 24,000 Hz, 26 critical bands have to be taken into account.

Simultaneous Masking - Simultaneous masking is a frequency domain phenomenon where a low-level signal, e.g., a pure tone (the maskee), can be made inaudible (masked) by a simultaneously occurring stronger signal (the masker), e.g., narrowband noise, if masker and maskee are close enough to each other in frequency [6]. A masking threshold can be measured below which any signal will not be audible. The masking threshold depends on the sound pressure level (SPL) and the frequency of the masker, and on the characteristics of masker and maskee. For example, with the masking threshold for the SPL = 60 dB masker in Fig. 1 at around 1 kHz, the SPL of the maskee can be surprisingly high; it will be masked as long as its SPL is below the masking threshold. The slope of the masking threshold is steeper towards lower frequencies, i.e., higher frequencies are more easily masked.
It should be noted that the distance between masking level and masking threshold is smaller in noise-masks-tone experiments than in tone-masks-noise experiments. Noise and low-level signal contributions are masked inside and outside the particular critical band if their SPL is below the masking threshold. Noise contributions can be coding noise, aliasing distortions, and transmission errors.

Without a masker, a signal is inaudible if its SPL is below the threshold of quiet, which depends on frequency and covers a dynamic range of more than 60 dB, as shown in the lower curve of Fig. 1.

The qualitative sketch of Fig. 2 gives more details about the masking threshold: the distance between the level of the masker (shown as a tone in Fig. 2) and the masking threshold is called signal-to-mask ratio (SMR). Its maximum value is at the left border of the critical band (point A). Within a critical band, coding noise will not be audible as long as its signal-to-noise ratio (SNR) is higher than its SMR. Let SNR(m) be the signal-to-noise ratio resulting from an m-bit quantization; the perceivable distortion in a given subband is then measured by the noise-to-mask ratio NMR(m) = SMR - SNR(m) (in dB). The noise-to-mask ratio NMR(m) describes the difference between the coding noise in a given subband and the level where a distortion may just become audible; its value in dB should be negative.
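
The relation NMR(m) = SMR - SNR(m) can be tried out numerically. The following Python sketch is only an illustration under a stated assumption: the article does not give a formula for SNR(m), so the common rule of thumb for a uniform quantizer driven by a full-scale sinusoid, SNR(m) = 6.02 m + 1.76 dB, is used here. The function names are hypothetical.

```python
def quantizer_snr_db(m: int) -> float:
    """Assumed SNR of an m-bit uniform quantizer (full-scale sine): 6.02*m + 1.76 dB."""
    return 6.02 * m + 1.76


def nmr_db(smr_db: float, m: int) -> float:
    """Noise-to-mask ratio NMR(m) = SMR - SNR(m), all values in dB."""
    return smr_db - quantizer_snr_db(m)


def min_bits_for_masking(smr_db: float, max_bits: int = 16) -> int:
    """Smallest m for which NMR(m) is negative, i.e. the coding noise is masked."""
    if smr_db < 0:
        # The signal itself lies below the masking threshold: nothing to code.
        return 0
    for m in range(1, max_bits + 1):
        if nmr_db(smr_db, m) < 0:
            return m
    raise ValueError("SMR too large to be masked with max_bits bits")


if __name__ == "__main__":
    for smr in (-3.0, 0.0, 20.0):
        m = min_bits_for_masking(smr)
        print(f"SMR = {smr:5.1f} dB -> {m} bit(s)")
```

Under this assumed SNR model, a subband with an SMR of 20 dB needs 4 bits before its NMR turns negative; a different quantizer model would shift the numbers but not the principle.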


We have just described masking by only one masker. If the source signal consists of many simultaneous maskers, a global masking threshold can be computed that describes the threshold of just noticeable distortions as a function of frequency. The calculation of the global masking threshold is based on the high resolution short-term amplitude spectrum of the audio or speech signal, sufficient for critical-band-based analyses, and is determined in audio coding via a 512- or 1024-point FFT. In a first step all individual masking thresholds are determined, depending on signal level, type of masker (noise or tone), and frequency range. Next, the global masking threshold is determined by adding all individual masking thresholds and the threshold in quiet. (Adding this latter threshold ensures that the computed global masking threshold is not below the threshold in quiet.) The effects of masking reaching over critical band bounds must be included in the calculation. Finally the global signal-to-mask ratio (SMR) is determined as the ratio of the maximum of the signal power and the global masking threshold (or as the difference of the corresponding levels in dB), as shown in Fig. 2.

Temporal Masking - In addition to simultaneous masking, two time domain phenomena also play an important role in human auditory perception: pre-masking and post-masking. The temporal masking effects occur before and after a masking signal has been switched on and off, respectively (Fig. 3). The duration when pre-masking applies is less than, or as newer results indicate, significantly less than, one-tenth that of the post-masking, which is in the order of 50 to 200 ms. Both pre- and post-masking are being exploited in the ISO/MPEG audio coding algorithm.

Perception-based Coding - In perception-based coders the encoding process is controlled by the global signal-to-mask ratio vs. frequency curve. If the necessary bit rate for a complete masking of distortions is available, the coding scheme will be transparent, i.e., the decoded signal is indistinguishable from the source signal. If the necessary bit rate for a complete masking of distortions is not available, the global masking threshold serves as a spectral error weighting function: the resulting error spectrum will have the shape of the global masking threshold.
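
One way to picture how an SMR curve can steer an encoder is a greedy bit allocation, sketched below in Python. This is not the procedure of any particular standard; it is a minimal illustration that assumes each added bit lowers the quantization noise of a band by about 6.02 dB and that a band with zero bits has NMR equal to its SMR. The function name and the example SMR values are hypothetical.

```python
def allocate_bits(smr_db, total_bits, snr_per_bit_db=6.02):
    """Greedy allocation: each bit goes to the band whose noise is least masked.

    smr_db         -- per-band signal-to-mask ratios in dB
    total_bits     -- bit budget for this block
    snr_per_bit_db -- assumed SNR gain per added bit (illustrative value)
    Returns (bits per band, resulting NMR per band in dB).
    """
    bits = [0] * len(smr_db)
    nmr = list(smr_db)  # assumption: with 0 bits, NMR equals SMR
    for _ in range(total_bits):
        worst = max(range(len(nmr)), key=lambda k: nmr[k])
        if nmr[worst] < 0:
            break  # every band is already masked: transparent coding
        bits[worst] += 1
        nmr[worst] -= snr_per_bit_db
    return bits, nmr


if __name__ == "__main__":
    smr = [20.0, 5.0, -3.0, 12.0]
    print(allocate_bits(smr, total_bits=8))  # enough bits: all NMR values negative
    print(allocate_bits(smr, total_bits=4))  # too few bits: some NMR stays positive
```

With a sufficient budget the loop stops early once all NMR values are negative, which corresponds to the transparent case. With too small a budget the remaining positive NMR values are spread over the bands with the largest SMR, so the noise is pushed towards the masking threshold rather than towards a flat spectrum.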

Figure 2. Masking threshold and SMR. Acoustical events in the hatched areas will not be audible.

In practical designs of perceptual coding we cannot go to the limits of masking or just noticeable distortion, since postprocessing of the acoustic signal (e.g., filtering in equalizers) by the end user and multiple encoding/decoding processes may demask the noise. In addition, since our current knowledge about auditory masking is very limited, the hearing model used in the design of a particular perception-based audio coder may not be accurate enough. Therefore, as an additional requirement, we need a sufficient safety margin in practical designs of coders.

Quality Measures

Digital representations of analog waveforms cause the introduction of some kind of distortions that can be specified by subjective criteria, such as the mean opinion score (MOS) as a measure of perceptual similarity; by simple objective criteria, such as the signal-to-noise ratio as a measure of the waveform similarity between source and reconstructed signal; or by complex criteria serving as objective measures of perceptual similarity, which take into account facts about the human auditory perception.

Figure 3. Temporal masking (after [6]). Acoustical events in the hatched area will not be audible.


Figure 4. Structure of CCITT G.722 wideband speech coder.

The most popular subjective assessment method is the mean opinion scoring, where subjects classify the quality of coders on an N-point quality scale. The final result of such tests is an averaged judgement, the MOS. Two 5-point adjectival grading scales are in use, one for signal quality and the other for signal impairment, with an associated numbering [7]. In the 5-point CCIR impairment scale an MOS value of 5 refers to an imperceptible impairment, whereas an MOS value of 4 is defined by a perceptible, but not annoying impairment, etc. The impairment scale is extremely useful if coders with only small impairments have to be graded. The ISO/MPEG tests have shown that triple stimulus/hidden reference/double blind tests, based on such MOS evaluations, can lead to very reliable results; in addition, small differences in quality become detectable. In these tests the subject is offered three signals, A, B, and C (triple stimulus). A is always the unprocessed source signal (the reference). B and C, or C and B, are the reference and the system under test (hidden reference). The selection is known neither to the subjects nor to the conductor(s) of the test (double blind test). The subjects have to decide if B or C is the reference and have to grade the remaining one.

The advantage of MOS values is that different impairment factors can be assessed simultaneously, and that even small impairments can be graded. However, on the negative side, MOS values vary with time and from listener panel to listener panel, and it seems to be very difficult to duplicate test results at a different test site. In the case of audio signals, MOS values also depend strongly on the selected test items, and there are significant differences between MOS values obtained with average audio material and those obtained with most critical test items. Finally, quality scale and impairment scale MOS values are not comparable. For all these reasons, care is needed when comparing results between different experiments.
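
The bookkeeping behind such a test can be sketched in a few lines of Python. The sketch below is illustrative only: the scoring rule for a listener who mistakes the coder for the reference (the coder is credited with grade 5) is an assumption of this example, not something the article specifies, and the function names and trial data are hypothetical.

```python
from statistics import mean, stdev

# CCIR 5-point impairment scale; grades 5 and 4 are quoted in the text above.
IMPAIRMENT_SCALE = {
    5: "imperceptible",
    4: "perceptible, but not annoying",
    3: "slightly annoying",
    2: "annoying",
    1: "very annoying",
}


def coder_grade(coder_slot: str, picked_reference_slot: str, grade_given: int) -> int:
    """Grade credited to the coder under test in one triple-stimulus trial.

    coder_slot            -- "B" or "C": where the coder was hidden
    picked_reference_slot -- the slot the listener took for the reference
    grade_given           -- grade (1..5) the listener gave to the other slot
    """
    if grade_given not in IMPAIRMENT_SCALE:
        raise ValueError("grade must be an integer from 1 to 5")
    if picked_reference_slot == coder_slot:
        # Assumption of this sketch: the coder was taken for the reference,
        # so its impairment is counted as imperceptible.
        return 5
    return grade_given


def mean_opinion_score(grades):
    """Return (MOS, sample standard deviation) of a list of grades."""
    return mean(grades), stdev(grades)


if __name__ == "__main__":
    trials = [("B", "C", 4), ("C", "C", 3), ("B", "C", 5), ("C", "B", 4)]
    grades = [coder_grade(*t) for t in trials]
    print(grades, mean_opinion_score(grades))
```

Reporting the standard deviation next to the mean makes the panel-to-panel variability mentioned above visible when results from different experiments are compared.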

On the other hand, it should be pointed out that the MPEG and CCIR listening tests, carried out under very similar and carefully defined conditions with experienced listeners, have shown very similar and stable evaluation results.

Objective assessments should correlate with human perception for all distortions likely to be found in the coding algorithm and transmission or storage system. Perception-based measures make use of masking thresholds derived from the input signal in order to compare them with the actual coding noise of the coder. Recent results have shown that such measures can give high correlations between subjective MOS scores and objective scores. For example, the perceptual audio quality measure has been applied to audio signals in the CCIR Digital Sound Broadcasting tests and has given a correlation of 0.98 with a standard deviation of 0.17 [8]. Another set of parameters, including local noise-to-mask ratios and averages over all critical bands, has proven to be easily implementable and to be accurate enough to be useful in coder design and evaluation [9]. In the CCIR audio coding tests the correlation was 0.94 with a standard deviation of 0.27.

Wideband Speech Coding

The CCITT Standard

Speech coding with a bandwidth wider than that offered in telephony results in major improvements in represented speech quality [1, 10]. The CCITT G.722 wideband speech coding algorithm supports bit rates of 64, 56, and 48 kb/s. The codec can be integrated on one chip and its overall delay is around 3 ms, small enough to cause no echo problems in telecommunication networks.

In the CCITT wideband speech coder a subband splitting, based on two identical quadrature mirror (bandpass) filters (QMF), divides the 16 kHz sampled 14 b PCM representation of the wideband input signal into two critically subsampled (8 kHz sampled) components, called low subband and high subband (Fig. 4). The filters overlap, and aliasing will occur because of the subsampling of each of the components; the synthesis QMF filterbank at the receiver ensures that aliasing products are canceled. However, quantization error components on the two subbands will not be eliminated. Therefore, 24-tap QMF filters with a stop-band attenuation of 60 dB are employed.

The coding of the subband signals is based on a modified version of the 32 kb/s CCITT G.721 ADPCM speech coder. Input samples are adaptively predicted; the prediction error (or difference) signal is quantized and transmitted. The predictor is backward-adaptive, i.e., the predictor coefficients are updated sample-wise under the control of the already coded difference signal that is also available at the decoder. The predictor uses a pole-zero structure with six zeroes and two poles. It combines good prediction gain (with an equivalent gain in overall SNR) and simple stability control. The quantizer is also backward-adaptive and can rapidly adapt itself to the changing statistics of speech signals. After transmission errors, both predictor and quantizer converge (in the long term) to identical values once no more transmission errors are observed.

High quality coding with the G.722 wideband speech coder is provided by a fixed bit allocation, where the low and high subband ADPCM coders use a 6 b/sample and a 2 b/sample quantizer, respectively. In the low subband the signal resembles the narrowband speech signal in most of its properties.
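
The two-band split and the rate arithmetic of the fixed bit allocation can be illustrated with a small Python sketch. It rests on simplifying assumptions: a 2-tap Haar filter pair stands in for the 24-tap QMF filters of the standard, no ADPCM quantization is applied, and the 56 and 48 kb/s rates quoted earlier are assumed to come from using 5 or 4 bits in the low subband. All function names are hypothetical.

```python
import math


def qmf_analysis(x):
    """Split an even-length signal into two critically subsampled bands.

    Uses the 2-tap Haar pair as a stand-in for longer QMF filters.
    """
    s = 1 / math.sqrt(2)
    low = [s * (x[i] + x[i + 1]) for i in range(0, len(x), 2)]
    high = [s * (x[i] - x[i + 1]) for i in range(0, len(x), 2)]
    return low, high


def qmf_synthesis(low, high):
    """Recombine the two bands; without quantization the input is recovered."""
    s = 1 / math.sqrt(2)
    x = []
    for lo, hi in zip(low, high):
        x.extend((s * (lo + hi), s * (lo - hi)))
    return x


def g722_bit_rate_kbps(low_bits, high_bits=2, subband_rate_khz=8):
    """Bit rate for two 8 kHz subbands with the given bits per sample."""
    return subband_rate_khz * (low_bits + high_bits)


# Demo: 16 samples of a sinusoid, split and reconstructed.
x = [math.sin(2 * math.pi * 0.05 * n) for n in range(16)]
low, high = qmf_analysis(x)
y = qmf_synthesis(low, high)
max_error = max(abs(a - b) for a, b in zip(x, y))
rates = [g722_bit_rate_kbps(b) for b in (6, 5, 4)]

if __name__ == "__main__":
    print(len(low), len(high), max_error)
    print(rates)
```

Each band holds half as many samples as the input, so the total sample count is unchanged (critical subsampling), and the unquantized bands reconstruct the input up to rounding error. The rate function shows that 6 + 2 bits at 8 kHz give 64 kb/s.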
A reduction of the quantizer resolution to 5 or 4 b/sample is possible to support transmission at lower rates or of auxiliary data at rates of 8 and 16 kb/s. Embedded encoding is used in the low subband ADPCM coding, i.e., the adaptations of predictor and quantizer are always based only on the four most significant bits of each ADPCM codeword. Hence a stripping of one or two least significant bits from the ADPCM codewords does not affect the adaptation processes and cannot lead to mistracking effects otherwise caused by different decoding processes in transmitter and receiver.

Averaged MOS values from tests conducted worldwide in seven laboratories, with loudspeaker presentation of wideband speech with seven different languages, are shown in Table 3. The uncoded source has the same MOS rating as the 64 kb/s version, implying no measurable difference in subjective quality. We notice a graceful degradation of subjective quality with decreasing bit rate and increasing bit error rate. The coder is error-robust at the two rates it had been tested for; its subjective performance at a bit error rate of 0.001 still has a fair rating, close to the score of 64 kb/s PCM.

G.722 56 kb/s    4.3    3.7
64 kb/s          4.0

Table 3. MOS values of CCITT G.722 wideband speech coder [1]. The MOS scores result from loudspeaker presentations.

Beyond the Wideband Speech Coding Standard

With the introduction of low-cost video telephony over narrowband ISDN, bit rates lower than 48 kb/s, the minimum bit rate of G.722, are needed for wideband speech. One approach has been to use 32 kb/s adaptive transform coding with Huffman coding, perceptually-based noise shaping, and dynamic bit allocation [11]. In another approach a speech analysis-by-synthesis technique has been followed: a low-delay 32 kb/s code-excited linear predictive (CELP) coder, similar to the recent CCITT G.728 coder for narrowband speech but with an LPC filter order of only 32 (instead of 50), has shown an average rating in subjective tests essentially equal to that of the 64 kb/s G.722 coder, suggesting that its MOS value is above 4.0 [12]. Recently most efforts have gone into reducing the bit rate even further to 16 kb/s and below, at the expense of higher delays. The values in Table 4 result from extensive subjective tests for loudspeaker listening (the values for high quality handset listening are slightly lower, 0.3 points on the average) [13]. The results show that male voices obtain a subjective score at a rate of 16 kb/s which is close to that of the CCITT G.722 performance at 56 kb/s, whereas the performance with female voices is below the CCITT G.722 performance at 48 kb/s. Finally, it should be noted that MPEG, in its current phase 2, has also addressed coding at lower sampling rates (see below).

G.722 48 kb/s    3.7    3.7    3.7
G.722 64 kb/s    4.0    4.1    4.1

Table 4. MOS values of 16 kb/s wideband speech coders [13].

Coding of Audio Signals

First steps to reduce audio bit rates have been based on techniques of instantaneous companding (e.g., a conversion of uniform 14 bit PCM into an 11 bit nonuniform PCM representation), and on various forms of block-companding such as 16 to 14 bit scaling in digital satellite broadcasting systems (CCITT Rec. J.41, J.42). The BBC has used the near-instantaneously companded audio multiplex (NICAM) technique for the transmission of sound in broadcast television networks. Such coders provide a sufficient dynamic range for audio coding, but they do not reduce bit rates efficiently since they exploit neither statistical dependencies between samples nor auditory masking effects. Bit rate reductions by fairly simple means are achieved in the interactive CD (CD-I), which supports 16 bit PCM at a sampling rate of 44.1 kHz and allows for three levels of Adaptive Differential PCM (ADPCM) with resolutions of 37.8 kHz/8 bit, 37.8 kHz/4 bit, and 18.9 kHz/4 bit.

A good to excellent audio coding performance has been obtained more recently with various frequency domain coders, both in the classes of subband coding (SBC) and adaptive transform coding (ATC). The differences between these proposed


coders are in the number of spectral components and in the strategies for an efficient quantization of spectral components and masking of the resulting coding errors. Frequency-domain coding offers a more direct way than predictive coding for noise shaping and suppression of frequency components that need not be transmitted. In these coders the source spectrum is split into frequency bands, and each frequency component is quantized separately. Therefore, the quantization noise associated with a particular band is contained within that band. The number of bits used to encode each frequency component varies: components that are subjectively more important are quantized more finely, while components that are subjectively less important have fewer bits allocated, or may not be encoded at all. A dynamic bit allocation has to be employed that is controlled by the spectral short-term envelope of the source signal, and therefore bit allocation information has to be transmitted to the decoder efficiently as side information.

One frequency domain coder that has been applied successfully to coding of audio signals is the already described subband-based CCITT G.722 wideband speech coder. It provides a good to fair quality and has been used with sampling rates of both 16 and 32 kHz [1, 14].

More recently, transform-based audio coding schemes have been proposed and tested. One example is Dolby's 128 kb/s AC-2 coder [15], a modified version of which has been evaluated in the CCIR process of digital audio broadcast standardization and has been shown to be close in performance to the ISO/MPEG Layer II audio coding algorithm at its fixed bit rate.
A second example is AT&T's Perceptual Audio Coder (PAC) that extends the idea of perceptual coding to stereo pairs. It uses both L/R (left/right) and M/S (sum/difference) coding, switched in both frequency and time in a signal dependent fashion [16]. Subjective tests have shown an average increase in MOS score of 0.6 over the dual monophonic mode for the same bit rate.

Sony's Adaptive Transform Acoustic Coder (ATRAC) has been developed for portable digital audio, specifically for Sony's magneto-optical MiniDisc (MD) [17]. The coder uses a hybrid frequency mapping employing a signal splitting into three subbands (0-5.5, 5.5-11.0, and 11.0-22.0 kHz) followed by suitable dynamically windowed MDCT transforms (such a technique, described below, is also applied in Layer III of the ISO/MPEG Audio coder).

ISO/MPEG Standardization

The recent ISO/MPEG Audio coding standard combines features of MUSICAM [18] and ASPEC [19]. Its main features will be described in the following, leaving out such important issues as framing, synchronization, error protection, etc. The standardization process included extensive subjective tests and objective evaluations of parameters such as complexity and overall delay. The total number of subjects (expert listeners) was around 60, approximately ten test sequences were used, and the sessions were performed in stereo with both loudspeakers and headphones [20, 21]. It should be noted that critical test items were chosen in the tests to evaluate the coders by their worst case (not average) performance.

The Basics of the ISO/MPEG Audio Coder

Operating Modes - The ISO/MPEG Audio coding standard consists of three layers (codes) of increasing complexity and improving subjective performance. It supports sampling rates of 32, 44.1, and 48 kHz, and bit rates per monophonic channel between 32 and 192 kb/s, or per stereophonic channel between 128 and 384 kb/s. The standard offers the single channel mode, the stereo mode, the dual channel mode (to provide bilingual audio programs), and the optional joint stereo mode. In this latter mode the two coders for the left and right channel can support each other by exploiting statistical dependencies and irrelevancies between these channels to compress the audio bit rate to an even higher degree than is possible in monophonic transmission. A so-called intensity stereo mode is optional in the ISO/MPEG standard; if it is used, it will only be effective if the required bit rate exceeds the available bit rate, and it will only be applied to subbands corresponding to high frequencies.

Frequency Mapping - In the encoder the 16-bit PCM format audio signal is windowed and converted into spectral subband components via a polyphase filterbank consisting of 32 equally spaced bandpass filters (Fig. 5). Such filterbanks perfectly cancel the aliasing of adjacent overlapping bands in the absence of quantization errors; they are computationally very efficient, since an FFT can be used in the filtering process; and they are of moderate complexity and low delay.
On the negative side, the filters are equally spaced, and therefore the frequency bands do not correspond well to the critical bands at low frequencies. The filters used are of order 511, which implies an impulse response of 5.33 ms length (at 48 kHz); they are designed for a high sidelobe attenuation exceeding 96 dB that is necessary for sufficient cancellation of aliasing distortion caused by quantization noise. The shape of the filter impulse response supports temporal masking of pre-echoes in case of an attack signal (as described below).

The filtered bandpass output signals are critically subsampled (decimated), i.e., they are sampled at a rate that is twice the nominal bandwidth of the bandpass filters. At a 48 kHz sampling rate, each band has a width of 750 Hz and the sampling rate of each decimated subband is 48/32 = 1.5 kHz. Therefore we have as many frequency domain samples per second as time domain samples. In the receiver, the sampling rate of each subband is increased to that of the source signal by filling in the appropriate number of zero samples, and interpolated subband signals appear at the bandpass outputs of the synthesis filter bank.

In Layer III a higher frequency resolution closer to critical band partitions is achieved by subdividing the 32 subband signals further in frequency content by applying a 6-point or 18-point modified Discrete Cosine Transform (MDCT) with 50 percent overlap to each of the subbands (Fig. 7).
The maximum number of frequency components in Layer III is therefore 32 x 18 = 576, each representing a bandwidth of 24000/576 = 41.67 Hz.
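The figures quoted for the filterbank can be checked with a few lines of arithmetic. The following Python sketch is only an illustration: the function name and dictionary keys are hypothetical, and the defaults (48 kHz sampling, 32 subbands, 18-point MDCT) are the values given in the text.

```python
def filterbank_figures(fs_hz=48_000.0, n_subbands=32, mdct_points=18):
    """Reproduce the subband and hybrid-filterbank figures quoted in the text.

    All names here are illustrative; only the numeric defaults come from the article.
    """
    nyquist_hz = fs_hz / 2.0                      # usable audio bandwidth
    subband_width_hz = nyquist_hz / n_subbands    # equally spaced bands
    decimated_rate_hz = fs_hz / n_subbands        # critical subsampling: twice the bandwidth
    n_components = n_subbands * mdct_points       # Layer III spectral lines
    component_width_hz = nyquist_hz / n_components
    return {
        "subband_width_hz": subband_width_hz,
        "decimated_rate_hz": decimated_rate_hz,
        "n_components": n_components,
        "component_width_hz": component_width_hz,
    }


figures = filterbank_figures()
print(figures)
```

Because the decimated rate times the number of subbands equals the input sampling rate, the sketch also illustrates why the number of frequency domain samples per second equals the number of time domain samples.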

IEEE Communications Magazine, November 1993


Figure 5. Block structure of ISO/MPEG Audio encoder and decoder, Layers I and II.

Quantization and Bit Allocation - In each of the 27 lowest subbands used in the encoding, blocks of 12 decimated samples are formed and block-companded, i.e., divided by a scalefactor such that the sample of largest magnitude is unity. Each block corresponds to 12 x 32 = 384 input samples, which corresponds to 8 ms of audio at a sampling rate of 48 kHz. The choice of the block length is affected by two conflicting requirements: on one hand, longer blocks reduce the side information bit rate, and on the other hand pre-masking is only effective with short blocks (see below).

Each spectral component is quantized, whereby the number of quantizer levels for each component is obtained from a dynamic bit allocation rule that is controlled by a psychoacoustic model. The model computes the signal-to-mask ratio (SMR), derived from the global masking threshold, for each 12-sample block via an FFT (Fig. 2). The bit allocation algorithm then selects one uniform midtread quantizer out of a set of available quantizers such that both the bit rate requirement and the masking requirement are met. The procedure starts with the number of bits for the samples and scalefactors set to zero. In each iteration step the quantizer signal-to-noise ratio SNR(m) is increased for that subband quantizer which contributes most to an improved performance. For that purpose, the noise-to-mask ratio NMR(m) = SMR - SNR(m) is calculated as the difference (in dB) between the actual quantization noise level and the minimum global masking threshold (Fig. 2).
The quantized spectral subband components are then transmitted to the receiver together with scalefactor and bit allocation information. Note that the psychoacoustic model is only needed in the encoder, which makes the decoder less complex, a desirable feature for audio playback and audio broadcasting applications. Two psychoacoustic models, both based on auditory masking as described above, are given in the informative part of the standard; better or simpler models may be used instead.
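The iterative allocation rule described above can be sketched as a greedy loop. This is a simplified illustration, not the procedure of the standard: it assumes that each additional bit raises the quantizer SNR by about 6.02 dB, counts the budget in units of one bit per sample per subband, and ignores the cost of scalefactors and the fact that the actual quantizer set skips some word lengths. The function name and parameters are hypothetical.

```python
def allocate_bits(smr_db, bit_budget, max_bits=15, db_per_bit=6.02):
    """Greedy bit allocation driven by the noise-to-mask ratio (illustrative sketch).

    smr_db      -- signal-to-mask ratio per subband, in dB
    bit_budget  -- number of one-bit increments that may be handed out (assumption)
    db_per_bit  -- assumed SNR gain per added bit (uniform quantizer rule of thumb)
    """
    bits = [0] * len(smr_db)
    for _ in range(bit_budget):
        # NMR(m) = SMR - SNR(m); saturated subbands are excluded from the search
        nmr = [
            smr - db_per_bit * b if b < max_bits else float("-inf")
            for smr, b in zip(smr_db, bits)
        ]
        worst = max(range(len(nmr)), key=lambda m: nmr[m])
        if nmr[worst] <= 0.0:
            break  # quantization noise is at or below the mask in every subband
        bits[worst] += 1  # improve the subband whose noise is most audible
    return bits


print(allocate_bits([20.0, 5.0, -3.0, 12.0], bit_budget=6))  # [3, 1, 0, 2]
```

A subband whose SMR is negative never receives bits in this sketch, which corresponds to components that need not be transmitted at all.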

Pre-Echo Control - A crucial part of frequency domain coding of audio signals is the appearance of pre-echoes. Consider the case that a silent period is followed by a percussive sound, such as from castanets or triangles, within the same coding block. Such an attack will cause comparatively large instantaneous quantization errors. In ATC, the inverse transform of the decoder will distribute such errors over the block; similarly, in SBC, the decoder bandpass filters will spread such errors. In both mappings, pre-echoes occur and can become distinctly audible, especially at low bit rates with comparatively higher error contributions. By the time domain effect of pre-masking, pre-echoes can be masked if the time spread is of short length (in the order of a few milliseconds).

ISO/MPEG Layers

Layer I - Fig. 5 has already shown the block structure of the ISO/MPEG Audio encoder and decoder for Layers I and II. The Layer I coder uses fixed subband blocks containing 12 decimated samples. Each scalefactor is represented by 6 b and is transmitted for each subband block unless the bit allocation rule indicates that the subband block and its scalefactor need not be transmitted at all. For each 12-sample block the SMR is calculated via a 512-point FFT. For each subband the bit allocation selects one uniform midtread quantizer out of a set of 15 quantizers with M = 2^m - 1 levels (m = 0 or m = 2 ... 15 b).
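Block companding followed by a uniform midtread quantizer with M = 2^m - 1 levels can be sketched as follows. This is a minimal illustration under stated assumptions: the scalefactor is kept as an unquantized float (the standard represents it with 6 b), and the function names are hypothetical.

```python
def quantize_block(samples, m):
    """Block-compand a subband block and quantize it with 2**m - 1 midtread levels.

    Returns the scalefactor (largest magnitude in the block) and integer indices.
    With m == 0 nothing is transmitted, so all indices are zero.
    """
    scalefactor = max(abs(x) for x in samples)
    if m == 0 or scalefactor == 0.0:
        return scalefactor, [0] * len(samples)
    half = (2 ** m - 1) // 2  # indices run from -half to +half, zero included
    indices = [round(x / scalefactor * half) for x in samples]
    return scalefactor, indices


def dequantize_block(scalefactor, indices, m):
    """Decoder side: map indices back to amplitudes and undo the companding."""
    if m == 0:
        return [0.0] * len(indices)
    half = (2 ** m - 1) // 2
    return [scalefactor * i / half for i in indices]


sf, idx = quantize_block([0.0, 0.8, -2.0, 0.4], m=3)
print(sf, idx)                        # 2.0 [0, 1, -3, 1]
print(dequantize_block(sf, idx, m=3))
```

The odd number of levels is what makes the quantizer midtread: zero is itself a reconstruction level, so small samples in a quiet block decode to exact silence.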
4 b are needed per block for the bit allocation information.

The decoding is straightforward: the subband sequences are reconstructed on the basis of 12-sample subband blocks, taking into account the decoded scalefactor and bit allocation information. Each time the subband samples of all 32 subbands have been calculated, they are applied to the synthesis filterbank, which also includes interpolation and windowing operations, and 32 consecutive 16 b PCM format audio samples are


calculated.

In the ISO/MPEG subjective tests this Layer I codec had a mean MOS value (over 10 test items) of around 4.7 at a rate of 192 kb/s per monophonic channel, with a worst-case mean value for one item still above 4.4.

Figure 6. MOS results of ISO/MPEG Audio Coder Layer II at a rate of 128 kb/s per monophonic channel. The MOS values are shown as bar plots, with three lines on the top. The middle one indicates the mean value, the two remaining ones represent the mean plus/minus the 95% confidence level. Subjective tests included 10 test items and 58 subjects. The item ALL is the average over all assessments. All values are averages over loudspeaker and headphone presentations [20].

Layer II - The ISO/MPEG Audio Layer II coder is basically similar to the Layer I coder but has a higher complexity and achieves a better performance owing to three modifications. First, the input to the psychoacoustic model is a 1024-point FFT, leading to a finer frequency resolution for the calculation of the global signal-to-mask ratio. Second, the overall scalefactor side information is reduced by a factor of around 2: in each subband, blocks of 12 samples are formed and scalefactors of three adjacent 12-sample blocks are calculated (which implies that 3 x 12 x 32 = 1152 input samples are taken into account). Depending on their relative values only one, two, or all three scalefactors are transmitted. Only one scalefactor has to be transmitted if the differences are relatively small, and only the first one of two adjacent scalefactors has to be transmitted if the second one has a smaller value, such that post-masking can be exploited. In case of large dynamic changes, all scalefactors may have to be used. The selected scalefactor or scalefactors are again represented by six bits each.
The pattern of the transmitted scalefactors will be coded as 2 b/subband side information called scalefactor select information. Third, a finer quantization with up to 16 b amplitude resolution is provided (which reduces the coding noise). On the other hand, the number of available quantizers decreases with increasing subband index (which keeps the side information small). The decoding follows that of Layer I. Due to the scalefactor selection process, the descaling has to be based on 3 x 12 = 36 subband samples, hence introducing additional delay. The total delay (without processing delay) of the Layer II codec is 45 ms at 48 kHz sampling rate.

Fig. 6 shows MOS values of the Layer II codec, as measured in the ISO/MPEG subjective tests, at a rate of 128 kb/s per monophonic channel. The mean MOS value (over all items) is around 4.8, the worst-case mean value is around 4.6.
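The scalefactor side information of Layers I and II can be compared per subband over the same 36 subband samples. The sketch below only counts bits from the figures given in the text (6 b per scalefactor, 2 b of scalefactor select information); the function names are hypothetical, and the "factor of around 2" is an average over real signals that this per-case count does not model.

```python
def scalefactor_bits_layer1(n_blocks=3, bits_per_scalefactor=6):
    """Layer I sends one 6 b scalefactor for every 12-sample subband block."""
    return n_blocks * bits_per_scalefactor


def scalefactor_bits_layer2(n_transmitted, bits_per_scalefactor=6, select_bits=2):
    """Layer II sends one, two, or three scalefactors plus 2 b of select information."""
    if n_transmitted not in (1, 2, 3):
        raise ValueError("Layer II transmits one, two, or three scalefactors")
    return n_transmitted * bits_per_scalefactor + select_bits


def frame_duration_ms(n_blocks=3, block_len=12, n_subbands=32, fs_hz=48_000.0):
    """Duration of the input samples covered by three adjacent subband blocks."""
    return 1000.0 * n_blocks * block_len * n_subbands / fs_hz


print(scalefactor_bits_layer1())                          # 18
print([scalefactor_bits_layer2(k) for k in (1, 2, 3)])    # [8, 14, 20]
print(frame_duration_ms())                                # 24.0
```

In the best case (one scalefactor for all three blocks) the side information drops from 18 to 8 bits per subband; when all three are needed, Layer II pays 2 bits more than Layer I.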

Layer III - Figure 7 shows the block structure of the ISO/MPEG Layer III Audio coder, which introduces many new features that are not part of Layers I and II. The coder achieves a better performance, especially at low bit rates (64 kb/s per monophonic channel), due to an improved time-to-frequency mapping, an analysis-by-synthesis approach for the noise allocation, an advanced pre-echo control, and finally nonuniform quantization with entropy coding. A higher frequency resolution is achieved by employing a hybrid filterbank, a cascade of the polyphase filterbank and a dynamically windowed MDCT transform. The dynamic window switching [14, 22] allows switching from a higher frequency resolution (18-point MDCT corresponding to 12 ms of audio) to a lower frequency resolution (6-point MDCT corresponding to 4 ms of audio) for subbands above a chosen index when a higher time resolution is necessary in order to control time artifacts (pre-echoes) during nonstationary periods of the signal. In principle, a pre-echo is assumed when an instantaneous demand for a high number of bits occurs.

The MDCT output samples are nonuniformly quantized, thus providing both smaller mean-squared errors and masking, i.e., errors can be larger if the samples to be quantized are large. Huffman coding based on 32 tabulated code tables is applied to represent the quantizer indices in an efficient way. In addition, run-length coding of zero value sequences increases the efficiency. A buffer maps the variable wordlength codewords of the Huffman code tables into a constant bit rate. A buffer feedback is used to prevent the buffer from overflow.
In order to keep the quantization noise in all critical bands below the global masking threshold (noise allocation), an iterative analysis-by-synthesis method is employed whereby the process of scaling, bit allocation, quantization, and coding of spectral data is carried out within two nested iteration loops.
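The time and frequency resolutions traded off by the Layer III window switching follow directly from the MDCT length. The sketch below is a hedged illustration with hypothetical function names; it assumes the 32-band polyphase filterbank and 48 kHz sampling rate stated in the text.

```python
def mdct_block_duration_ms(mdct_points, n_subbands=32, fs_hz=48_000.0):
    """Audio duration covered by one MDCT block applied to the decimated subbands."""
    return 1000.0 * mdct_points * n_subbands / fs_hz


def mdct_line_width_hz(mdct_points, n_subbands=32, fs_hz=48_000.0):
    """Bandwidth represented by one spectral line of the hybrid filterbank."""
    return (fs_hz / 2.0) / (n_subbands * mdct_points)


for points in (18, 6):
    print(points, mdct_block_duration_ms(points), mdct_line_width_hz(points))
```

Switching from the 18-point to the 6-point transform shortens the block by a factor of three, which limits the time spread of quantization errors before an attack at the cost of spectral lines that are three times wider.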

The decoding follows that of the encoding process. At a rate of 64 kb/s per monophonic channel the mean MOS values (over all items) for Layers II and III, as measured in ISO/MPEG subjective tests, are around 3.1 and 3.7, respectively. Obviously the higher complexity of the Layer III


coder pays off at low bit rates. At a 128 kb/s joint stereo bit rate seven of eight test items had a MOS value of 4 and above.

ISO/MPEG Phase 2

The ISO/MPEG standardization process is now in its Phase 2; an ISO committee draft will be completed in November 1993. Emphasis in the audio coding part of the new activity is on multichannel and multilingual audio and on an extension of the existing standard to lower sampling frequencies (16, 22.05, 24 kHz) and lower bit rates. The work on 3/2-stereo (multichannel coding) is carried out in close cooperation with CCIR. Backwards compatibility to the two-channel (2/0) MPEG Phase 1 Audio will be taken into account, so that a Phase 1 decoder will deliver a correct front left and right channel downmix from the multichannel bitstream. In addition, a downmix to 3/0- and 2/2-stereo formats, and even to a 1/0-monophonic channel, should be possible.

Beyond the Existing Audio Coding Standard

The ISO/MPEG standard is the first standard in audio coding (besides the quasi-standard of the CD). It is worthwhile to note that its normative part describes the decoder and the meaning of the encoded bitstream, but that the encoder is not defined, thus leaving room for an evolutionary improvement of the encoder. In particular, different psychoacoustic models can be used, ranging from very simple ones (including none at all) to very complex ones, based on quality and implementability requirements (the standard gives two examples of such models). Therefore we will see in the future different solutions for encoding. In addition, a better understanding of binaural perception and of stereo and multichannel

presentation will lead to new proposals for a better use of the joint stereo mode provided by the standard.

Several activities are already underway in high quality coding at lower bit rates. Phase 2 and, in particular, the future Phase 4 of the ISO/MPEG standardization process address such coding at low and very low bit rates. A very attractive application is the transmission of a stereo audio signal over a 64 kb/s basic rate channel, or even over transmission channels within the CCITT fast modem project, which will support digital transmission at bit rates up to 24 kb/s over the existing analog telephone network.

We can also expect further activities in the field of digital multichannel surround systems. Ongoing research will result in enhanced stereophonic representations by making use of interchannel correlations and interchannel masking effects; stereo-irrelevant components of the multichannel signal may be identified and reproduced in a monophonic format to bring the bit rates further down. In addition, the potential of such systems to deliver natural three-dimensional spatial acoustical images will lead to new proposals. We can also expect solutions for special presentations for people with impairments of hearing or vision that can make use of the multichannel configurations in various ways.

Conclusions

Powerful algorithms and standards for an efficient coding of wideband speech and audio signals are now available to enhance the quality of services in communication-based and storage-based audio-only and audiovisual applications. In particular, the ISO/MPEG audio coding standard


(Layer II) achieves subjective quality comparable to that of the CD at a rate of 256 kb/s for a stereophonic signal with an independent coding of the left and right channel. Digital sound broadcasting will make use of this Layer II standard for the distribution and emission of the audio material. At 128 kb/s, the Layer III standard provides the best audio quality for a stereophonic signal, with a quality that is quite close to that of the CD for many, but not all, audio test signals. Much lower bit rates will be covered by the recently inaugurated ISO/MPEG Phase 4, targeted at very low bit rates of tens of kb/s and below for the coding of audiovisual information. New algorithms, rather than extensions of existing algorithms, will be needed to meet this ambitious goal.

Acknowledgment

The author would like to thank the following people for their invaluable contributions to an earlier version of this paper: B. Bochow, K. Brandenburg, H. Fuchs, N.S. Jayant, L. van de Kerkhof, A. Sugiyama.

References

[1] P. Mermelstein, "G.722, A New CCITT Coding Standard for Digital Transmission of Wideband Audio Signals," IEEE Commun. Mag., pp. 8-15, Jan. 1988.
[2] ISO/IEC JTC1/SC29, "Information Technology - Coding of Moving Pictures and Associated Audio for Digital Storage Media at up to About 1.5 Mbit/s - IS 11172 (Part 3, Audio)."
[3] K. Brandenburg and G. Stoll, "The ISO/MPEG-Audio Codec: A Generic Standard for Coding of High Quality Digital Audio," 92nd AES Convention, Vienna, March 1992, Preprint 3336.
[4] G. Theile, "The New Sound Format '3/2-Stereo'," 94th Audio Engineering Society Convention, Berlin, 1993.
[5] B. Scharf, "Critical Bands," in "Foundations of Modern Auditory Theory," New York, pp. 159-202.
[6] E. Zwicker and R. Feldtkeller, Das Ohr als Nachrichtenempfänger. Stuttgart: S. Hirzel Verlag, 1967.
[7] N. S. Jayant and P. Noll, "Digital Coding of Waveforms: Principles and Applications to Speech and Video," Prentice Hall, 1984.
[8] J. G. Beerends and J. A. Stemerdink, "A Perceptual Audio Quality Measure," 92nd AES Convention, Vienna, March 1992, Preprint 3311.
[9] K. Brandenburg and T. Sporer, "'NMR' and 'Masking Flag': Evaluation of Quality Using Perceptual Criteria," 11th AES International Conference, pp. 169-179, Portland, 1992.
[10] N. S. Jayant, J. D. Johnston, and Y. Shoham, "Coding of Wideband Speech," Speech Communication 11 (1992), pp. 127-138.
[11] S. R. Quackenbush, "A 7 kHz Bandwidth, 32 kbps Speech Coder for ISDN," Proc. Internat. Conf. Acoust. Speech Signal Process., Paper S1.1, 1991.
[12] E. Ordentlich and Y. Shoham, "Low-Delay Code-Excited Linear-Predictive Coding of Wideband Speech at 32 kbps," Proc. Internat. Conf. Acoust. Speech Signal Process., 1991, Paper S1.3.
[13] A. Fuldseth et al., "Wideband Speech Coding at 16 kbit/s for a Videophone Application," Speech Communication 11, pp. 139-148, 1992.
[14] M. Iwadare, A. Sugiyama, F. Hazu, A. Hirano, and T. Nishitani, "A 128 kb/s Hi-Fi Audio CODEC Based on Adaptive Transform Coding with Adaptive Block Size," IEEE J. on Sel. Areas in Commun., vol. 10, no. 1, pp. 138-144, Jan. 1992.
[15] G. Davidson, L. Fielder, and M. Antill, "High Quality Audio Transform Coding at 128 kbit/s," Proc. of the ICASSP 1990.
[16] J. D. Johnston and A. J. Ferreira, "Sum-Difference Stereo Transform Coding," Proc. ICASSP '92, pp. II-569 - II-572.
[17] K. Tsutsui et al., "Adaptive Transform Acoustic Coding for MiniDisc," 93rd AES Convention, San Francisco, 1992, Preprint 3456.
[18] G. Stoll and Y. F. Dehery, "High Quality Audio Bit-rate Reduction System Family for Different Applications," Proc. IEEE Intern. Conf. on Commun. ICC'90, Rec. Vol. 3, No. 322.3, pp. 937-941, 1992.
[19] K. Brandenburg, J. Herre, J. D. Johnston, Y. Mahieux, and E. F. Schroeder, "ASPEC: Adaptive Spectral Perceptual Entropy Coding of High Quality Music Signals," 90th AES Convention, Paris, 1991, Preprint 3011.
[20] T. Ryden, C. Grewin, and S. Bergman, "The SR Report on the MPEG/Audio Subjective Listening Tests in Stockholm April/May 1991," ISO/IEC JTC1/SC29/WG11: Doc.-No. MPEG 91/010, May 1991.
[21] H. Fuchs, "Report on the MPEG/Audio Subjective Listening Tests in Hannover," ISO/IEC JTC1/SC29/WG11: Doc.-No. MPEG 91/331, Nov. 1991.
[22] B. Edler, "Coding of Audio Signals with Overlapping Block Transform and Adaptive Window Functions" (in German), Frequenz, vol. 43, pp. 252-256, 1989.

Biography

PETER NOLL [F '82] is a professor of telecommunications at the Technical University of Berlin. His research interests are in waveform coding and communication theory. He has authored many technical papers in those fields and is coauthor of the book Digital Coding of Waveforms: Principles and Applications to Speech and Video. He was a recipient of the 1976 NTG Award (Germany) and of the 1977 IEEE ASSP Senior Award (IEEE Acoustics, Speech, and Signal Processing Society). Since 1991 he has chaired the Audio Subgroup within ISO/MPEG.
