Stereo Coding for Audio Compression

Rui Wang, Harold Nyikal, James Yu

March 7, 2005

Abstract

A perceptual audio coder with stereo coding is implemented and reviewed. Mid/Side (M/S) stereo coding is incorporated into the baseline perceptual coder. The coder was tested at 128 kbps with various audio files ranging from music to speech. Stereo coding is shown to reduce much of the redundancy in stereo signals.

1 Introduction

Audio compression has become an increasingly important technique in audio transmission and storage. Perceptually lossless compression is achieved by exploiting psychoacoustic models to discard information that the human auditory system cannot perceive. This yields much higher compression ratios than entropy coding alone. Usually, this is performed in a transform domain, where information is discarded by quantizing transform bins more coarsely.

Joint stereo coding is an extension to the nominal block floating point quantization scheme. The main assumption in stereo coding is that the left and right channels of audio are highly correlated. This is usually the case for speech and music, since the two microphones are spatially close and used at the same time to record the same sounds. This strong correlation suggests that there is redundancy in the stereo signal.

First, we review the perceptual model that is at the heart of the compression scheme. Then, the stereo coder is detailed and analyzed. Finally, compression quality results are provided.

2 Overview of Perceptual Model

The perceptual model is based on the characteristics of the human auditory system. The model ultimately dictates the number of bits assigned to each line in the frequency domain. By assigning different numbers of bits, the coder is essentially quantizing each frequency line with a different level of coarseness based on its relative importance.

The steps of the perceptual analysis are as follows:

1. The signal is divided into blocks of size N.

2. The FFT is applied to each block.

3. The masking model is used to determine the SMR for each bark subband.

4. A waterfilling algorithm assigns the number of bits to be consumed for each subband.

5. The block is quantized using block floating point quantization.

The quantization actually occurs in the MDCT domain. Note that the signal is analyzed in the FFT domain and processed in the MDCT domain. The MDCT is usually the preferred transform since it is critically sampled while still using overlapping windows, which benefits implementation, windowing, and audio quality.

2.1 Masking

The main mechanism behind the audio perceptual coding model is masking. Masking is the act of one particular signal frequency component inhibiting the perceived strength of another frequency component.

For each frequency line in the signal, a masking curve is derived in the bark space using

F (dz, LM) = (−27 + 0.37 max(LM − 40, 0)u(dz))|dz| (1)

where LM is the masker's sound pressure level (SPL) in dB, u(dz) is the unit step function, and dz is the distance to the masker in barks. The shape resembles a triangle with a constant slope on the left and a slope on the right that becomes shallower as the masker's SPL increases. This function is calculated for every frequency line, and the final masking curve is the point-by-point maximum of all the curves and the threshold of hearing curve.
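As a rough illustration of equation (1) and the point-by-point maximum, the following Python sketch evaluates the curve for a handful of maskers. It assumes that dz is the maskee position minus the masker position (so dz > 0 lies above the masker) and that each curve is simply added to the masker's SPL with no additional downward offset; the function names and arguments are hypothetical and are not taken from the coder described here.

```python
def spreading_db(dz, masker_spl_db):
    """Equation (1): level change in dB at a distance of dz barks from a masker."""
    step = 1.0 if dz >= 0 else 0.0  # unit step u(dz)
    slope = -27.0 + 0.37 * max(masker_spl_db - 40.0, 0.0) * step
    return slope * abs(dz)


def masking_curve_db(masker_barks, masker_spls_db, grid_barks, quiet_db):
    """Point-by-point maximum of every masker's curve and the threshold of hearing.

    Assumption: each curve is the masker SPL plus the spreading value,
    with no extra masking offset subtracted.
    """
    curve = []
    for z, quiet in zip(grid_barks, quiet_db):
        level = quiet
        for zm, spl in zip(masker_barks, masker_spls_db):
            level = max(level, spl + spreading_db(z - zm, spl))
        curve.append(level)
    return curve
```

For an 80 dB masker, the slope is -27 dB per bark below the masker and -12.2 dB per bark above it, which reproduces the asymmetric triangle described above.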

The relative importance of each frequency line is determined by the signal-to-mask ratio (SMR), measured in dB. The actual bit allocation is performed on each subband, where a subband is determined by the standard Bark scale. Therefore, a subband with a higher maximum SMR is allocated more bits than a subband with a lower maximum SMR.

Basically, signals that are below the masking threshold will be masked (or hidden) by another (usually stronger) signal component. Thus, we consider these signals to be less important in reconstructing the signal.

2.2 Bit Allocation

The bit allocation scheme used is based on block floating point. Each subband is assigned a scale factor which is applied to all the lines in the block. Each line in the block is then associated with a mantissa. The perceptual coder determines how many mantissa bits, Rb, a subband will receive. A bit pool P is determined for the block based on the desired bitrate and the sampling frequency.

A waterfilling algorithm is used to assign bits to each subband based on the SMR. The algorithm is as follows:

1. Determine the number of bits in the bit pool P, based on the desired bit rate, block size, and sampling frequency.

2. Sort the subbands by SMR.

3. Add one bit to the band with the highest SMR, or two bits if this is the first time the band is being allocated bits.

4. Decrement the SMR by 6 dB × bits allocated.

5. Decrement the bit pool.

6. Go to step 2, and repeat until bit pool is emptied.

Once all the bits have been used, the block is quantized using the block floating point quantization scheme and efficiently packed into the compressed file.
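The six steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the coder's actual implementation: the pool ignores scale-factor and header overhead, each allocation step is assumed to cost one bit for every line in the subband, and an upper limit of 16 mantissa bits per line (the hypothetical `max_bits` parameter) is added so that the loop always terminates. The names `bit_pool`, `waterfill`, and `lines_per_band` are likewise hypothetical.

```python
def bit_pool(bitrate_bps, block_size, sample_rate_hz):
    """Bits available for one block of block_size new samples (overhead ignored)."""
    return int(bitrate_bps * block_size / sample_rate_hz)


def waterfill(smr_db, lines_per_band, pool, max_bits=16):
    """Greedy waterfilling: returns (mantissa bits per subband, unused bits).

    Assumptions: every line in a subband receives the same number of bits,
    so one step costs step * lines_per_band[band]; max_bits caps a band.
    """
    smr = list(smr_db)
    bits = [0] * len(smr)
    active = {b for b in range(len(smr)) if lines_per_band[b] > 0}
    while active:
        band = max(active, key=lambda b: smr[b])  # band with the highest SMR
        step = 2 if bits[band] == 0 else 1        # two bits on first allocation
        cost = step * lines_per_band[band]
        if cost > pool or bits[band] + step > max_bits:
            active.discard(band)                  # band can no longer receive bits
            continue
        bits[band] += step
        smr[band] -= 6.0 * step                   # 6 dB per allocated bit
        pool -= cost
    return bits, pool
```

Selecting the maximum on each pass is equivalent to re-sorting the subbands in step 2. Under these assumptions, a 128 kbps stream with 1024 new samples per block at 44.1 kHz has a pool of 2972 bits per block.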

2.3 Decoding

The decoding algorithm is much less complex than the encoding scheme. This is desirable since in many cases the decoder has less processing power than the encoder.

No psychoacoustic analysis needs to be performed by the decoder. Only the Rb values are needed for the decoder to correctly dequantize the frequency lines. After that, the inverse MDCT is applied block by block, and the resulting blocks are overlap-added to produce the reconstructed signal.

3 Stereo Coding

Joint stereo coding is an extension of the psychoacoustic model that takes advantage of the typically high correlation that exists between the signal power spectra of the left and right channels in stereo audio to improve coding gains. There are various ways of achieving this in practice.

We chose to implement mid/side (M/S) stereo coding as outlined in [1]. In this method, instead of transmitting the left and right channels, the normalized sum (mid) and difference (side) signals are transmitted. Also, the left and right channels share a common bit pool. Depending on the signal, this can reduce the data rate by up to 50%.

For example, consider a stereo signal with identical left and right channels. The difference of the two channels will be zero, and thus the side information (all zeros) can be transmitted with a single bit to say that it is all zeros. This frees up bits, allowing the mid information to be transmitted with twice as many bits. In any case, the left and right signals can be completely reconstructed at the decoder. Hence, the coding gain is roughly 50%.

Though stereo signals are seldom like the example above, the side information is usually smaller in value than either the left or right channel, suggesting a reduction in the number of bits. However, cross-channel psychoacoustics play a big role in the perception of stereo sound, and because the mid and the side both contain information on the left and right channels, we must perform bit allocation for the mid and side information based on a cross-channel psychoacoustic model.

Here, we detail the different components in stereo coding. The flow chart of the encoding algorithm can be seen in Figure 1. The decoding algorithm is shown in Figure 2.

3.1 M/S Decision

The first step in stereo coding is to decide whether to transmit data as left/right or mid/side. There are cases where there are no significant gains in transmitting mid/side information over left/right in certain subbands. In cases like these, the left/right information is transmitted. Our decision for M/S is applied for each subband of the signal. The decision thresholds are

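\[ \sum_{k=f_{\mathrm{lower}}}^{f_{\mathrm{higher}}} (l_k^2 - r_k^2) < 0.8 \sum_{k=f_{\mathrm{lower}}}^{f_{\mathrm{higher}}} (l_k^2 + r_k^2) \tag{2} \]

\[ \sum_{k=f_{\mathrm{lower}}}^{f_{\mathrm{higher}}} (l_k^2 - r_k^2) > 0.8 \sum_{k=f_{\mathrm{lower}}}^{f_{\mathrm{higher}}} (l_k^2 + r_k^2) \tag{3} \]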

where l_k and r_k correspond to the FFT spectral line amplitudes computed in the psychoacoustic model, and f_lower and f_higher correspond to the lower and upper lines within a subband. If either of these conditions is met, then M/S is transmitted; if not, L/R is transmitted. This condition allows M/S transmission in cases where the mid and the side differ in energy by a certain threshold (in this case, 80%).

Figure 1: Flow chart of the stereo encoding algorithm.

Figure 2: Flow chart of the stereo decoding algorithm.

The values of M and S are calculated as follows:

\[ M = \frac{L + R}{2} \tag{4} \]

\[ S = \frac{L - R}{2} \tag{5} \]

where L and R are the filter bank spectral line amplitudes. We can see that no actual information is lost in the transformation to M/S. Both the L and R channels can be easily recovered from the M and S channels.
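The recovery follows directly from the definitions of M and S: adding them gives L = M + S, and subtracting them gives R = M - S. A minimal Python sketch of the forward and inverse transforms, applied line by line to lists of spectral amplitudes, is shown below; the function names are hypothetical.

```python
def to_mid_side(left, right):
    """Forward transform: M = (L + R) / 2 and S = (L - R) / 2 for each line."""
    mid = [(l + r) / 2 for l, r in zip(left, right)]
    side = [(l - r) / 2 for l, r in zip(left, right)]
    return mid, side


def to_left_right(mid, side):
    """Inverse transform: L = M + S and R = M - S for each line."""
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right
```

When the two channels are identical, every side value is zero, which is the case used in the example at the start of this section. The round trip is exact only before quantization; once M and S are quantized, the recovered L and R carry the quantization error of both.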

3.2 Masking in Stereo

Next, the masking thresholds for M and S need to be calculated. This is a step-wise process. First, equation (1) is applied to each M and S frequency line in exactly the same manner as in Section 2.1 to calculate the basic masking thresholds, denoted BTHRm and BTHRs [1].

To calculate the stereo masking contributions of the M and S channels, an additional factor, the masking level difference (MLD) factor, is calculated at each frequency line and multiplied by each of the M and S masking level thresholds to obtain the masking level difference, denoted MLDm and MLDs. The MLD provides a second level of detectability of noise in the M and S channels based on the masking level differences between the channels [1].

Figure 3: The MLD factor that is applied to the masking model. (Horizontal axis: frequency in barks, 0 to 25; vertical axis: MLD factor, 0 to 1.)

Essentially, the MLD is a measure of how detectable a masked signal in the M channel is in the S channel and vice versa. The equation used to calculate the MLD factor is as follows [1]:

\[ \mathrm{MLD} = 10^{\,1.25\left(1 - \cos\left(\pi \frac{\min(z,\,15.5)}{15.5}\right)\right) - 2.5} \tag{6} \]

where z is the frequency in barks. Figure 3 shows what the MLD curve looks like. Now, the masking level differences can be calculated as:

MLDm = MLD × BTHRm (7)

MLDs = MLD × BTHRs (8)

The actual thresholds for M and S are calculated as follows:

THRm = max(BTHRm, min(BTHRs,MLDs)) (9)

THRs = max(BTHRs, min(BTHRm,MLDm)) (10)

The MLD signal essentially substitutes for the BTHR signal in cases where there is a chance of stereo unmasking [1].
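Equations (6) through (10) can be sketched for a single frequency line as follows. This Python sketch assumes the thresholds are expressed in linear power units rather than dB, since they are scaled by a multiplicative factor, and it forms MLDs from BTHRs, in line with the description of the factor being multiplied by each of the M and S thresholds. The function names are hypothetical.

```python
import math


def mld_factor(z_bark):
    """Equation (6): MLD factor at a frequency of z_bark barks."""
    exponent = 1.25 * (1.0 - math.cos(math.pi * min(z_bark, 15.5) / 15.5)) - 2.5
    return 10.0 ** exponent


def stereo_thresholds(bthr_m, bthr_s, z_bark):
    """Equations (7)-(10): final M and S thresholds for one frequency line.

    Assumption: bthr_m and bthr_s are linear power values, not dB.
    """
    mld = mld_factor(z_bark)
    mld_m = mld * bthr_m                       # equation (7)
    mld_s = mld * bthr_s                       # equation (8)
    thr_m = max(bthr_m, min(bthr_s, mld_s))    # equation (9)
    thr_s = max(bthr_s, min(bthr_m, mld_m))    # equation (10)
    return thr_m, thr_s
```

The factor rises from 10^-2.5 at 0 barks to 1 at 15.5 barks and stays at 1 above that. At low frequencies the small factor therefore prevents a strong M masker from raising the S threshold, while at high frequencies the larger of the two basic thresholds is used for both channels.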

3.3 Bit Allocation

The bit allocation scheme used is the same as in the baseline coder. The main difference is that both channels now share a common bit pool P and use the SMRs obtained from the masking curves calculated using the additional MLD factor.

The waterfilling algorithm is now applied to all the frequency lines of both channels. This allows the algorithm to assign bits to lines regardless of which channel they are in. Essentially, this is where the coding gains are achieved. If one channel has a much higher SMR than the other (in the case of the M/S representation), then more bits will be allocated to that channel than to the other.

3.4 Packing and Decoding

The only extra information needed for the joint stereo decoder is an additional bit that tells whether each subband is in an L/R or M/S representation. This information is passed alongside the usual bit allocation bits. The decoder parses this information and converts any M/S representation into L/R for audio playback.

4 Results

We applied the joint stereo coding algorithm to various audio signals. One of the more important tests was to make sure that the stereo image quality is not affected by coding the signal as M/S. This was tested by modulating a signal's channels by sinusoids with different phase offsets.

Specifically, we modulated the left channel using a slowly varying cosine and the right channel with a slowly varying sine. This results in the stereo image weaving from left to right. During these tests we did not notice any degradation in the stereo and audio quality.
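A test signal of this kind might be generated as in the Python sketch below. The modulation rate is not stated here, so the default of 0.5 Hz is purely an assumption, as are the function and parameter names.

```python
import math


def pan_test_signal(mono, sample_rate_hz, pan_rate_hz=0.5):
    """Weave a mono signal between channels: cosine on the left, sine on the right.

    pan_rate_hz is a hypothetical parameter; 0.5 Hz is an assumed default.
    """
    left, right = [], []
    for n, x in enumerate(mono):
        phase = 2.0 * math.pi * pan_rate_hz * n / sample_rate_hz
        left.append(x * math.cos(phase))
        right.append(x * math.sin(phase))
    return left, right
```

Because the two gains are a quarter cycle apart, the signal sits fully in the left channel when the phase is zero and fully in the right channel a quarter cycle later, so the M/S energy balance changes continuously over time.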

For the rest of the quality tests, we used a constant bitrate of 128 kbps on various audio signals. The results were on par with the baseline coder (i.e., the quality was very good). Table 1 shows the listening test results as subjective difference grades (SDG).

Audio Name      SDG
Castanets       -0.5
Rock Music       0
Pop Music        0
Harpsichord     -0.25
Glockenspiel    -0.1
Bass Singer      0

Table 1: Listening test results for the audio signals compressed with the stereo coder. All results use the SDG scale.

5 Conclusion

Stereo coding has been shown to be very useful in reducing the redundancy in stereo audio signals. One can achieve significant gains in stereo coding, which can be utilized either to boost the quality of the reconstructed signal or to lower the bitrate while keeping the signal quality constant with respect to the original coder. This is due to the fact that the M/S representation of the signal is essentially lossless.

Moreover, stereo coding does not hurt the stereo image when the stereo masking model and shared bit pool methods are used correctly. The tests show that the overall quality remains the same or is better. Stereo coding is also popular in standard audio compression techniques, including MP3.

6 Future Work

There are many possible extensions to stereo coding. One of these is intensity coding. The main concept is that since much of the signal can be redundant in both channels, we can code some parts of it as a mono signal and expand it back to a stereo signal at the decoder. This is usually done for higher frequencies, where the ear is less sensitive to the stereo image. Using this idea gives a gain of roughly 50% within those particular frequency bands.

Another possible extension is to use a variable bit pool that will save bits for later use. This will be most dramatic in cases where the M/S representation is severely skewed to one side. Theoretically, we would only need half the number of bits to represent such a signal while keeping the quality constant. These extra bits may be used for more bit-starved blocks in the future. For example, this may alleviate pre-echo effects due to powerful transients.

References

[1] Johnston and Ferreira, Sum-Difference Stereo Transform Coding, Proc. ICASSP, pp. 569-571, May 1992.

[2] Bosi and Goldberg, Introduction to Digital Audio Coding and Standards. Kluwer Academic Publishers, 2003.
