
The University of Manchester School of Computer Science

The Effect of Bit-Errors on Compressed

Speech, Music and Images

Initial Project Background Report

2010

By Manjari Kuppayil Saji Student Id: 7536043

MSc. Advanced Computer Science and IT Management


Abstract

The need to reduce transmission data rates and storage space has given rise to various compression techniques. These techniques exploit the limitations of human perception to discard parts of the data that are likely to be imperceptible to most people. Multimedia data transmitted over wireless networks includes speech, music and image files. Data sent over wireless LAN networks is often subject to bit-errors due to interference, collisions and so on. While any data is prone to such errors, the effect of bit-errors on compressed data is likely to be more pronounced. This project aims to study how these errors affect the perceived quality of the data. Various studies have been performed to determine good metrics that can objectively derive the quality of speech, music and images based on human perception models; such metrics are intended to approximate the results of subjective quality tests, which are expensive to carry out. The project studies the underlying theory behind the different objective metrics. Some of the available software implementations for calculating objective scores will be tested and appropriate ones selected for the study. The project uses MATLAB audio and image processing toolkits as the main software tool to simulate bit-errors and calculate quality scores.


TABLE OF CONTENTS

Abstract
1. INTRODUCTION
   1.1 Description of Project
   1.2 Main Objectives
   1.3 Scope of Project
   1.4 Applications
   1.5 Outline of Report
2. PROJECT BACKGROUND AND RELATED WORK
   2.1 Mobile Communication over WLAN
   2.2 Compression Techniques
   2.3 Compression standards in speech
   2.4 Compression standards in music
   2.5 Compression standards in images
   2.6 Perceptual Quality Measurement
      2.6.1 Speech Quality Measurement
      2.6.2 Music Quality Measurement
      2.6.3 Image Quality Measurement
3. RESEARCH METHODS
   3.1 Project Aims and Objectives
   3.2 Software Tools
   3.3 Project Development
      3.3.1 Implementation
      3.3.2 Deliverables
      3.3.3 Evaluation of Results
4. CONCLUSION
References
Appendix A: Project Plan


1. INTRODUCTION

1.1 Description of Project

Internet and mobile connectivity have grown tremendously in the last few decades, and bandwidth has become an important and costly resource. The aim of compression is to allow information to be represented with fewer bits. For example, if we use a sampling frequency of Fs (samples per second) and represent each sample with N bits, we need to transmit N x Fs bits per second. In a WLAN, the available frequency bandwidth is shared among many users. If we can represent information with fewer bits (reduce N) and accept a compromise on quality by taking fewer samples per second (reduce Fs), we can use the available bandwidth more economically and efficiently: the bandwidth can be shared by more people and information can also be sent faster. Using fewer bits to represent information is also an advantage when storing large amounts of data such as image or music files, which are generally many megabytes in size.

Data packets are often subject to bit-errors. In WLAN networks, data packets are error prone as they are subject to interference in the medium, collisions, multipath propagation effects, scattering of waves and so on. Error detection and correction protocols, such as Forward Error Correction (FEC) techniques, try to correct such bit-errors. If correction by these methods is not possible, a re-transmission of the damaged packet is typically requested. In some real-time applications, however, the delay involved in waiting for a re-transmission is not acceptable, and the damaged packet itself is used, typically with some Packet Loss Concealment (PLC) technique. Advanced PLC techniques follow a model-based approach that interpolates the damaged packet using speech models.

Because the original data is compressed, each transmitted bit potentially carries much more information than a bit of uncompressed data. Some compression techniques use a form of prediction, which means that an error in one bit can have a cascading effect on the rest of the data, further reducing the quality of the information. The effect of bit-errors on compressed data is therefore likely to be more pronounced than on uncompressed data, and this effect is what will be investigated in this project. If the overall effect of bit-errors still results in acceptable quality, it could perhaps lessen the need for PLC techniques.

1.2 Main Objectives

This project is aimed at examining the effect bit-errors have on compressed data. The main objective is to study and evaluate how bit-errors in compressed multimedia affect its perceived quality. To evaluate and compare perceived quality, there needs to be a way of quantifying it. In the case of speech, a metric known as the Mean Opinion Score (MOS) was developed [1] to subjectively measure the quality of narrowband speech. This standard score gives a numerical value to telephone speech quality on a scale from 1 (worst) to 5 (best), calculated by averaging the quality ratings given by a number of human testers. As it is unrealistic to carry out subjective tests very often, a number of objective metrics have been studied to replace the need for human testers. They can be calculated using software implementations based on human perception models, and the success of an objective metric lies in how well it matches subjective scores. Subjective scores like MOS are not as well defined for music and images, so defining a MOS-like score for them, based on the same principles as for speech, will be one of the objectives of this project.


Once the software implementations have been evaluated and appropriate ones selected, quality ratings will be calculated for uncompressed and compressed files into which bit-errors have been introduced. The calculation will be repeated for different bit-error probabilities, providing the data needed to compare and interpret the effects of bit-errors on the different data formats. A set of graphs plotting objective quality score against bit-error probability will also be produced. At the end of the project, application code will be produced with the following functionality:

• Ability to read the required speech, music and image files in the uncompressed and compressed formats.

• Functionality to convert the input files into bit format, allowing the introduction of bit-errors at different probabilities.

• In the case of speech and music, functionality to play back the modified files and allow users to listen to the ‘damaged’ file.

• Similarly with the image files, functionality that allows users to see the modified images with bit-errors introduced.

• Computation of a quality rating for the modified files, implemented using automated software relevant to the different file formats.

• Production of objective quality score vs. bit-error probability plots.

These results will then be studied and a conclusion drawn regarding the effect of bit-errors on compressed data. The results could help in selecting appropriate compression techniques and data rates depending on the application.

1.3 Scope of Project

This project focuses on three kinds of data: speech, music and images. It concentrates on studying the effects of bit-errors on the following commonly used compression techniques:

Speech:
1) G.711 – Pulse Code Modulation (PCM)
2) G.726 – Adaptive Differential Pulse Code Modulation (ADPCM)
3) G.729 – Conjugate Structure CELP (Code Excited Linear Prediction)

Music: MP3 – MPEG-1 Audio Layer 3

Images: JPEG – Joint Photographic Experts Group

As an extension, the effect of bit-errors on the quality of Moving Picture Experts Group (MPEG) compressed video files could also be studied. Short video clips are quite popular, and the study of bit-error effects could help improve the way they are compressed and transmitted.

To simulate a WLAN environment, bit-errors will be introduced into files containing compressed data. Perceived quality is a subjective measurement; ideally, calculating it would mean training a large group of human subjects to rate the different multimedia and then taking the mean of their ratings. This would be time-consuming and inefficient. For this project, the quality of the various data will instead be measured using automated software. Bit-errors will be introduced into compressed data and a rating given to the quality of the resulting data. By increasing the percentage of bit-errors, the effect can be clearly observed. The application program developed will not include implementing the compression algorithms themselves; rather, speech, music and image files in the formats listed above will be modified to introduce bit-errors.

1.4 Applications

Calculating objective quality scores for speech has obvious applications in mobile telephony: these scores have been used to measure network quality, and various experiments have been carried out to determine the objective quality scores of the different compression algorithms in use today. Image and music quality metrics are not yet used as extensively. The following are some applications where the results from this project may be used:

Objective Scores for Music Quality

One application where calculating quality could be useful is digital radio and television broadcasting. Broadcasting uses compressed signals, and since it is a real-time application, re-transmission requests are not feasible. Being able to calculate objective quality scores of transmitted data would make it possible to decide whether a packet can be used with an acceptable loss in quality. Such scores could even be used in feedback mechanisms that allow the bit rate to be increased or adjusted depending on the perceived quality. A similar application is internet radio. Another application is the creation of an internet choir [2]. In this real-time application the performers interact over the internet, so delays cannot be tolerated during a performance; an objective quality score would provide the information needed to reduce the number of re-transmissions for damaged packets.

Objective Scores for Image Quality

Transmitting images digitally almost always requires compression, as a single image can take megabytes of data. Sending images over wireless networks is becoming quite common, with images being exchanged between mobile phones using Bluetooth. An objective quality measure can help mobile application developers decide on transmission bit-rates. Another interesting application is teleradiology, where images such as CT scans are transmitted so that specialist doctors can analyze and diagnose a patient without having to be physically near them. In such cases, an objective quality score can help determine acceptable compression techniques and also help decide whether an image has to be re-transmitted. A related study is given in [3]. Applications of objective quality measurement can also include video conferencing, which combines audio and moving images; as it is a real-time application, a quality rating could be useful in improving the way it is implemented.

1.5 Outline of Report

The rest of this report is organized as follows. Chapter 2 introduces the main topics related to this project. First, the characteristics of mobile communication over a WLAN are described, elaborating the main concerns and briefly contrasting wireless networks with traditional wired networks. Next, it outlines the compression algorithms behind the formats that will be used in this project and explains how these algorithms exploit human perception of sight and sound to improve compression efficiency. Finally, it looks at some of the objective quality indices used for speech, music and images. Chapter 3 outlines how the project will be conducted: it states the aims and objectives, describes the software tools that will be used, and gives the implementation details along with a list of deliverables and how the results will be evaluated. The project has been broken down into tasks; a timeline for completing them and a Gantt chart are provided in Appendix A. The final chapter gives a conclusion to the report.


2. PROJECT BACKGROUND AND RELATED WORK

This section gives background information from the current literature on the main topics related to this project. It also outlines some of the theory behind the compression algorithms that will be used in the project.

2.1 Mobile Communication over WLAN

The radio spectrum covers frequencies below 300 GHz. The use of this spectrum is regulated, and different applications use different frequency bands: some ranges are reserved for military communication, while others are used for TV signals, RFID and so on. IEEE 802.11b, a widely used standard for wireless LAN communication, specifies the 2.4 GHz band for WLAN use. This band is split into 14 channels, with adjacent channels overlapping.

When data is sent over traditional wired networks there is hardly any interference, as the path (the wires) through which the data packets travel is well defined. It is also possible to predict the degradation of signals because the properties of the medium are known. In wireless networks, on the other hand, interference can occur due to a variety of factors outlined in [4]. As radio signals travel through the atmosphere they are attenuated and obstructed by physical structures such as buildings and trees, which weakens the signal; even particles of dust in the atmosphere can affect it. This effect is called path loss. Another phenomenon that affects wireless networks is multipath propagation: signals can be reflected off physical structures, so different parts of the signal reach the receiver through different paths with different delays. This can lead to transmission errors as the data received through the different paths interferes, an effect known as Inter-Symbol Interference (ISI). Additionally, in mobile communications the sender or receiver (or both) may be moving, introducing further complexities such as constantly changing multipath effects. This increases the probability of errors being introduced into the signal.

Unlike wired communications, the medium (air) in wireless communications is shared among all users, which means that different signals can interfere with one another. For WLAN communications this limited bandwidth therefore has to be shared among all users. Commonly used sharing techniques include multiplexing over the time, frequency or code domains; IEEE 802.11b specifies the Direct Sequence Spread Spectrum (DSSS) technique for transmission.

From this discussion it is clear that bandwidth is a costly resource. The main advantage of a WLAN is that it enables people to be mobile, but most applications these days require the transmission of large amounts of data, which increases usage of the available frequency bandwidth. To make transmissions more efficient, we need to reduce the amount of data to be transmitted, and this is why compression techniques are essential. Effective compression techniques represent the same information using fewer bits, which has immediate advantages: it reduces the amount of data to be transferred over the WLAN and takes up less storage space when the data is saved to disk. Compression techniques for different kinds of data are discussed in the following sections.


2.2 Compression Techniques

There are broadly two kinds of compression: lossless and lossy. In lossless compression, as the name suggests, the compressed data can be decompressed at the receiver's end to re-create an exact copy of the original. Lossless algorithms exploit the inherent redundancy in data. For example, in text, if a single character repeats 100 times, it is more efficient to transmit that character and the number of times it repeats (100) rather than sending the character 100 times; a tiny sketch of this run-length idea is given at the end of this section. Another example is exploiting properties of the English language: the letter 'q' is almost always followed by a 'u', so in most cases it is not necessary to send the character after 'q'. Using a good encoding scheme to represent the data in binary form also helps to reduce bandwidth usage. Lossless compression is generally used for text and data files and for reducing the space required to store information. Common algorithms used for this purpose are Huffman coding and the Lempel-Ziv-Welch (LZW) algorithm.

In lossy compression, on the other hand, the decompressed data at the receiver's end will not exactly match the original signal; the advantage is that far fewer bits are required to represent the information. Lossy compression is generally used for multimedia applications including speech, music and images. These techniques exploit the fact that a small drop in the quality of multimedia data will not be perceptible to humans, or can be tolerated depending on the application. For example, the human ear can hear sounds only in a particular range of frequencies and the human eye can see only certain wavelengths. Lossy compression algorithms include Linear Predictive Coding (LPC) and A-law for speech, MP3 for music and JPEG for images. The next section looks at some of these lossy compression algorithms in detail.
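The character-repetition example above can be captured in a few lines of run-length encoding. The sketch below is illustrative only and is not one of the algorithms studied in this project.

```matlab
% A minimal run-length encoding sketch for the repeated-character example above;
% real lossless coders (Huffman, LZW) are considerably more sophisticated.
text = ['abc' repmat('x', 1, 100) 'def'];        % 'x' repeated 100 times
changes = [true, diff(double(text)) ~= 0];       % positions where a new run starts
starts  = find(changes);
runlen  = diff([starts, numel(text) + 1]);       % length of each run
symbols = text(starts);                          % the symbol of each run
% Encoded as (symbol, count) pairs: 106 characters shrink to 7 runs here.
for k = 1:numel(symbols)
    fprintf('%c x %d\n', symbols(k), runlen(k));
end
```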

2.3 Compression standards in speech

G.711 – Pulse Code Modulation (PCM)

PCM is a waveform speech compression technique. The analog speech signal is first digitized by taking samples at regular intervals and representing the value of each sample in binary. In G.711, speech is sampled at 8000 samples/sec and each sample is encoded using 8 bits. G.711 uses companding (non-uniform quantization) to achieve good quality speech with just 8 bits: it ensures that the signal-to-quantization-noise ratio (SQNR) remains acceptable for speech amplitudes ranging from soft to loud. This results in a 64 kbit/s bit stream. The North American variant of G.711 is called µ-law; most other countries use a slightly different version known as A-law.

PCM first converts speech signals into digital form, i.e. the signals are represented as 0s and 1s. To do this, the amplitude of the signal is measured a fixed number of times per second; this rate is known as the sampling frequency. Each amplitude value is then represented using a fixed number of bits, a step called quantization. Depending on the number of bits used per sample, the signal range is divided into a fixed number of quantization levels. Quantization introduces noise into the original signal because it involves an approximation, or 'rounding error', when amplitude values are mapped to these levels. When the quantization levels are equally spaced, the scheme is known as uniform quantization. The noise introduced can be measured by the signal-to-quantization-noise ratio (SQNR); if this ratio is sufficiently high, the noise will be imperceptible to the human ear.


When uniform quantization is used, speech with small amplitudes that fall between quantization levels, and speech with very large amplitudes that fall beyond the quantization range, will be represented with a large margin of error, reducing the SQNR to an unacceptable value. To overcome this, non-uniform quantization is used, where the quantization levels lie closer together for small amplitudes and further apart for large ones. Figure 1 below illustrates non-uniform quantization (companding); in this figure each sample is represented with 8 bits.

Figure 1: Converting uniform quantization to non-uniform quantization (companding) [5]

The effect of non-uniform quantization with 8 bits (the PCM sample size) is roughly equivalent to quantizing the original signal uniformly with 12 bits. Uncompressed (linear) PCM speech samples are commonly stored as 16-bit integers, so to reduce the data rate the G.711 A-law algorithm converts each 16-bit value into an 8-bit value with a floating-point-like format (sign, exponent and mantissa). At the receiver's end, each 8-bit value is mapped back to a corresponding 16-bit value to recover an approximation of the original signal. Thus, at an 8 kHz sampling frequency, G.711 reduces the required data rate from 128 kbit/s to 64 kbit/s, achieving 50% compression. A small sketch of the companding idea follows.
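The sketch below illustrates µ-law companding on a few example amplitudes; it is a simplified illustration of the principle, not the bit-exact G.711 encoder (which works on the segmented 8-bit representation described above).

```matlab
% A minimal sketch of mu-law companding, assuming input samples normalised to
% [-1, 1]; this illustrates the principle rather than reproducing G.711 exactly.
mu = 255;                                              % mu-law constant
x  = [-1 -0.5 -0.1 -0.01 0 0.01 0.1 0.5 1];            % example input amplitudes
y  = sign(x) .* log(1 + mu*abs(x)) / log(1 + mu);      % compress (expand small values)
yq = round((y + 1)/2 * 255) / 255 * 2 - 1;             % uniform 8-bit quantization of y
xr = sign(yq) .* ((1 + mu).^abs(yq) - 1) / mu;         % expand at the receiver
disp([x; xr]);                                         % small amplitudes keep good relative accuracy
```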

G.726 – Adaptive Differential Pulse Code Modulation (ADPCM)

In differential coding, rather than encoding the amplitude values themselves as in PCM, the difference between adjacent samples is encoded; because the differences are smaller, they can be represented with fewer bits. ADPCM uses a prediction algorithm to predict the value of the next sample from previous samples, and only the difference between the predicted and the actual sample is transmitted. The receiver uses the same prediction algorithm and can thus reconstruct the sample values. The term 'adaptive' refers to the fact that the quantization of the differences is adjusted dynamically according to the speech signal. The G.726 standard reduces the bit stream to 16, 24, 32 or 40 kbit/s by encoding each sample difference with 2, 3, 4 or 5 bits respectively. An important consideration in differential coding is that an error in the prediction also affects all subsequent samples; to make sure an error does not persist, the receiver is designed so that its effect decays. Sub-band ADPCM is a related technique in which speech is split into frequency bands, with more bits used for the bands most important for comprehending speech and fewer bits for the bands that convey, for example, emotion. A small sketch of the differential-coding idea follows.
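The sketch below shows differential coding with a fixed first-order predictor and a fixed quantizer step; it illustrates the principle only, since G.726 uses an adaptive predictor and adaptive quantizer.

```matlab
% A minimal sketch of differential coding with a fixed predictor and step size;
% the test signal is a synthetic sine wave standing in for speech samples.
x = round(1000*sin(2*pi*200*(0:79)/8000));  % 80 speech-like samples at 8 kHz
pred = 0;  q = 16;                          % predictor state and quantizer step
d = zeros(size(x));  r = zeros(size(x));
for n = 1:length(x)
    e    = x(n) - pred;                     % prediction error
    d(n) = round(e/q);                      % coarse quantization of the difference (what is sent)
    r(n) = pred + d(n)*q;                   % decoder reconstruction
    pred = r(n);                            % both ends predict from the reconstructed value
end
fprintf('max reconstruction error: %d\n', max(abs(x - r)));
```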

G.729 – CS- CELP (Conjugate Structure-Code Excited Linear Prediction)

This is a source speech compression technique as opposed to a waveform codec. It is based on the source-filter model of speech production, which models how human sound is actually produced by the vocal cords, the vocal tract and other speech organs. Speech formation is a complex process, and the source-filter model makes it possible to synthesize speech using mathematical equations. In order to achieve compression, CELP uses two codebooks, one basic and one adaptive, to represent characteristic residue signals.

Source Filter Model [6]: Human speech is created mainly by two kinds of source signals: voiced and unvoiced sounds. Voiced sounds involve vibration of the vocal cords, while unvoiced sounds do not; in English, all vowels are voiced sounds. The frequency at which the vocal cords vibrate determines the perceived pitch and is known as the fundamental frequency of voiced sounds. Constantly changing fundamental frequencies are a large part of what conveys emotion in human speech.

Figure 2: Source Filter Model [7]

Once the source is produced, the air has to pass through the vocal tract. The human vocal tract acts as a filter, attenuating some frequencies and amplifying others (resonance); the position and alignment of the various speech organs determines how this filter behaves. The frequencies amplified by the vocal tract are called 'formants', and at these frequencies the amplitude of the signal is greatest. Characteristic frequencies called anti-formants are also present in nasal sounds and are needed to model speech accurately. Voiced sounds are represented by their fundamental frequency and unvoiced sounds can be represented by white noise, while the vocal tract filter is characterized by its formant and anti-formant frequencies. The effect of the lips on speech is called lip radiation and can be modelled using a high-pass filter. A pictorial representation of the source-filter model is shown in Figure 2. Once these characteristics are determined, speech can be synthesized by passing the source signals through a filter that produces the formant frequencies.

Linear Predictive Coding [8][9]: The formants can be modelled with a set of equations that predict the next output as a function of previous outputs. Linear prediction computes the coefficients of a time-varying filter that represents the formants. To transmit speech, the sender first computes the formant (filter) coefficients from the original speech by linear prediction, and then inverse-filters the speech with them to obtain the source (residue). At the receiver's end, the formant coefficients are used to create a filter through which the source is passed, resulting in synthesized speech. In practice, the residue left behind after the formants are calculated also contains useful information. Since it is not practical to transmit the residue itself, a codebook of typical residue values is stored; the sender finds the codebook entry that most closely matches the residue (least error) and sends its index to the receiver, which uses it to retrieve the residue so that the synthesized speech is of better quality. To improve the efficiency of this scheme, G.729 uses two codebooks: a basic codebook contains residue samples for one pitch, and a second, adaptive codebook is filled during transmission and provides pitch information by mapping time delays to changes in pitch. The G.729 standard achieves a bit stream output rate of 8 kbit/s. A small sketch of the linear-prediction step follows.
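A minimal sketch of this analysis/synthesis loop on a single frame is shown below, assuming the Signal Processing Toolbox is available for lpc(); random noise stands in for a real speech frame so the script is self-contained, and G.729 of course adds codebooks and pitch prediction on top of this basic idea.

```matlab
% A minimal linear-prediction sketch on one 30 ms frame (240 samples at 8 kHz);
% requires the Signal Processing Toolbox for lpc().
x = randn(1, 240);                 % stand-in for one frame of speech samples
p = 10;                            % prediction order (number of LPC coefficients)
a = lpc(x, p);                     % coefficients of the all-pole vocal-tract filter A(z)
e = filter(a, 1, x);               % inverse filtering leaves the residue (excitation)
x_hat = filter(1, a, e);           % passing the residue through 1/A(z) re-synthesizes the frame
fprintf('max re-synthesis error: %g\n', max(abs(x - x_hat)));
```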

2.4 Compression standards in music

MP3 – MPEG-1 Audio Layer 3

When music is digitized for storage on a CD, it is sampled 44,100 times a second with 16 bits per sample in two (stereo) channels, which works out to a data rate of roughly 1.4 Mbit/s. This rate is highly impractical for transferring music over a WLAN. The MP3 algorithm is based on a perceptual model of human hearing: by modelling the characteristics of the human ear and removing parts of the music signal that will not be heard, the number of bits needed to represent the music can be reduced. The human ear is most sensitive to mid-range frequencies between 2 kHz and 4 kHz; at lower and very high frequencies a tone has to be at a higher loudness level (in decibels) to be perceived. MP3 uses two phenomena, frequency masking (or simultaneous masking) and temporal masking, to achieve compression [10].

In frequency masking, a loud tone masks softer ones at nearby frequencies. Equal-loudness contour graphs, shown in Figure 3 [11], describe the threshold of hearing of the human ear at different frequencies. As an example, the graph indicates that a 20-decibel sound at 1000 Hz is perceived at about the same loudness as a 50-decibel sound at 100 Hz; this is taken into account during simultaneous masking. The dotted line in the graph, marked as the threshold of audibility, shows the minimum level needed by a signal at each frequency to be perceived by humans. A phon is the unit of perceived loudness level: a tone has a loudness level of N phons if it is judged as loud as an N dB SPL tone at 1 kHz, so the unit accounts for the effect of frequency on perceived loudness [14].

Figure 3: Equal Loudness Contour Graphs

When loud tones are present, this threshold curve changes and more sounds are masked at adjacent frequencies, as shown in Figure 4.

Figure 4: Effect of simultaneous masking on audibility threshold

The human auditory system can be modelled as roughly 24 critical bands; frequencies within the same band are harder for the ear to distinguish, so the effects of simultaneous masking are best modelled within these bands. An MP3 encoder therefore first divides the music signal into 32 frequency sub-bands using a filter bank, and each band is processed separately, with mid-range frequency bands allocated more bits. Temporal masking exploits the fact that quieter tones occurring shortly before or after a loud tone are inaudible; the time span over which this effect lasts is a function of the amplitude difference between the tones. MP3 also uses encoding techniques such as Huffman coding, which represents more frequent values with fewer bits, further improving the bit rate. Using MP3 compression, data rates are typically reduced to between 96 and 320 kbit/s, and a rate that provides sufficient quality can be chosen for the application. The sketch after this paragraph plots the approximate threshold-of-hearing curve on which such psychoacoustic models are built.
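The following MATLAB fragment plots Terhardt's commonly used approximation of the absolute threshold of hearing; it is only an illustration of the kind of curve a psychoacoustic model starts from, not part of the MP3 standard itself.

```matlab
% A minimal sketch of the approximate absolute threshold of hearing in quiet
% (Terhardt's formula); an actual MP3 psychoacoustic model is far more elaborate.
f   = logspace(log10(20), log10(20000), 200);   % frequencies in Hz
khz = f/1000;
ath = 3.64*khz.^(-0.8) - 6.5*exp(-0.6*(khz - 3.3).^2) + 1e-3*khz.^4;  % dB SPL
semilogx(f, ath);
xlabel('Frequency (Hz)');  ylabel('Threshold (dB SPL)');
title('Approximate threshold of hearing in quiet');
```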

2.5 Compression standards in images

JPEG – Joint Photographic Experts Group

Digital images are represented by pixels, which are numerical representations of colour and brightness information. Colour images are typically stored as 3-D arrays with 8-bit values for each of the red, green and blue components. The broad steps of JPEG compression are as follows [15].

JPEG exploits the human eye's perception of colour and brightness to reduce the size of images. The eye is more sensitive to brightness than to colour, so the colour information in an image can be coded with less accuracy. RGB images are therefore first converted into a luminance-chrominance representation (YCbCr, often loosely called YUV), and the chrominance values of blocks of pixels are averaged, since they need not be represented with complete accuracy. The pixels of each component are then arranged into 8x8 blocks and a discrete cosine transform (DCT) is applied to each block, representing it as a set of 64 DCT coefficients. The first coefficient (the DC coefficient) is proportional to the average value of the pixels in the block, while the remaining 63 are the AC coefficients. In the quantization step, two tables of constants are used, one for the luminance and one for the chrominance component; each DCT coefficient is divided by the corresponding table entry and rounded to the nearest integer. It is in this step that information is lost: the higher-frequency coefficients, which matter less to human perception, are given larger quantization factors and are largely eliminated. The quantization tables are sent along with the compressed image in the JPEG header so that the receiver can decode it. Finally, Huffman coding is used to represent the quantized values. Figure 5 depicts a block diagram of how the JPEG algorithm works, and a small sketch of the central DCT-and-quantize step is given after the figure.


Figure 5: Steps in the JPEG Compression Algorithm [16]
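The fragment below sketches the DCT-and-quantize round trip on a single 8x8 block, assuming the Image Processing Toolbox for dct2/idct2 and using a made-up quantization table; the real codec adds chroma subsampling, zig-zag ordering and Huffman coding around this core.

```matlab
% A minimal sketch of the lossy DCT-and-quantize step at the heart of JPEG;
% the quantization table Q below is a toy example, coarser at high frequencies.
block = double(round(rand(8)*255)) - 128;    % one 8x8 block of pixel values, level-shifted
Q = 16*ones(8) + 8*repmat(0:7, 8, 1) + 8*repmat((0:7)', 1, 8);
C  = dct2(block);                            % 64 DCT coefficients; C(1,1) is the DC term
Cq = round(C ./ Q);                          % lossy step: divide by the table and round
block_rec = idct2(Cq .* Q);                  % decoder: de-quantize and inverse DCT
fprintf('mean absolute error after one block round trip: %.2f\n', ...
        mean(abs(block(:) - block_rec(:))));
```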

2.6 Perceptual Quality Measurement

The Mean Opinion Score (MOS) is a scale used to subjectively measure the quality of narrowband speech. It ranges from 1 (worst) to 5 (best) and is generally used to measure the quality of compressed data or data sent over a communication channel. Interpretation of quality differs from individual to individual, so an MOS score is determined by averaging the scores given by a large group of test participants. Subjective scores for images and video are not as well defined, but they could be calculated in the same way, by taking the average rating of a group of individuals. Collecting subjective scores is a time-consuming and expensive process, and it is not feasible to perform subjective quality tests at short notice. Automating these scores, to obtain an objective quality rating that approximates human judgement, is a much more practical strategy. The algorithms designed to calculate such scores are typically full-reference algorithms, which take both the original and the processed/transmitted signals as inputs. As a very simple example, the signal-to-noise ratio could be calculated from the two inputs and mapped to a rating level (a sketch is given below), but this would not give an accurate indication of quality from a human perspective; a useful algorithm has to be based on human perception models.
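The fragment below computes such a naive full-reference measure, a global SNR in dB between a reference signal and a degraded copy (random data is used here just so the script runs); perceptual metrics such as PESQ, PEAQ and SSIM go far beyond this.

```matlab
% A minimal sketch of a naive full-reference measure: global SNR in dB between
% an original signal x and a degraded signal y of the same length.
x = randn(1, 8000);                 % stand-in for one second of reference speech at 8 kHz
y = x + 0.05*randn(size(x));        % stand-in for the degraded version
snr_db = 10*log10(sum(x.^2) / sum((x - y).^2));
fprintf('SNR = %.1f dB\n', snr_db);
```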

The following are the standards laid down by the ITU for objective quality measurement:

Speech: ITU-T Recommendation P.862 (02/01) – Perceptual evaluation of speech quality (PESQ), an objective method for end-to-end speech quality assessment of narrowband telephone networks and speech codecs.

Music (Audio): ITU-R Recommendation BS.1387-1 – Method for objective measurements of perceived audio quality.

Images: No such standard exists for still images, although there are ITU recommendations for video and large-screen digital imagery.

Video: ITU-T Recommendation J.247 (08/08) – Objective perceptual multimedia video quality measurement in the presence of a full reference. It specifies a score called Perceptual Evaluation of Video Quality (PEVQ).


2.6.1 Speech Quality Measurement

The first standard for telephone-band quality measurement was ITU-T Recommendation P.861, called the Perceptual Speech Quality Measure (PSQM), introduced in 1996. This algorithm had some drawbacks; for example, it could not assess the quality of signals that were not time-aligned. In modern situations where time delay is prevalent, such as VoIP, the algorithm was inadequate, and a new recommendation had to be established. Research in the field led to the development of a new recommendation, PESQ, in February 2001 [17].

PESQ - Perceptual Evaluation of Speech Quality

As shown in Figure 6, the PESQ algorithm uses a perceptual model to derive internal representations of the input signals after time alignment. The difference between these representations is used to predict quality using a cognitive model.

Figure 6: The basic structure of the PESQ Algorithm [17]

The PESQ algorithm has been used to calculate the quality of the waveform codecs (G.711, G.726) as well as the G.729 CELP codec at data rates above 4 kbit/s. The PESQ algorithm is recommended, among other uses, for testing in the presence of environmental noise; this kind of testing with degraded signals is the key focus of this project. For an accurate objective score, the reference signal must be uncompressed and free of noise; the degraded signal will be the compressed signal with bit-errors introduced.

2.6.2 Music Quality Measurement

PEAQ - Perceptual Evaluation of Audio Quality

PEAQ is the algorithm specified in ITU-R BS.1387 [18]. As with PESQ, it uses human auditory models to determine the quality of music. The algorithm takes the reference signal and the signal to be tested as inputs; both must be sampled at 48 kHz and must be time-aligned. PEAQ analyses the audio frame-by-frame. For each of the two input signals, a set of 'ear model' output values is calculated for every frame. These values are obtained by passing each frame through an FFT-based ear model (as defined for the Basic version of ITU-R BS.1387; the Advanced version uses both an FFT-based and a filter-bank-based ear model). The ear model values capture 'loudness, modulation, masking and adaptation' [19] effects as interpreted by the human brain. The differences between the values for corresponding frames of the test and reference signals are then determined and mapped to a set of Model Output Variables (MOVs) [20], which are based on audible differences, i.e. variations between the reference and test signals that are likely to be perceived by the human ear. The Basic version of PEAQ defines 11 MOVs and the Advanced version defines 5. The MOVs are mapped to a single value called the Objective Difference Grade (ODG) using a cognitive model, which in PEAQ is an artificial neural network trained on data collected from subjective listening tests. Figure 7 shows the steps involved in PEAQ.

Figure 7: Block Diagram of the PEAQ algorithm (Basic Version)

2.6.3 Image Quality Measurement

Earlier, image quality was measured using error-sensitivity-based approaches. In these approaches the images are split into different channels using methods like the discrete cosine transform (DCT), and the corresponding channels of the test and reference images are compared using a Contrast Sensitivity Function (CSF), with each channel given a different weighting factor based on visual sensitivity models. The weaknesses of this approach are outlined in [21]. The main weakness pointed out is that all kinds of distortion are treated alike, even though some errors affect perceived quality much less than others. Error-sensitivity approaches also work well only for simple patterns: in more complex images the interaction between channels needs to be taken into consideration, and the correlation between error sensitivity and perceived quality is not very strong. Based on these observations, a new approach to image quality measurement is outlined in [22] and [23]. Both of the resulting indices are based on the idea that image quality can be measured by comparing the structural similarity between the test and reference images; SSIM is a generalized version of UQI.

SSIM – Structural Similarity Index

In this approach, the luminance, contrast and structure of the images are compared separately, and luminance effects are removed before the structures are compared, since luminance does not affect structural differences. The SSIM index is calculated over the image in small 8x8 pixel windows, and the overall image quality is then taken as the mean value over the entire image. Mathematically, luminance is measured as the mean of the image samples and contrast as their standard deviation. The SSIM algorithm has been implemented in MATLAB and will be used for this project. Figure 8 shows a block diagram of how the SSIM index is derived, and a simplified sketch of the underlying formula is given after the figure.

Figure 8: Block Diagram of the SSIM algorithm [23]

The resulting SSIM index lies in the range -1 to 1. A comparison of the UQI and SSIM indices with other measurements, such as the Mean Square Error (MSE) and a multidimensional quality measure using SVD (Singular Value Decomposition), is given in [24] and [25]. A related study in [26] uses a metric called the Image Quality Scale; the material available on this measure is limited and a suitable software implementation was not available for further study.
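The fragment below computes the SSIM formula once, globally, over two stand-in grayscale images; it is only a sketch of the formula from [23], whereas the ssim.m implementation planned for the project applies it in sliding windows and averages the local values.

```matlab
% A minimal global SSIM sketch for two equally sized grayscale images with an
% 8-bit dynamic range; random data stands in for real images here.
img1 = 255*rand(64);  img2 = img1 + 5*randn(64);   % stand-in reference and test images
C1 = (0.01*255)^2;  C2 = (0.03*255)^2;             % stabilising constants from [23]
mu1 = mean(img1(:));   mu2 = mean(img2(:));        % luminance terms (means)
v1  = var(img1(:));    v2  = var(img2(:));         % contrast terms (variances)
cov12 = mean((img1(:) - mu1).*(img2(:) - mu2));    % structure term (covariance)
ssim_val = ((2*mu1*mu2 + C1)*(2*cov12 + C2)) / ((mu1^2 + mu2^2 + C1)*(v1 + v2 + C2));
fprintf('global SSIM approximation: %.3f\n', ssim_val);
```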


3. RESEARCH METHODS

This chapter lays out the methodology that will be followed to complete the project. It describes the software tools that will be used and how the project will be developed. The project has been separated into tasks, and a project plan and Gantt chart have been created; these are given in Appendix A. The project involves introducing bit-errors into uncompressed and compressed speech, music and image files. The effect of the bit-errors will be observed as a series of graphs of bit-error probability vs. perceived quality in terms of a mean opinion score.

3.1 Project Aims and Objectives

1) Study the effect of bit-errors on uncompressed and compressed speech, music and image files.
2) Study the effect of an increasing percentage of bit-errors on uncompressed and compressed narrowband speech (G.711, G.726 and G.729 codecs).
3) Study the effect of an increasing percentage of bit-errors on uncompressed and compressed music files (e.g. MP3).
4) Study the effect of an increasing percentage of bit-errors on uncompressed and compressed images (e.g. JPEG).
5) Extend this study to include video files (e.g. MPEG).

3.2 Software Tools

MATLAB is the software environment that will be used to carry out the study. It is a technical computing tool whose programming language is similar to C, and it contains toolboxes that allow both images and audio to be processed easily. Introducing errors into the compressed files and plotting the resulting graphs will be done in MATLAB. In addition, MATLAB implementations of objective quality assessment software will be used; these objective quality scores are based on the ITU standards mentioned in Chapter 2. The following implementations will be used:

1) Music: MATLAB implementation of PEAQ - PQevalAudio.m. A related report is available in [27].
2) Images: MATLAB implementation of SSIM - ssim.m [31].

To compress the speech and music files, the audio editing tool Audacity will be used; version 1.3.11 beta (Unicode) was downloaded for this purpose [32]. It uses the LAME v3.98.2 MP3 encoder. The uncompressed files will be imported and then exported into the various compressed formats as required. To compress image files to the JPEG format, Microsoft Paint will be used.

3.3 Project Development

3.3.1 Implementation

The implementation will follow the structure listed below for compressed files:

Step 1: Compressing original files
Step 2: Reading compressed files into MATLAB
Step 3: Introducing bit-errors into the files
Step 4: Calculating objective quality scores
Step 5: Producing bit-error probability vs. objective score graphs
Step 6: Varying the bit-error rate and repeating Steps 3-5


Figure 9: Block Diagram Showing the Implementation Steps in the Project

The project will be divided into three main parts: Speech Analysis, Music Analysis and Image Analysis.

Speech Analysis

A sample WAV speech file will be taken as the original uncompressed speech. This file will then be compressed into the required formats (G.711, G.726 and G.729) using Audacity, an audio editing tool. The compressed sample file will be read as a bit array. To introduce errors, an XOR computation is performed: a bit that is XORed with 1 is inverted, so an 'error' bit array is created in which each bit is set to 1 with the required bit-error probability. XORing the two arrays gives the modified file. To listen to the modified file, the individual bits will be converted back into integer samples and played using the sound function in MATLAB. A sketch of this bit-flipping step is given below.
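A minimal sketch of the planned bit-flipping step is shown below; the file names are placeholders and only core MATLAB functions are used, so the exact reading and writing code may differ in the final implementation.

```matlab
% Read a compressed file as raw bytes, expand it to a bit array, XOR it with a
% random 'error' bit array in which each bit is 1 with probability p, and write
% the damaged copy back out. File names are placeholders.
p   = 1e-4;                                          % bit-error probability to simulate
fid = fopen('speech_compressed.bin', 'rb');
bytes = fread(fid, inf, 'uint8');  fclose(fid);
bits  = rem(floor(bytes * 2.^(-(7:-1:0))), 2);       % N x 8 bit array, MSB first
err   = rand(size(bits)) < p;                        % error array: 1 with probability p
bits  = double(xor(bits, err));                      % XOR with 1 inverts a bit
bytes_out = bits * (2.^(7:-1:0))';                   % pack the bits back into bytes
fid = fopen('speech_damaged.bin', 'wb');
fwrite(fid, bytes_out, 'uint8');  fclose(fid);
```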

Both the original WAV file and the modified file are passed to the PESQ software to calculate speech quality. This value will be calculated for the different bit-error probabilities and the graph can then be plotted.

Music Analysis

An objective score for perceptual music quality will first be chosen by experimenting with the different metrics outlined in the literature. The comparison of the metrics will be based on tests as outlined in the section on Evaluation of Results (3.3.3).

To evaluate the effect of bit-errors on compressed music, uncompressed music samples will be obtained by ripping a file from a CD in the WAV format. The music editing tool Audacity will be used to compress this file into the MP3 format. Errors will be introduced in the same way as for the speech files. The original and modified files will then be passed to code that calculates the chosen objective quality score.



Image Analysis

As with music, an objective quality score for images will be determined first. To evaluate the effect of bit-errors, a digital image in the GIF format will be used as the uncompressed reference version. It will be compressed into JPEG using Microsoft Paint. The resulting JPEG image will be read using the imread function in MATLAB, and the integer values will then be converted into bits and stored in a bit array. Errors are introduced in the same way as for the speech files. To display the modified file, the bit values will be converted back into integer values and the image displayed using the imview function in MATLAB. To calculate the objective quality of an image, it will first be converted to grayscale, for which MATLAB provides functions. The original and modified images will then be passed to the SSIM code, which provides a structural similarity rating; a sketch of this last step is given below.
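A minimal sketch of that grayscale-and-score step is shown below, assuming the Image Processing Toolbox for rgb2gray and assuming that the downloaded ssim.m takes two grayscale images and returns a single index; the file names are placeholders.

```matlab
% A minimal sketch of the planned image quality step; the interface of ssim.m
% is an assumption and may differ in the actual downloaded implementation.
ref = imread('reference.png');                 % placeholder reference image
tst = imread('damaged.jpg');                   % placeholder image with bit-errors
if ndims(ref) == 3, ref = rgb2gray(ref); end   % convert colour images to grayscale
if ndims(tst) == 3, tst = rgb2gray(tst); end
quality = ssim(double(ref), double(tst));      % assumed signature of ssim.m
fprintf('SSIM = %.3f\n', quality);
```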

3.3.2 Deliverables

• Establishing a MOS-like subjective score for music and images.
• Determining an appropriate objective quality measure for music and images by testing different measures outlined in the literature.
• Calculating an objective quality score for uncompressed and compressed data using appropriate software.
• An interpretation of the effect of bit-errors on uncompressed and compressed data using the calculated objective scores.
• A comparison of the effect of bit-errors on the various compression algorithms studied.

3.3.3 Evaluation of Results

I. Objective Quality Measures

In the case of speech files, the objective opinion score (PESQ) is a standard implementation for perceptual speech quality calculation [28]. This score has been tested and shown to closely match Mean Opinion Scores (MOS) for telephone-quality speech. In the case of music and image files, the different objective measures described in the literature need to be evaluated. The only way to test the accuracy of the objective quality ratings obtained will be to compare them with the results of a subjective analysis. As music and image data do not have a definite MOS-like subjective score as speech does, a subjective test will be carried out to define one. ITU-R BS.1284-1 [29] provides standards for conducting listening tests; these guidelines will be used while collecting the subjective scores for music files. A similar guideline for assessing the quality of television images is given in [30] and can be used for the image tests. As part of the test, a group of 10-12 individuals will be asked to rate an image or music file on a scale of 1 to 5 based on the noticeable impairment of the file:

5 – Imperceptible
4 – Perceptible but not annoying
3 – Slightly annoying
2 – Annoying
1 – Highly annoying

The test will be carried out on different samples of uncompressed files, and on files compressed in the different formats.


Some of the objective quality scores described in the literature do not lie in the range 1 to 5 (for example, SSIM lies between -1 and 1), so the subjective scores obtained from the tests will be rescaled to a common range before comparison. The objective quality scores can then be tested for their accuracy by comparing them with the subjective scores obtained. Once a suitable objective quality score is determined, it can be used to calculate the objective scores of files with bit-errors introduced into them. A small sketch of the rescaling step is given below.
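The following fragment illustrates the simple linear mapping that can be used to bring a score from one range onto another; the numbers are examples only.

```matlab
% A minimal sketch of linearly rescaling a score s from a native range
% [smin, smax] onto the 1-to-5 MOS-like range, so that subjective and
% objective values can be compared.
smin = -1;  smax = 1;                       % e.g. the range of the SSIM index
s = 0.85;                                   % example score to be rescaled
rescaled = 1 + 4*(s - smin)/(smax - smin);  % maps smin -> 1 and smax -> 5
fprintf('rescaled score: %.2f\n', rescaled);
```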

II. Interpretation of the effect of bit-errors on compressed data

Once the objective scores are obtained for files with bit-errors introduced, an interpretation can be drawn regarding:

• The robustness of each compression technique to bit-errors.
• Whether re-transmission requests for damaged packets over WLAN networks can be avoided for some applications, such as real-time rendition of data.


4. CONCLUSION

This background report has laid the groundwork for completing the project 'The Effect of Bit-Errors on Compressed Speech, Music and Images'. It has outlined the key concerns of the project and its main aims and objectives, and has elaborated on the principles on which the project is based, drawing on current research material. It has also described the research methods that will be used, focusing on the implementation and development of the code that will be used to study and demonstrate the effect of bit-errors on compressed data. Finally, the report gives a timeline that will be followed until submission of the final dissertation, which will present the results obtained from the tests in the form of graphs, together with an evaluation of those results and comments on difficulties encountered while carrying them out.

Compression techniques are becoming commonplace as a way to improve bandwidth efficiency and reduce storage space, and they are getting increasingly complex as they use human perception models to reduce the amount of data to be transmitted. With such a wide variety of applications using compressed data, the results of this project could be used in many ways, for example in feedback mechanisms that adapt transmission data rates to maintain quality, or to reduce the number of re-transmissions required for damaged packets. In the case of speech, the PESQ objective measurement has had many uses in mobile telephony, including determining network quality and which compression technique should be used for a particular application. In the same way, an appropriate method for objectively measuring the quality of compressed music, images and video would be useful as the number of applications using these kinds of data increases. A study of this kind does not appear to have been conducted before, so the results should be interesting.


References

[1] ITU. (1996). ITU-T Recommendation P.800, Methods for Subjective Determination of Transmission Quality. Geneva: International Telecommunication Union.
[2] Piquepaille, R. (2007). Virtual Choir Singing on the Web [Online]. Available at: http://www.zdnet.com/blog/emergingtech/virtual-choir-singing-on-the-web/637 [Accessed: 11 May 2010].
[3] Ghrare, S.E., Ali, M.A.M., Ismail, M. and Jumari, K. (2008). The Effect of Image Data Compression on the Clinical Information Quality of Compressed Computed Tomography Images for Teleradiology Applications. European Journal of Scientific Research, 23(1), 6-12.
[4] Schiller, J.H. (2003). Mobile Communications. 2nd ed. London: Addison-Wesley.
[5] Hanes, D. and Salgueiro, G. (2008). Chapter 4: Passthrough [Online]. Available at: http://www.networkworld.com/subnets/cisco/081308-ch4-ip-telephony.html [Accessed: 30 April 2010].
[6] Styger, T. and Keller, E. (1994). Formant synthesis. In E. Keller (Ed.), Fundamentals in Speech Synthesis and Speech Recognition (pp. 109-128). Wiley.
[7] Wolfe, J., Garnier, M. and Smith, J. Voice Acoustics: An Introduction [Online]. Available at: http://www.phys.unsw.edu.au/jw/voice.html [Accessed: 25 April 2010].
[8] Oppenheimer, P. Digitizing Human Vocal Communication [Online]. Available at: http://www.troubleshootingnetworks.com/language.html [Accessed: 10 April 2010].
[9] Atal, B.S. and Hanauer, S.L. (1971). Speech analysis and synthesis by linear prediction of the speech wave. Journal of the Acoustical Society of America, 50, 637-655.
[10] Wilburn, T. (2007). The Audio File: Understanding MP3 Compression [Online]. Available at: http://arstechnica.com/old/content/2007/10/the-audiofile-understanding-mp3-compression.ars/ [Accessed: 15 April 2010].
[11] Fletcher, H. and Munson, W.A. (1933). Loudness, its definition, measurement and calculation. Journal of the Acoustical Society of America, 5, 82-108.
[12] Rabiner, L.R. and Schafer, R.W. (2007). Introduction to Digital Speech Processing. Foundations and Trends in Signal Processing, 1(1-2), 1-194.
[13] Johnson, D. (2009). Modelling the Speech Signal [Online]. Available at: http://cnx.org/content/m0049/latest/ [Accessed: 10 April 2010].
[14] Wolfe, J. (1996). dB: What is a decibel? [Online]. Available at: http://www.phys.unsw.edu.au/jw/dB.html [Accessed: 9 May 2010].
[15] Lane, T. (2010). Introduction to JPEG [Online]. Available at: http://www.faqs.org/faqs/compression-faq/part2/section-6.html [Accessed: 20 April 2010].
[16] King, A. (2004). Graphics, Blur Backgrounds for Optimized JPEGs [Online]. Available at: http://www.websiteoptimization.com/speed/tweak/blur/ [Accessed: 26 April 2010].


[17] Beerends, J.G., Hekstra, A.P., Rix, A.W. and Hollier, M.P. (2002). Perceptual evaluation of speech quality (PESQ), the new ITU standard for end-to-end speech quality assessment, part II - psychoacoustic model. Journal of the Audio Engineering Society, 50.
[18] ITU. (2001). ITU-R BS.1387-1, Method for the Objective Measurements of Perceived Audio Quality. Geneva: International Telecommunication Union.
[19] Campbell, D., Jones, E. and Glavin, M. (2009). Audio quality assessment techniques: a review and recent developments. Signal Processing, 89, 1489-1500.
[20] Salovarda, M., Bolkovac, I. and Domitrovic, H. (2004). Comparison of Audio Codecs using PEAQ Algorithm. 6th CUC Conference Proceedings, Zagreb, Croatia, 27-29 September 2004.
[21] Wang, Z., Bovik, A.C. and Lu, L. (2002). Why is image quality assessment so difficult? Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 4, 3313-3316.
[22] Wang, Z. and Bovik, A.C. (2002). A Universal Image Quality Index. IEEE Signal Processing Letters, 9(3).
[23] Wang, Z., Bovik, A.C., Sheikh, H.R. and Simoncelli, E.P. (2004). Image Quality Assessment: From Error Visibility to Structural Similarity. IEEE Transactions on Image Processing, 13(4).
[24] Konnik, M.V. (2009). Small Survey of Objective Image Quality Metrics [Online]. Available at: http://www.mvkonnik.info/2009/03/small-survey-of-objective-image-quality.html [Accessed: 4 May 2010].
[25] Cadik, M. Evaluation of Image Quality Metrics [Online]. Available at: http://www.cgg.cvut.cz/members/cadikm/iqm/ [Accessed: 1 May 2010].
[26] Yamsang, N. and Udomhunsakul, S. (2009). Image Quality Scale (IQS) for Compressed Images Quality Measurement. Proceedings of the International MultiConference of Engineers and Computer Scientists, Hong Kong, 18-20 March 2009.
[27] Kabal, P. (2002). An Examination and Interpretation of ITU-R BS.1387: Perceptual Evaluation of Audio Quality. TSP Lab Technical Report, Dept. ECE, McGill University, May 2002.
[28] ITU. (2001). ITU-T Recommendation P.862 (02/01), Perceptual evaluation of speech quality (PESQ), an objective method for end-to-end speech quality assessment of narrowband telephone networks and speech codecs. Geneva: International Telecommunication Union.
[29] ITU. (2003). ITU-R BS.1284-1, General Methods for the Subjective Assessment of Sound Quality. Geneva: International Telecommunication Union.
[30] ITU. (2002). ITU-R BT.500-11, Methodology for the Subjective Assessment of the Quality of Television Pictures. Geneva: International Telecommunication Union.
[31] Wang, Z. (2004). Zhou Wang's Homepage [Online]. Available at: http://www.cns.nyu.edu/~zwang/ [Accessed: 20 April 2010].
[32] Audacity Software, version 1.3.12 (Beta) [Online]. Available at: http://audacity.sourceforge.net/ [Accessed: 20 April 2010].


Appendix A: Project Plan

The project has to be completed by 10th September 2010. After submission of this background report there will be about 12 weeks in which to complete it. A project plan has been created by dividing the project into three main parts: Speech Analysis, Music Analysis and Image Analysis. Music and image analysis are expected to take more time to complete, as the Mean Opinion Score calculation for these is not as straightforward as it is for speech; more time has therefore been allocated to them. The following individual tasks have been identified for the project:

Task     Name                                                 Duration   Start      Finish
Task 1   Background Research and writing Background Report    21 days    15-04-10   12-05-10
Task 2   Semester 2 Exams and Preparation                     15 days    15-05-10   03-06-10
         Part 1: Speech Files                                  8 days    07-06-10   16-06-10
Task 3   Reading speech files and introducing errors           3 days    07-06-10   09-06-10
Task 4   Calculating MOS scores                                3 days    10-06-10   14-06-10
Task 5   Plotting graphs for speech                            3 days    15-06-10   17-06-10
         Dissertation Writing                                 10 days    08-06-10   19-06-10
         Part 2: Music Files                                  17 days    19-06-10   12-07-10
Task 6   Carrying out subjective music quality tests           6 days    19-06-10   25-06-10
Task 7   Reading music files and introducing errors            4 days    19-06-10   23-06-10
Task 8   Calculating quality rating                           10 days    24-06-10   07-07-10
         Plotting graphs for music                             4 days    08-07-10   13-07-10
         Dissertation Writing                                 14 days    24-06-10   13-07-10
Task 9   Part 3: Image Files                                  13 days    14-07-10   30-07-10
Task 10  Carrying out subjective image quality tests           6 days    14-07-10   21-07-10
Task 11  Reading image files and introducing errors            3 days    14-07-10   16-07-10
         Calculating quality rating                            7 days    19-07-10   27-07-10
Task 12  Plotting graphs for images                            3 days    28-07-10   30-07-10
Task 13  Dissertation Writing                                 14 days    14-07-10   02-08-10
Task 14  Complete Dissertation                                21 days    03-08-10   30-08-10


The following Gantt chart shows the approximate durations allocated for completing these tasks.

[Gantt chart showing the project tasks and their approximate durations]