Assessment and prediction of speech transmission quality ... · Speech communication over long...

Assessment and prediction of speech transmission quality with an auditory processing model

Dissertation accepted by the Department of Physics of the University of Oldenburg for the degree of Doctor of Natural Sciences (Dr. rer. nat.)

Martin Hansen, born 20 September 1967 in Flensburg


First referee: Prof. Dr. Dr. Birger Kollmeier
Second referee: Prof. Dr. Volker Mellert
Date of the oral defense: 17 June 1998


Abstract

In this thesis, an objective prediction method for the transmission quality of low-bit-rate speech coding algorithms is described. A quantitative processing model of the auditory system is employed to objectively measure the perceptually relevant deviations between the coded, distorted signal and the corresponding reference signal. The inherent parameters of the processing model were derived directly from psychoacoustical data independent of the present studies. The auditory processing is applied to transform the corresponding speech signals to an internal representation which is thought of as the information that is accessible to higher neural stages of perception. The correlation coefficient between these two internal representations constitutes the objective speech quality measure qC. It shows a high performance in the prediction of the mean opinion score data of various low-bit-rate coded speech test data bases, if a frequency-dependent weighting is applied that exhibits increasing weights for increasing center frequencies of the filter channels of the internal representation.

This non-uniform relative importance of different critical bands for the perception of speech transmission quality is further investigated in two experiments. Two algorithms are introduced that generate a band-specific modulated-noise distortion in the speech signal. Detection thresholds are measured as a function of the center frequency of the band used for generating the distortion. Pairwise speech quality preferences of these distortions are assessed at levels of the modulation depth that are selected relative to the respective detection thresholds. The detection thresholds are modeled with only small deviations by assuming a constant value of qC at threshold, if a constant weighting of the filter channels of the internal representation is employed. No satisfactory prediction of the detection thresholds is found if a weighting increasing with frequency is employed. Similarly, pairwise speech quality preference ratings are modeled by the difference ∆qC of the measure qC with constant spectral weighting, but not with the spectral weighting increasing with frequency.

To extend the current speech quality assessment methods beyond the limits of stationary transmission conditions, a new method is introduced for continuously assessing the time-varying speech quality. Different sequences of sentences are degraded in quality by a modified Modulated Noise Reference Unit with a time-varying modulation depth. Subjects can monitor the speech quality variations very accurately by moving a slider along a graphical scale. The assessment task is performed in a highly consistent way across subjects with respect to the use of the slider rating scale and the delay of approximately 1 s relative to the expected target slider position. The new objective speech quality measure qC is modified to allow for a time-dependent quality prediction for frames of 20 ms. Except for the same delay of 1 s, the subjective assessment results can be predicted very well by the time-dependent measure qC(t) if a low-pass filter at 0.5 Hz is applied to qC(t) in order to reduce its short-time variability.

The prediction method introduced here may be used for monitoring speech transmission quality in realistic time-varying conditions and for the quality optimization of speech coding algorithms, e.g., for mobile telephones.


Zusammenfassung (Summary)

In this thesis, an objective method for predicting the speech transmission quality of low-bit-rate speech coding algorithms is described. A quantitative model of the “effective” signal processing in the auditory system was used to objectively determine the perceptually relevant deviations between the coded, distorted signal and the corresponding reference signal. The parameters contained in the model were derived from psychoacoustical data independent of the present work. The signal processing model was used to map the two speech signals onto an internal representation that contains the information available to subsequent neural stages for evaluation. The correlation coefficient between the two internal representations constitutes the objective speech quality measure qC. It shows a high correlation with the subjective mean opinion score data of various low-bit-rate coded test data bases, provided that a frequency-dependent weighting of the individual bands of the internal representation is applied which increases with the band center frequency.

This non-uniform relative spectral weighting of different critical bands in the measurement of speech transmission quality was further investigated in two experiments. Two algorithms were introduced that generate a band-specific modulated-noise distortion in the speech signal. Detection thresholds were measured as a function of the center frequency of the generating band. The detection thresholds were modeled with small deviations by assuming a constant value of qC, if a constant weighting of the filter channels of the internal representation is used. For the weighting increasing with center frequency, in contrast, no satisfactory prediction of the detection thresholds could be achieved. Subsequently, the pairwise subjective speech quality preference for these distortions, generated at various modulation depth parameters relative to the respective detection threshold, was assessed. The pairwise subjective speech quality preference judgements could be modeled by the difference ∆qC between the measures qC of the compared sentences, again if a constant weighting was used, but not for the increasing weighting.

To extend the existing speech quality assessment methods, which are restricted to stationary transmission conditions, a new method for the continuous assessment of time-varying speech quality was introduced. Sequences of different test sentences were altered in their speech quality by a modified Modulated Noise Reference Unit with a time-varying modulation depth. Subjects were able to judge the variation of the speech quality very accurately by continuously adjusting a slider along a scale. The subjects performed the assessment task very consistently with respect to their use of the slider scale, with a delay of approx. 1 s relative to the expected slider position. The new objective speech quality measure qC was modified to allow for a time-dependent quality prediction based on segments of 20 ms length. Apart from the delay of 1 s, the subjective judgements could be predicted very well by qC(t) if a low-pass filter at 0.5 Hz is used to reduce the short-time variability.

The speech quality prediction method presented here can be used to monitor the speech transmission quality in realistic time-varying systems and, for example, for the quality optimization of speech coding algorithms for mobile telephony.


Contents

1 General introduction

2 Objective speech quality measurement
  2.1 Introduction
  2.2 Subjective speech quality tests
  2.3 Method for objective speech quality measurement
    2.3.1 Prealignment
    2.3.2 Concatenation
    2.3.3 Auditory processing model
    2.3.4 Distance measure
  2.4 Objective speech quality prediction results
  2.5 Discussion
    2.5.1 Effect of band importance weighting
    2.5.2 Optimal band weighting: Are high weights necessary because of frequency band limitations?
    2.5.3 Alternative justification of the optimal band weighting
    2.5.4 Optimal choice of parameters in adaptation loops
    2.5.5 Comparison with other processing algorithms
    2.5.6 Comparison of qC and PSQM
  2.6 Summary and Conclusion

3 Perception of speech quality distortions
  3.1 Introduction
  3.2 Experimental setup
    3.2.1 Speech stimulus material and test signal generation
    3.2.2 Generation of band-specific distortions
    3.2.3 Detection of band-specific distortions
    3.2.4 Quality assessment by paired comparison
    3.2.5 Subjects
  3.3 Results
    3.3.1 Detection thresholds
    3.3.2 Speech quality assessed by paired comparison
  3.4 Modeling predictions
    3.4.1 Modeling distortion detection thresholds
    3.4.2 Modeling speech quality by pairwise comparison
  3.5 Discussion
  3.6 Summary and conclusion

4 Continuous assessment of speech quality
  4.1 Introduction
  4.2 Experimental Setup
    4.2.1 Speech stimulus material
    4.2.2 Generation of quality degradation
    4.2.3 Experiment 1: Quality assessment of isolated words
    4.2.4 Experiment 2: Continuous assessment of time varying speech quality
    4.2.5 Test subjects
  4.3 Results
    4.3.1 Quality assessment of isolated words
    4.3.2 Continuous assessment of speech quality
  4.4 Discussion
    4.4.1 Quality scores based on isolated words
    4.4.2 Continuous scaling method
  4.5 Modeling continuous speech quality
  4.6 Summary and conclusion

5 Summary and Conclusion

A Reprint of “Prediction of Speech Quality based on Psychoacoustical Preprocessing Models”

B The relation of qC and the likelihood ratio l

C Calculation of the time-dependent speech quality measures

D Evaluating a hardware implementation with qC

Bibliography

Acknowledgements

Curriculum vitae


Chapter 1

General introduction

Speech communication over long distances has become one of the most prominent attributes of our modern culture. In this context, speech processing and data reduction algorithms have become indispensable features that allow the optimal use of transmission media and increase the number of simultaneously transmitted conversations. The modern telephone communication networks provide a wide range of voice services using many transmission systems. In particular, the rapid development of digital technologies in the area of mobile telephony has led to an increased need for evaluating and optimizing the transmission characteristics of the devices involved. Hence, the area of speech transmission quality evaluation both by subjective and objective methods has undergone a rapid and continuous development during the past two decades. This thesis is concerned with developing and testing new psychometric evaluation methods and auditory-model-based prediction methods of speech quality assessment.

The assessment of speech quality is mainly of interest for the evaluation of speech transmission systems which offer 100% or very nearly 100% speech intelligibility, because these systems cannot be distinguished by speech intelligibility measures (Sotscheck, 1992; Preminger and Van Tasell, 1995). Subjective speech quality tests seek to quantify the range of opinions that listeners express when they listen to speech transmission systems that are under test. For the evaluation of these systems, many subjective assessment procedures have been developed, refined, and standardized over the past decades. The different methods may be distinguished by many aspects. They can, e.g., involve conversational tests or listening-only tests. Conversational tests (i.e., tests where two subjects have to listen and talk interactively via a transmission system) achieve a more realistic test environment for the assessment of speech quality. On the other hand, they are much more time-consuming to perform and are often subject to lower reproducibility.


For most cases the recommended method is therefore a listening-only test. Commonly used subjective test methods are Absolute Category Methods, Degradation Category Methods, Detectability Methods, Comparison Category Methods, and Threshold Methods. Recent reviews on several aspects of speech quality assessment are given in, e.g., (Quackenbush et al., 1988; Kitawaki, 1990; Sotscheck, 1992; Dimolitsas, 1993; Jekosch, 1993; Kroon, 1995). One widely used method is the direct evaluation of the speech quality by an Absolute Category Rating (ACR). The subject is presented with short groups of unrelated sentences which were passed through a system under test. Typically, the subject’s task is to rate his/her impression on a five-point scale with absolute categories. An estimate of the quality is then the arithmetic mean of the responses of all subjects, which is called the mean opinion score (MOS). Other currently recommended assessment methods are described in, e.g., (ITU-T, 1996a; 1996d).
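As a minimal sketch of the MOS computation described above, the following Python fragment averages the ACR ratings of a group of listeners. The ratings are invented for illustration, and the function name `mean_opinion_score` is hypothetical rather than taken from any standard.

```python
from statistics import mean, stdev

# Hypothetical ACR ratings (1 = lowest, 5 = highest category) from eight
# listeners for one system under test; the numbers are invented.
ratings = [4, 3, 4, 5, 3, 4, 4, 3]

def mean_opinion_score(scores):
    """Arithmetic mean of category ratings on a five-point ACR scale."""
    if not scores:
        raise ValueError("need at least one rating")
    if any(s < 1 or s > 5 for s in scores):
        raise ValueError("ACR ratings must lie between 1 and 5")
    return mean(scores)

mos = mean_opinion_score(ratings)
spread = stdev(ratings)  # sample standard deviation across listeners
print(f"MOS = {mos:.2f}, standard deviation = {spread:.2f}")
```

Reporting the spread alongside the mean is a design choice of this sketch; it gives a rough impression of how much the listeners disagree.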

Subjective speech quality data acquired with good reliability and reproducibility generally require large investments in terms of technical equipment and manpower. Such efforts are necessary and accepted for standardization or specification tests such as those performed to establish the GSM codec standard in Europe. The costs of these tests are, however, unacceptable during the development of algorithms and devices. Therefore non-auditive, i.e., purely instrumental, methods for quality judgement have long been of great interest (Heute, 1996). The goal in objective speech quality measurement is to predict speech transmission quality based on objective measures of physical parameters and properties of the speech signal waveform. An automated implementation of an objective test requires significantly less effort, time, and expense than the corresponding subjective tests. On the other hand, a considerable deviation between the objective and subjective test results is often observed (Voran, 1994). In this case subjective results are generally considered to hold the “correct answer”. The performance of objective measures is therefore judged by their respective ability to approximate the subjective speech quality results as closely as possible.

The first widely successful means of an objective speech intelligibility prediction was the articulation index (AI) (French and Steinberg, 1947). The AI is defined as the weighted sum of the signal-to-noise ratio (SNR) measured in individual bands. A measure of speech intelligibility which is more generally applicable than the AI is the speech transmission index (STI) (Steeneken and Houtgast, 1980; Houtgast and Steeneken, 1985). The STI extends the concept of a weighted sum of SNRs across frequency bands. It is based on the calculation of the modulation transfer function in seven octave bands. An equivalent SNR is calculated from the modulation index in 14 modulation frequency octave bands. The STI is finally calculated as a weighted mean of the SNRs across the different bands and modulation bands.
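The chain from modulation index to index value can be sketched as follows. This is a simplified illustration, not the standardized STI procedure: the conversion m/(1 - m), the clipping range of ±15 dB, and the uniform band weights are assumptions taken from the commonly quoted textbook form of the STI and are not stated in this text. All function names are hypothetical.

```python
import math

def apparent_snr_db(m):
    """Apparent SNR in dB from a modulation index m with 0 < m < 1."""
    return 10.0 * math.log10(m / (1.0 - m))

def transmission_index(snr_db, lo=-15.0, hi=15.0):
    """Clip the SNR to [lo, hi] dB and map it linearly onto [0, 1]."""
    clipped = max(lo, min(hi, snr_db))
    return (clipped - lo) / (hi - lo)

def sti_like_index(mod_indices, band_weights):
    """Weighted mean over bands of the mean transmission index per band.

    mod_indices: one list of modulation indices per frequency band.
    band_weights: one non-negative weight per band (normalised here).
    """
    total = sum(band_weights)
    index = 0.0
    for weight, band in zip(band_weights, mod_indices):
        band_ti = sum(transmission_index(apparent_snr_db(m)) for m in band) / len(band)
        index += (weight / total) * band_ti
    return index

# Hypothetical data: 7 bands x 14 modulation frequencies, uniform weights.
m_matrix = [[0.5] * 14 for _ in range(7)]
weights = [1.0] * 7
sti_value = sti_like_index(m_matrix, weights)
print(f"STI-like index = {sti_value:.3f}")
```

A modulation index of 0.5 corresponds to an apparent SNR of 0 dB, which sits in the middle of the clipping range.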

Both the STI and the AI are only well-defined for linear systems which may additionally exhibit additive noise. For these systems, STI and AI correlate very well with speech intelligibility scores. The application of the STI to nonlinear systems has been investigated extensively in the context of hearing aids and speech perception of hearing-impaired subjects (Hohmann and Kollmeier, 1995; Kollmeier, 1990; Plomp, 1988; Villchur, 1989). However, for more strongly nonlinear systems such as low-bit-rate speech coding devices, the applicability of STI and AI is very limited. In addition, these measures are not able to predict speech transmission quality, since this aspect is mostly considered for systems with nearly 100% speech intelligibility, where the STI and AI approach their limiting value of 1.

Attempts have been made to construct new measures based on the SNR concept. These led to the so-called segmental SNR measures, which are merely a sum of the SNR across spectral bands and temporal segments (cf. Mermelstein (1979); Quackenbush et al. (1988)). Another type of speech quality measure was constructed by linear combination of several psychoacoustical “one-number” measures (e.g., Berger and Merkel (1994b); Petersen et al. (1997)), such as loudness, (un)pleasantness, roughness, sharpness, and similar quantities for which computation algorithms exist. The weighting factors of the terms in the linear sum are subject to numerical optimization. However, segmental SNR measures and the linear combination measures tend to be optimal only for one typical class of signal degradations. They often fail when applied to other transmission systems for which they were not optimized and are of especially limited use for unknown systems yet to be developed.
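A time-domain segmental SNR can be sketched as follows. The frame length of 160 samples (20 ms at 8 kHz) and the limits of -10 dB and 35 dB are illustrative assumptions, the per-frame values are averaged rather than summed, and the combination across spectral bands mentioned above is omitted for brevity. The function name and test signal are hypothetical.

```python
import math

def segmental_snr_db(reference, degraded, frame_len=160, floor_db=-10.0, ceil_db=35.0):
    """Mean over frames of the per-frame SNR in dB, limited to [floor_db, ceil_db]."""
    if len(reference) != len(degraded):
        raise ValueError("signals must have equal length")
    frame_snrs = []
    for start in range(0, len(reference) - frame_len + 1, frame_len):
        ref = reference[start:start + frame_len]
        deg = degraded[start:start + frame_len]
        signal_energy = sum(x * x for x in ref)
        noise_energy = sum((x - y) ** 2 for x, y in zip(ref, deg))
        if signal_energy == 0.0:
            continue  # skip silent reference frames
        if noise_energy == 0.0:
            snr = ceil_db  # identical frames: use the upper limit
        else:
            snr = 10.0 * math.log10(signal_energy / noise_energy)
        frame_snrs.append(max(floor_db, min(ceil_db, snr)))
    if not frame_snrs:
        raise ValueError("no usable frames")
    return sum(frame_snrs) / len(frame_snrs)

# Hypothetical test signal: a 200 Hz sine sampled at 8 kHz, degraded by a
# constant offset of 0.1.
fs = 8000
clean = [math.sin(2 * math.pi * 200 * n / fs) for n in range(1600)]
noisy = [x + 0.1 for x in clean]
seg_snr = segmental_snr_db(clean, noisy)
print(f"segmental SNR = {seg_snr:.1f} dB")
```

Limiting each frame before averaging keeps a few very quiet or very clean frames from dominating the result, which is one reason such limits are commonly applied.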

Recently, models of the signal processing in the auditory system have been applied to methods for objective speech quality measurement. In chapter 2 of this thesis a new objective speech quality measure of this kind is described. The focus of the investigations is put on the auditory processing algorithm that aims to describe the perception strategies of the auditory system by employing an “effective” model of the signal processing. Free parameters of the algorithm were motivated from extensive psychoacoustical modeling of detection and masking experiments (Dau et al., 1996a; 1996b). The speech quality measure was applied to predict the results of four different test data bases containing low-bit-rate coded speech from various coding algorithms. The influence of the parameters on the performance of the speech quality prediction was investigated. Objective speech quality measures with alternative approaches to model the auditory preprocessing were implemented and their performance is compared with the new speech quality measure.

For the development of the objective speech quality measure, a band-specific weighting had to be applied in order to yield maximum performance. The optimal weighting emphasizes the highest filter channels more strongly than those at low and medium center frequencies. This unexpected frequency weighting is considered further in chapter 3. The assumption is tested that the perception of speech quality involves a certain band-weighting of the critical bands contained in the internal representation of a speech signal. Two types of band-specific modulated-noise distortions were generated. In two experiments the detectability and the perception of speech quality associated with these kinds of signal degradation are investigated and compared with model predictions obtained with different weighting characteristics.

The objective speech quality assessment methods considered so far are restricted to stationary transmission conditions and most of them are optimized in the frequency domain. Chapter 4 extends the approach to the time domain. A method for time-continuous assessment of speech quality is introduced. Experiments are described that address the question whether subjects are able to assess the perceived time-varying quality of speech material continuously. Stimuli with a controlled time-varying speech quality degradation were generated and their speech quality assessed continuously by subjects. The experimental results were compared with model predictions by the objective speech quality measure. For this purpose the speech quality measure described in chapter 2 was modified to allow for a time-continuous prediction of the speech quality.
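How a frame-wise quality track might be smoothed for comparison with continuous slider ratings can be sketched as follows. The frame length of 20 ms and the cutoff of 0.5 Hz are the values quoted in the abstract of this thesis; the first-order recursive filter, the synthetic quality track, and the function name are assumptions, since the filter type is not specified here.

```python
import math

def lowpass_first_order(values, cutoff_hz, frame_rate_hz):
    """First-order recursive low-pass applied to a frame-wise sequence."""
    a = math.exp(-2.0 * math.pi * cutoff_hz / frame_rate_hz)
    smoothed = []
    state = values[0]  # start at the first value to avoid an onset transient
    for v in values:
        state = a * state + (1.0 - a) * v
        smoothed.append(state)
    return smoothed

frame_rate = 1.0 / 0.020  # 20 ms frames give 50 values per second

# Hypothetical frame-wise quality track: a slow 0.1 Hz trend plus a fast
# 10 Hz ripple standing in for short-time variability.
q_t = [0.8
       + 0.1 * math.sin(2 * math.pi * 0.1 * n / frame_rate)
       + 0.05 * math.sin(2 * math.pi * 10.0 * n / frame_rate)
       for n in range(1000)]

q_smooth = lowpass_first_order(q_t, cutoff_hz=0.5, frame_rate_hz=frame_rate)
print(f"range before: {max(q_t) - min(q_t):.3f}, after: {max(q_smooth) - min(q_smooth):.3f}")
```

The slow trend passes almost unchanged while the fast ripple is strongly attenuated, which is the qualitative effect a 0.5 Hz low-pass is meant to have on a short-time quality measure.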

Yet another possible application of the objective speech quality measure is presented in Appendix D: A hardware implementation of the auditory processing model (e.g., on an integrated circuit “Silicon Ear”) can be evaluated and optimized by using the objective speech quality measure qC. The deviations between the “ideal” auditory processing model and the hardware implementation with fixed precision can be evaluated in a perceptually relevant way.


Chapter 2

Objective measurement of speech quality with a psychoacoustically validated auditory model¹

¹ A modified version of this chapter has been submitted for publication in the J. Audio Eng. Soc.: Hansen and Kollmeier (1998) “Objective modeling of speech quality with a psychoacoustically validated auditory model”.

Abstract

A new objective measure for the transmission quality of low-bit-rate speech coding algorithms is described and tested. A quantitative psychoacoustical signal processing model is employed to objectively measure the perceptually relevant deviations between the transmitted, degraded signal and the corresponding reference signal. The inherent parameters of the auditory processing model were derived directly from psychoacoustical data independent of the present study. The processing is applied to transform the coded (distorted) and the corresponding original speech signal to an internal representation which is thought of as the information that is accessible to higher neural stages of perception. From a comparison of these internal representations a quality measure can be derived that shows a high correlation to the subjective Mean Opinion Score data of various test data bases.

The auditory processing model was previously applied to model detection thresholds of psychoacoustical masking experiments (Dau et al., 1996a; 1996b). It consists of a linear gammatone filterbank with center frequencies between 350 Hz and 3.8 kHz as the first stage, followed by a halfwave rectification and a low-pass filter at 1 kHz modeling the haircell transformation. A nonlinear adaptation stage accounts for realistic dynamic compression and temporal masking effects. The resulting objective speech quality measure qC shows a high performance in the prediction of the MOS data of various low-bit-rate coded speech test data bases, if a frequency-dependent weighting is applied which exhibits increasing weights for increasing center frequencies of the filter channels of the internal representation. The influence of the processing stages on the performance of qC was investigated by comparing them with alternative models for the dynamic compression and with the PSQM model from the literature. The parameters of the original processing model optimized for psychoacoustical modeling also yield the optimal performance in predicting subjective speech quality data.
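The idea of comparing two internal representations with a band-dependent weighting can be illustrated by a weighted correlation coefficient. This is only a sketch under stated assumptions: the exact definition of qC is given in section 2.3.4 and may differ from the form used here, and the data, the weights, and the function name `weighted_correlation` are hypothetical.

```python
import math

def weighted_correlation(rep_ref, rep_test, band_weights):
    """Weighted Pearson correlation over all (channel, frame) cells.

    rep_ref, rep_test: lists of channels, each a list of frame values.
    band_weights: one non-negative weight per channel, applied to every
    frame of that channel.
    """
    w_sum = mean_x = mean_y = 0.0
    for w, xs, ys in zip(band_weights, rep_ref, rep_test):
        for x, y in zip(xs, ys):
            w_sum += w
            mean_x += w * x
            mean_y += w * y
    mean_x /= w_sum
    mean_y /= w_sum
    cov = var_x = var_y = 0.0
    for w, xs, ys in zip(band_weights, rep_ref, rep_test):
        for x, y in zip(xs, ys):
            cov += w * (x - mean_x) * (y - mean_y)
            var_x += w * (x - mean_x) ** 2
            var_y += w * (y - mean_y) ** 2
    return cov / math.sqrt(var_x * var_y)

# Hypothetical 3-channel representations with 4 frames each.
reference = [[1.0, 2.0, 3.0, 4.0], [2.0, 1.0, 0.0, 1.0], [0.5, 0.7, 0.9, 0.4]]
degraded = [[1.1, 1.9, 3.2, 3.8], [2.0, 1.2, 0.1, 0.9], [0.9, 0.2, 0.5, 0.8]]

uniform = [1.0, 1.0, 1.0]
rising = [1.0, 2.0, 4.0]  # weights increasing with channel index

q_uniform = weighted_correlation(reference, degraded, uniform)
q_rising = weighted_correlation(reference, degraded, rising)
print(f"uniform weighting: {q_uniform:.3f}, rising weighting: {q_rising:.3f}")
```

Changing the weights changes how strongly deviations in a given channel affect the result, which is the mechanism the frequency-dependent weighting relies on.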


2.1 Introduction

Within the last 15 years, several aspects and models of the signal processing in the auditory system have been applied to methods for objective speech quality measurement (Karjalainen, 1983; Quackenbush et al., 1988; Wang et al., 1991; Herre et al., 1992; Baillard et al., 1992; Beerends and Stemerdink, 1994b; Colomes et al., 1994; Hollier and Hawksford, 1995; Hansen and Kollmeier, 1997b). In general, all quality assessment methods, subjective and objective, aim to quantify the quality (or the quality degradation) of a speech sample relative to an undegraded reference situation. In some listening tests, the reference situation may be given implicitly by the instruction and the common knowledge and quality expectation of the subject. In comparison methods and degradation assessment methods, the reference situation is presented to the subject explicitly. In all objective methods known to the author, the corresponding reference signal has to be supplied explicitly with each test signal in order to provide the necessary “world knowledge” that a test subject generally has. The underlying idea in most of these methods is to compare the test signal and the reference signal on the basis of the so-called internal representation of the two signals. The two signals are transformed to a psychoacoustically motivated representation by an appropriate transformation, also called “auditory processing”. The transformation incorporates the knowledge of the signal processing which takes place in the periphery of the auditory system. It should model aspects of spectral and temporal masking. The internal representation is thought to contain the information that is available to higher neural stages of perception.

Most of the above-mentioned methods try to reach a psychoacoustically motivated signal representation by two main processing stages:

• Spectral analysis with transformation of the linear frequency scale to a Bark or ERB scale, or summation of energies within critical bands,

• Nonlinear transformation/compression of intensity I to loudness L, most commonly by a power law of the form L = I^α.
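The two stages listed above might be sketched as follows. The ERB-rate formula is the widely used approximation by Glasberg and Moore (1990), which is an assumption here since the text does not name a specific scale formula, and the function names are hypothetical. The exponents 0.23 and 0.001 are the two values of α discussed in this section.

```python
import math

def hz_to_erb_rate(f_hz):
    """ERB-rate (Glasberg and Moore approximation) for a frequency in Hz."""
    return 21.4 * math.log10(4.37 * f_hz / 1000.0 + 1.0)

def compress(intensity, alpha=0.23):
    """Power-law compression L = I**alpha for a non-negative intensity."""
    return intensity ** alpha

# Frequency warping: equal steps in Hz are not equal steps on the ERB scale.
for f in (350.0, 1000.0, 3800.0):
    print(f"{f:6.0f} Hz -> {hz_to_erb_rate(f):5.2f} ERB")

# Compression: a factor of 10 in intensity raises L by a factor of 10**alpha.
for alpha in (0.23, 0.001):
    ratio = compress(10.0, alpha) / compress(1.0, alpha)
    print(f"alpha = {alpha}: tenfold intensity gives ratio {ratio:.4f}")
```

With a very small exponent the compressed value becomes almost independent of the intensity ratio, which illustrates why such a setting departs from ordinary loudness models.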

Schroeder et al. (1979) introduced the concept of a masked threshold. Distortions with a level below this masked threshold are considered to be inaudible and thus do not contribute to a speech quality degradation. This concept is also employed in the Noise-to-Mask-Ratio (NMR) measure (Herre et al., 1992). Beerends and Stemerdink introduced a way to model forward masking and nonlinear frequency masking in their original perceptual speech quality measure (PSQM) (Beerends and Stemerdink, 1994b). However, in the later version that was standardized by the ITU, these parts of the processing were discarded again (ITU-T, 1996b).

Most of the models have a set of parameters that have to be adjusted with the aim of maximizing the correlation between the objective and subjective data. The PSQM, e.g., also contains a parameter for the weighting of silent versus speech intervals. The parameter optimization may depend on the actual test material in a data base, so that a high correlation may not be achieved with a single set of parameters. For some models, the parameters of the psychoacoustical model may be optimized to values far away from those found in the original application. For example, in the PSQM the exponent for loudness compression was optimized to α = 0.001 instead of α = 0.23, which is found for the loudness of stationary sounds. Beerends therefore refers to “compressed loudness”, because it is not in line with psychoacoustical models for loudness perception. However, the PSQM is still successful in predicting speech quality. Often, models with a relatively simple auditory signal representation are accompanied by a rather complicated calculation of the final distance measure, which might be responsible for the good performance of the complete method.

In this study, the focus of the investigations is put on the auditory processing algorithm that aims to describe the perception strategies of the auditory system by employing an “effective” model of the signal processing. Free parameters should be motivated from psychoacoustical modeling and remain fixed thereafter. The auditory processing should make it possible to form a general model for signal detection and speech perception with wide applicability.

In recent studies by Dau et al. (1996a; 1996b), such a functional model has been presented which describes the signal processing of a sound signal in the auditory system. The model incorporates physiological and psychoacoustical knowledge and has been applied to predict detection thresholds in a wide variety of psychoacoustical experiments involving both temporal and spectral aspects of masking found in the auditory system (Dau et al., 1996a; 1996b; Münkner, 1993; Fassel, 1994; Sander, 1994; Verhey and Dau, 1997; Kortekaas, 1997).

The success of the signal processing model in the prediction of threshold data suggests that the model extracts the perceptually relevant information from the input signal. However, it leads directly to the question of whether this information can “only” be used to predict threshold data, or whether more complicated phenomena of sound and speech perception, which are presumably located at higher and more cognitive stages of the hearing process, can also be modeled with this information.

An adapted version of the model has been applied to predict speech intelligibility measurements (Holube, 1993; Wesselkamp, 1994). The prediction of intelligibility with the processing model might still be viewed as a task that takes place near a certain threshold. The typical test signals contain the speech signal with additive noise at a certain SNR. The model predicts the SNR that corresponds to 50% intelligibility of the material. This could be modeled similarly to the detection of test tones in a masker signal. Another recent study showed that the auditory processing could also be applied as a feature extraction front end for automatic speech recognition of isolated words (Tchorz et al., 1996; 1997).

In the present study, the application of the model to objective speech quality measurement of low-bit-rate speech codecs is described. This task clearly involves supra-threshold perception.

2.2 Subjective speech quality tests

The subjective listening experiments used here were carried out at the research center of Deutsche Telekom AG, Berlin, which also provided the speech material and the MOS data obtained with an ACR assessment method.

The distorted speech signals were generated by several different low-bit-rate speech coding-decoding devices (“codecs”) such as those used in mobile telephony. These codecs produce a speech signal that is fully intelligible and allows for almost normal speaker identification compared to standard telephony, but that exhibits a clearly reduced speech quality due to their highly nonlinear and/or time-variant speech processing.

The processed speech files were presented to the subjects in a listening-only test via a telephone handset at a listening level of 79 dB SPL. Twelve male and twelve female paid, naive subjects participated in the tests. Their task was to rate the overall speech quality with an ACR method. The subjective MOS data for the overall quality were calculated from the ratings of the 24 subjects for each processed speech file.

Speech test material from four different test data bases was considered, i.e.,

• ETSI Halfrate Selection test (Phase II, 1992 / Exp. 1, IM 4)

• ITU 8kbit Test (Exp. 1)

• Test data base cascaded ADPCM, simulated-net connection

• Test data base cascaded ADPCM, real-net connection


The first two data bases result from tests carried out in many different languages and laboratories to evaluate and select new standard codec algorithms with respect to their subjective speech transmission quality. The last two data bases result from internal tests at Deutsche Telekom AG. They are described in more detail in Berger (1996).

All four data bases consisted of German-language sentence material, recorded from two male and two female speakers. In each data base, different codecs, partly operated in different conditions (digital input level, transmission bit errors) or in tandem mode, were used to process the original speech data files. Each data base also incorporated a set of speech files that were processed by a modulated noise reference unit (MNRU) (CCITT, 1989; Schroeder, 1968). The MNRU serves as a reference system used to compare the results from different experiments. It is also used to span the total range of qualities in the test data material.

Each speech file consists of two different sentences separated by a short pause. The total duration of each speech file is 8 s. The ETSI test material was sampled at 16 kHz, while the sampling frequency was 8 kHz for the other three data bases. The material was subsequently telephone-band-pass filtered by the modified Intermediate Reference System (IRS) filter (ITU-T, 1996c). After processing, all speech files were calibrated to have the same Active Speech Level (ASL) (ITU-T, 1996c) of -30 dBov (see footnote 2).

2.3 Method for objective speech quality measurement

2.3.1 Prealignment

In general, each processed sentence exhibits an overall delay and possibly a difference in level compared to the reference sentence. These (perceptually mostly irrelevant) deviations have to be eliminated prior to computing and comparing the corresponding internal representations of both sentences.

For the level alignment, a constant scaling factor was employed. A time- or frequency-dependent level alignment would otherwise be indistinguishable from those alterations of the test signal that stem from the coding and transmission process and that have to be detected by the model. This scaling factor is chosen so that the reference sentence and the test sentence have equal RMS. This method assumes that a codec does not introduce slow long-term drifts in its overall gain. While this assumption is met for the test material used here, it will not generally be fulfilled, e.g., for realistic time-varying transmission conditions (cf. chapter 4).

Footnote 2: dBov is defined as the level relative to that of a full-range (digitized) DC signal. A full-range sinusoid has a level of -3 dBov.
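The RMS-based level alignment of this section can be sketched in a few lines. This is only an illustration under stated assumptions: the function names `rms` and `align_level` are hypothetical, and the signals are synthetic noise rather than speech material from the data bases.

```python
import numpy as np

def rms(x):
    """Root-mean-square value of a signal."""
    x = np.asarray(x, dtype=float)
    return np.sqrt(np.mean(x ** 2))

def align_level(reference, test):
    """Scale `test` by ONE constant factor so that its RMS equals that of
    `reference`. No time- or frequency-dependent gain is applied."""
    test = np.asarray(test, dtype=float)
    gain = rms(reference) / rms(test)
    return gain * test, gain

# Synthetic example (assumption): the "test" signal is the reference
# attenuated by a factor of 0.5, so the alignment gain should be 2.
rng = np.random.default_rng(0)
reference = rng.standard_normal(8000)
aligned, gain = align_level(reference, 0.5 * reference)
```

Because a single broadband factor is used, any slow gain drift introduced by a codec would remain in the aligned signal, which is the limitation noted in the text.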

Different methods have been investigated for estimating the optimum delay for the temporal alignment. They are based on the cross correlation function, where the delay of the maximum value corresponds to the time lag of a linear time-invariant system.

However, for some codecs in the data bases, the delay estimates calculated from different signals differ substantially. The consistency of the delay estimation can be increased by calculating the cross correlation function of the envelope signals, or by averaging over short-time delay estimates. Still, a fully automatic delay estimation was only feasible for the MNRU-processed signals. For the calculations presented here, fixed, tabulated delay times have therefore been used. The reference sentence was delayed by these delay times relative to the test sentence.
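A minimal sketch of a cross-correlation delay estimate is given below. It is not the procedure finally used in this study (which relied on tabulated delays); the function name `estimate_delay` and the synthetic noise signal are assumptions made for illustration.

```python
import numpy as np

def estimate_delay(reference, test):
    """Return the lag (in samples) at which the cross correlation between
    `test` and `reference` is maximal. A positive value means that
    `test` is delayed relative to `reference`."""
    reference = np.asarray(reference, dtype=float)
    test = np.asarray(test, dtype=float)
    xcorr = np.correlate(test, reference, mode="full")
    # Lags covered by mode="full": -(len(reference) - 1) ... len(test) - 1
    lags = np.arange(-(len(reference) - 1), len(test))
    return int(lags[np.argmax(xcorr)])

# Synthetic example (assumption): white noise delayed by 37 samples.
rng = np.random.default_rng(1)
reference = rng.standard_normal(2000)
true_delay = 37
test_signal = np.concatenate([np.zeros(true_delay), reference])[: len(reference)]
delay = estimate_delay(reference, test_signal)
```

For a pure delay the maximum is unambiguous. For nonlinear, time-variant codecs the peak can be broad or shifted, which is consistent with the inconsistent estimates reported above.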

2.3.2 Concatenation

Since the undistorted reference sentence is not presented to the listener in the subjective speech quality tests, the general “pleasantness” of the voice in the source signal may influence the rating of the distorted speech signal. Moreover, after coding, the voice of one speaker may be rated systematically worse in speech quality than the voice of another speaker. Presumably, these systematic differences are only weakly related to physical parameters found in the corresponding sound signals. To account for the influence of the different speaker voices, each set of four sentences uttered by the different speakers and processed under the same coding condition is concatenated. The resulting concatenated signals are used as input for the objective speech quality measure, whose output is compared with the corresponding mean of the MOS of the four sentences.

2.3.3 Auditory processing: transformation to internal representation

After prealignment and concatenation, the two speech signals are transformed to their “internal representations”, i.e., a two-dimensional time-frequency representation of each signal. This simulates the human ear’s transformation of acoustic signals into neural activity patterns that serve as input to higher cognitive processes (such as, e.g., speech recognition). The transformation is performed by a model of the “effective” signal processing of the auditory system (Dau et al., 1996a; 1996b).

[Figure 2.1: Principle of the auditory processing model. Block labels in the diagram: gammatone filterbank; hair cell transduction (1 kHz); adaptation loops (max, t1 ... t5); 8 Hz; 20 ms; frequency weighting; "internal representation".]

Gammatone filterbank

The first stage in the auditory processing consists of a bank of linear bandpass filters, i.e., 4th-order gammatone filters (Patterson et al., 1987). The filters in the filterbank have been optimized to match the equivalent rectangular bandwidth (ERB) of the auditory system found with a notched-noise masking experiment (Patterson, 1976; Patterson and Nimmo-Smith, 1980).

The ERB b of the filters on the basilar membrane is given as a function of the center frequency f0 by the formula of Moore and Glasberg (1987)

\[ b = l + \frac{f_0}{Q}, \quad \text{with } l = 24.7\,\mathrm{Hz},\; Q = 9.265. \qquad (2.1) \]

The signal is split up into 19 critical band-pass signals with center frequencies from 350 Hz to 3.8 kHz, spaced linearly on the ERB frequency scale with one filter per ERB.
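The spacing of the 19 center frequencies can be illustrated numerically. The sketch below assumes that the position on the ERB scale is obtained by integrating 1/b over frequency, with b = l + f0/Q, l = 24.7 Hz and Q = 9.265, which gives E(f) = Q ln(1 + f/(l Q)). This closed form and all function names are assumptions introduced here, not taken from the text.

```python
import numpy as np

L_HZ = 24.7   # l in the ERB formula, in Hz
Q = 9.265     # Q in the ERB formula

def erb_bandwidth(f0):
    """Equivalent rectangular bandwidth b = l + f0 / Q, in Hz."""
    return L_HZ + f0 / Q

def hz_to_erb_number(f):
    """Position on the ERB scale: integral of 1/b from 0 to f (assumed)."""
    return Q * np.log(1.0 + f / (L_HZ * Q))

def erb_number_to_hz(e):
    """Inverse of hz_to_erb_number."""
    return L_HZ * Q * (np.exp(e / Q) - 1.0)

def center_frequencies(f_low=350.0, f_high=3800.0, n_filters=19):
    """Center frequencies spaced linearly on the ERB scale."""
    e = np.linspace(hz_to_erb_number(f_low), hz_to_erb_number(f_high), n_filters)
    return erb_number_to_hz(e)

fc = center_frequencies()
spacing = np.diff(hz_to_erb_number(fc))   # distance between neighbors in ERB
```

Under this assumption, the range from 350 Hz to 3.8 kHz spans close to 18 ERB, so 19 filters give a spacing of almost exactly one filter per ERB.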

Haircell model

Subsequent to the filterbank, each channel is subjected to halfwave rectification and lowpass filtering. This stage models the transformation from mechanical motion of the basilar membrane to the neural firing rate of the inner haircells. The inner haircells (IHC) are sheared by the fluid motion due to the relative oscillation of the basilar membrane and the organ of Corti. The inner haircells emit an action potential only when bent in one direction. This is modeled by the halfwave rectification. For signal frequencies up to about 1 kHz, the phase of the signal is coded in the primary auditory fibers (Pickles, 1988). This is modeled by a 1st-order lowpass filter at 1 kHz. The combination of halfwave rectification and lowpass filter essentially preserves the envelope of the signal for high (> 1 kHz) center frequencies.
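The haircell stage can be sketched as follows. The text only specifies halfwave rectification and a 1st-order lowpass at 1 kHz; the particular one-pole discretization, the function name `haircell` and the test tone are assumptions of this sketch.

```python
import numpy as np

def haircell(x, fs, cutoff_hz=1000.0):
    """Halfwave rectification followed by a first-order lowpass filter."""
    rectified = np.maximum(np.asarray(x, dtype=float), 0.0)
    # One-pole recursive lowpass with unity DC gain (assumed discretization)
    a = np.exp(-2.0 * np.pi * cutoff_hz / fs)
    y = np.empty_like(rectified)
    state = 0.0
    for n, r in enumerate(rectified):
        state = a * state + (1.0 - a) * r
        y[n] = state
    return y

# Example (assumption): one second of a 1 kHz tone sampled at 16 kHz.
fs = 16000
t = np.arange(fs) / fs
tone = np.sin(2.0 * np.pi * 1000.0 * t)
out = haircell(tone, fs)
```

The output is non-negative, and because the filter has unity gain at DC, its mean follows the mean of the rectified input while faster fluctuations are progressively attenuated.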

Nonlinear adaptation

The second major component in the processing accounts for effects of temporal adaptation and dynamic compression, which are modeled by feedback loops. They were introduced by Püschel to quantitatively model temporal masking effects (Püschel, 1988).

The adaptation stage is depicted in the center part of Fig. 2.1. It consists of a chain of five consecutive feedback loops with time constants ranging from 5 to 500 ms. To account for the absolute hearing threshold (which is assumed to be 100 dB below the maximum input level), the input to the first loop is limited to a corresponding lower bound by a maximum operation. In each loop, the input forms the dividend of a dividing element, the quotient of which is lowpass filtered and fed back as the divisor. For a stationary input (and output) signal, such a loop computes the square root of the input. For a chain of five loops, the output O for a stationary input I is transformed according to

\[ O = \sqrt[2^5]{I} = I^{1/32}, \]

which is very close to a logarithmic compression. This close-to-logarithmic compression is one reason why five loops are required to predict several psychoacoustical effects with the same model.
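The stationary behavior of the loop chain can be checked with a short simulation. The sketch below is hedged: the text gives only the range of the time constants (5 to 500 ms), so the intermediate values, the one-pole lowpass, the initial divisor value and the function name `adaptation_loops` are all assumptions.

```python
import numpy as np

def adaptation_loops(x, fs, taus=(0.005, 0.050, 0.129, 0.253, 0.500), floor=1e-5):
    """Chain of divisive feedback loops applied to an envelope signal.

    taus  : loop time constants in seconds (intermediate values assumed)
    floor : lower bound of the input, 100 dB below a maximum level of 1
    """
    x = np.maximum(np.asarray(x, dtype=float), floor)    # "max" operation
    coeffs = [np.exp(-1.0 / (fs * tau)) for tau in taus]
    states = [1.0] * len(taus)        # divisor of each loop (assumed start value)
    out = np.empty_like(x)
    for n, sample in enumerate(x):
        value = sample
        for k, a in enumerate(coeffs):
            value = value / states[k]                      # input divided by fed-back divisor
            states[k] = a * states[k] + (1.0 - a) * value  # lowpass-filtered quotient
        out[n] = value
    return out

# Stationary input at -40 dB relative to the maximum level, 20 s at fs = 1 kHz.
fs = 1000
level = 0.01
output = adaptation_loops(np.full(20 * fs, level), fs)
expected = level ** (1.0 / 32.0)   # five cascaded square roots
```

After the initial transient has decayed, the last output samples settle at the 32nd root of the input, in line with the stationary relation stated above.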

Note that the input to the adaptation stage is an envelope signal. Changes in the input signal that are rapid in comparison to the time constants of the low-pass filters are transformed approximately linearly. However, if the fluctuations of the input are slow, the low-pass filters will adapt accordingly. Thus, the sensitivity of the system depends on the input signal at previous times. The time constants determine how fast the system will reach a new stationary state.

The output of the fifth loop is linearly scaled by y = mx + b to map stationary input levels I ∈ [0..100] dB to “model units” MU ∈ [0..100]. Note, however, that non-stationary input signals, e.g., at signal onsets and offsets, can temporarily result in output levels clearly above 100 MU.

Finally, the output of the adaptation stage is filtered with a first-order low-pass at 8 Hz. This filter reduces the temporal resolution for envelope fluctuations. The time constant of this filter was optimized by Dau et al. to account for temporal integration in detection and masking experiments (Dau et al., 1996b).


2.3.4 Distance measure

Subsequent to the low-pass filtering in the adaptation stage, the information inherent in the representations is downsampled. This is implemented by a temporal integration across segments of τ = 20 ms duration with no overlap between adjacent segments. The length τ is chosen to match the typical frame rate used in the analysis and synthesis algorithms in speech coders. Note that this downsampling is not performed in the standard processing model, where test signals with durations much shorter than 20 ms may be analyzed.
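The segment-wise downsampling can be sketched as a block average. The text speaks of a temporal integration; using the mean (rather than the sum) and discarding an incomplete final segment are assumptions of this sketch, as is the function name `integrate_segments`.

```python
import numpy as np

def integrate_segments(rep, fs, segment_s=0.020):
    """Average a (samples x channels) representation over non-overlapping
    segments of `segment_s` seconds."""
    rep = np.asarray(rep, dtype=float)
    seg = int(round(segment_s * fs))        # samples per segment
    n_seg = rep.shape[0] // seg             # incomplete last segment is dropped
    trimmed = rep[: n_seg * seg]
    return trimmed.reshape(n_seg, seg, rep.shape[1]).mean(axis=1)

# Example (assumption): 8 s at 8 kHz in 19 channels, each channel constant.
fs = 8000
rep = np.tile(np.arange(19.0), (8 * fs, 1))
frames = integrate_segments(rep, fs)
```

For a speech file of 8 s duration this yields 400 frames per channel.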

The apparent differences in the internal representations of the reference signal and the test signal have to be combined into a distance or similarity measure by an appropriate algorithm which characterizes the overall perceptual similarity or dissimilarity of the two signals. Two different measures have been investigated, both of which have been kept relatively simple. This approach differs from other objective speech quality measures, where considerable effort has been spent on the distance measure (Beerends and Stemerdink, 1994b; 1994a; ITU-T, 1996b).

Let $X_{i,j}$ and $Y_{i,j}$ denote the internal representations of the reference sentence and the test sentence, respectively, at times $t = i \cdot \Delta t$, $i = 1 \ldots N$, and center frequencies $f = f_0 + j \cdot \Delta f$, $j = 1 \ldots M$. To account for the fact that the individual bands of the signal representation may have different importance for the perceived speech quality, a set of linear weights $w_j$ was applied to the bands of the two representations. These weights transform the representations $X_{i,j}$ and $Y_{i,j}$ into $X^w_{i,j} = w_j \cdot X_{i,j}$ and $Y^w_{i,j} = w_j \cdot Y_{i,j}$. The frequency dependence of the weights is subject to an optimization with the aim of maximizing the performance of the speech quality prediction (see section 2.5.1).

The optimal weighting found for speech quality prediction in this study is shown in Fig. 2.2. The weights have a flat characteristic for center frequencies up to 1 kHz, or 15.7 ERB. For higher bands, the weights increase, with the highest weight on the band with the highest center frequency. This characteristic differs considerably from weighting functions used, e.g., for the Articulation Index (AI) (French and Steinberg, 1947) and the Speech Transmission Index (STI) (Steeneken and Houtgast, 1980) for the prediction of speech intelligibility.

Mean squared difference

The mean squared “difference measure” qS is defined by

\[ q_S = \frac{1}{N \cdot M} \sum_{i=1}^{N} \sum_{j=1}^{M} \left( w_j X_{i,j} - w_j Y_{i,j} \right)^2 \qquad (2.2) \]

[Figure 2.2: Band weighting characteristic. Plot title: "Band importance weighting characteristic"; weight (0 to 1) as a function of frequency in ERB (8 to 26).]

The averaging is performed across the time and frequency axes. In this definition of qS, the weighted representation is regarded as consisting of time and frequency components $w_j X_{i,j}$ (or $w_j Y_{i,j}$) that contain equal amounts of information. A high value of qS indicates a large difference between test sentence and reference sentence and should therefore correspond to a lower subjective speech quality of the test sentence. This measure has been investigated because it is the most widely used error measure and is also employed in most other objective speech quality measures. However, it is not clear beforehand that it will perform well for the current purpose. For example, exponents other than 2 in eq. (2.2) might give a better performance.
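The mean squared difference of the weighted representations translates directly into array operations. In this sketch, `X` and `Y` are assumed to be arrays of shape (N, M) (time frames by frequency bands) and `w` a vector of M band weights; the function name `q_s` and the toy values are illustrative only.

```python
import numpy as np

def q_s(X, Y, w):
    """Mean over all time frames i and bands j of (w_j X_ij - w_j Y_ij)^2."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    w = np.asarray(w, dtype=float)
    diff = w * X - w * Y          # w (length M) broadcasts over the time axis
    return float(np.mean(diff ** 2))

# Toy example (assumption): 4 frames, 3 bands, constant difference of 1.
X = np.zeros((4, 3))
Y = np.ones((4, 3))
w = np.array([1.0, 2.0, 3.0])
value = q_s(X, Y, w)              # mean of 1, 4, 9 over the three bands
```

Because the weights enter before squaring, doubling a band weight quadruples that band's contribution to qS.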

Linear correlation coefficient measure

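The linear correlation coefficient measure qC (denoted as “correlation measure”) is defined as

\[ q_C = \frac{\sum_{i,j=1}^{N,M} \left( w_j X_{i,j} - X_V \right) \cdot \left( w_j Y_{i,j} - Y_V \right)}{\sqrt{\sum_{i,j} \left( w_j X_{i,j} - X_V \right)^2} \, \sqrt{\sum_{i,j} \left( w_j Y_{i,j} - Y_V \right)^2}}, \qquad (2.3) \]

where $X_V$ and $Y_V$ denote the means of $w_j X_{i,j}$ and $w_j Y_{i,j}$. The summation $\sum_{i,j}$ is to be performed over all possible pairs i, j.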

qC is normalized and ranges between -1 and 1. A small value of qC indicates large differences between test sentence and reference sentence and should therefore correspond to a lower speech quality of the test sentence.
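The correlation measure can be sketched in the same way as a Pearson correlation over all weighted time-frequency components. As before, `X` and `Y` are assumed to have shape (N, M), `w` has length M, and the function name `q_c` and the random test data are assumptions for illustration.

```python
import numpy as np

def q_c(X, Y, w):
    """Linear correlation coefficient between the weighted representations,
    computed over all pairs (i, j)."""
    w = np.asarray(w, dtype=float)
    Xw = w * np.asarray(X, dtype=float)
    Yw = w * np.asarray(Y, dtype=float)
    dx = Xw - Xw.mean()           # subtract the overall mean X_V
    dy = Yw - Yw.mean()           # subtract the overall mean Y_V
    return float(np.sum(dx * dy) / (np.sqrt(np.sum(dx ** 2)) * np.sqrt(np.sum(dy ** 2))))

# Synthetic example (assumption): 400 frames, 19 bands, rising weights.
rng = np.random.default_rng(2)
X = rng.random((400, 19))
Y = X + 0.2 * rng.standard_normal(X.shape)    # "distorted" representation
w = np.linspace(1.0, 5.0, 19)
value = q_c(X, Y, w)
```

Unlike the squared difference, this measure is unaffected by a common scaling of all weights, but it does depend on their relative size across bands.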


As for the difference measure qS, in the correlation measure qC all time-frequency components $X^w_{i,j}$ are also regarded as carrying independent portions of information.

This measure has been chosen because it is closely related to the measure used by Dau in the optimal detection stage of his model (Dau et al., 1996a). The relation between qC and Dau’s log likelihood measure, which results in the calculation of a correlation, is discussed in chapter 3 and in Appendix B of this thesis.

2.4 Objective speech quality prediction results with the standard auditory processing

For the test data bases described in 2.2, the objective speech quality measures qS and qC were calculated for each sentence sequence (cf. 2.3.2). In the following figures, the objective measure is plotted against the subjectively measured mean MOS. The subjective data are the mean ratings of 24 test subjects, averaged across the sentences of four different speakers. For some individual codecs, the MOS has a considerable standard deviation in the range of 0.7 or even up to 1.0, which is often not published in the subjective results data base. Each symbol in the plots represents a certain codec within one test. The reference system MNRU, at varying modulation depth from -50 dB to 0 dB, is always indicated by the symbol ‘m’. For the ETSI test data base, the symbols ‘1’ to ‘6’ represent the six different codecs within the test. For the ITU-8kbit test data base, the numbers ‘1’ to ‘3’ indicate a single codec and twofold and threefold tandeming of the codecs, respectively, while the letters indicate the different codecs of the test. For the two ADPCM cascading tests, the letters indicate different combinations and data rates of the ADPCMs, varying from a cascade of 64-16-16-64 kbit/s ADPCMs up to 64-32-64 kbit/s.

As a measure of the performance of an objective speech quality measure, the (linear) correlation coefficient r and the Spearman rank correlation coefficient rs were calculated between the set of objective and subjective data obtained for one complete test data base (see footnote 3). Because the data do not deviate significantly in shape from a straight line, they were fitted by a first-order polynomial. The standard deviation (SD) of the data from this fitted function is also given in the figure legends.

Footnote 3: Note that these two correlation coefficients must not be confused with the quality measure qC, which is also a correlation coefficient, but is calculated between two corresponding signal representations.
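The three performance figures (r, rs and SD) can be computed as sketched below. Several details are assumptions of this sketch: MOS is fitted as a linear function of the objective measure, SD is taken as the root mean square of the residuals, ties receive average ranks, and the numerical values are invented for illustration and are not data from this study.

```python
import numpy as np

def pearson_r(x, y):
    """Linear correlation coefficient r."""
    return float(np.corrcoef(x, y)[0, 1])

def average_ranks(v):
    """Ranks starting at 1; tied values share their average rank."""
    v = np.asarray(v, dtype=float)
    order = np.argsort(v, kind="mergesort")
    ranks = np.empty(len(v))
    ranks[order] = np.arange(1, len(v) + 1)
    for value in np.unique(v):
        tied = v == value
        ranks[tied] = ranks[tied].mean()
    return ranks

def spearman_rs(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

def fit_sd(objective, mos):
    """RMS deviation of the MOS data from a first-order polynomial fit."""
    objective = np.asarray(objective, dtype=float)
    mos = np.asarray(mos, dtype=float)
    slope, intercept = np.polyfit(objective, mos, 1)
    residuals = mos - (slope * objective + intercept)
    return float(np.sqrt(np.mean(residuals ** 2)))

# Invented illustration values (NOT results from the test data bases).
objective = np.array([0.86, 0.90, 0.93, 0.96, 0.99])
mos = np.array([1.5, 2.4, 2.9, 3.8, 4.4])
r, rs, sd = pearson_r(objective, mos), spearman_rs(objective, mos), fit_sd(objective, mos)
```

For the difference measure qS, which decreases with increasing quality, the same functions simply return negative correlations.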

Results with the correlation quality measure qC

The results of the ETSI test data base with the qC measure are plotted in Fig. 2.3. In this figure, a monotonic relation between objective and subjective data is observed. The data lie within a narrow region and show a small standard deviation. The linear correlation coefficient is r=0.937, the rank correlation coefficient is rs=0.917 and the standard deviation amounts to only SD=0.257 MOS. These values indicate a very high performance of the objective speech quality measure. Note that no clearly separated clusters for certain groups of codecs or the MNRU are observed in the data: The quality of the MNRU system alone can easily be predicted by simpler S/N measures, but the prediction of the speech quality for a group of different codecs is generally more difficult.

[Figure 2.3: ETSI test data base: Speech quality prediction with the standard auditory processing model. The quality measure is calculated with the correlation measure qC. Axes: MOS (1 to 5) versus objective speech quality qC (0.84 to 1). Legend: rs: 0.917, r: 0.937, SD: 0.257. Symbols: ‘1’ to ‘6’ (codecs), ‘m’ (MNRU).]

The ITU-8kbit data for qC are shown in Fig. 2.4. The performance of qC is not as high as for the ETSI data, but a correlation coefficient of r = 0.878 and a standard deviation of SD=0.372 are still reached. For the three MNRU systems with the lowest speech quality, a slight deviation in subjective quality from the group of codecs can be observed. For the group of codecs alone, a smaller standard deviation of 0.306 is found, while the correlation coefficient is reduced to 0.858.

[Figure 2.4: ITU 8kbit test data base: Speech quality prediction by qC with the standard auditory processing model. Axes: MOS (1 to 5) versus objective speech quality (0.8 to 1). Legend: rs: 0.875, r: 0.878, SD: 0.371. Symbols: ‘1a’ to ‘3p’ (codecs and tandem conditions), ‘m’ (MNRU).]

The qC results for the two ADPCM cascading tests are shown in Figs. 2.5 and 2.6. The speech material of these two tests differs in that the first test was carried out with an end-to-end connection that was simulated by another (constant) ADPCM device at 32 kbit/s, while in the second test the codec combination under test was connected via a real-net long-distance telephone line. For the first test with a simulated-net connection, a high correlation of r = 0.91 is reached and the rank order is very well predicted (rs = 0.94). The standard deviation is SD=0.391. In summary, the qC measure yields a good speech quality prediction for this test. For the test with real-net connections involved, however, the results are worse (cf. Fig. 2.6). The linear correlation is as low as 0.85, which is not satisfactory for the prediction of the whole set of data. Correspondingly, the standard deviation is also higher (SD=0.461). The reason for the poor performance is not clear. However, it should be noted that objective speech quality measures generally perform worse for test signals that show more realistic “real-world” distortions.

[Figure 2.5: ADPCM cascading simulated-net connection test data base: Speech quality prediction by qC with the standard auditory processing model. Axes: MOS (1 to 5) versus objective speech quality qC (0.84 to 1). Legend: rs: 0.943, r: 0.909, SD: 0.392. Symbols: ‘a’ to ‘k’ (ADPCM combinations), ‘m’ (MNRU).]

[Figure 2.6: ADPCM cascading real-net connection test data base: Speech quality prediction by qC with the standard auditory processing model. Axes: MOS (1 to 5) versus objective speech quality qC (0.84 to 1). Legend: rs: 0.897, r: 0.865, SD: 0.440. Symbols: ‘a’ to ‘k’ (ADPCM combinations), ‘m’ (MNRU).]


Results with the squared difference quality measure qS

The results of the ETSI test are shown in Fig. 2.7. The performance of the speech quality prediction is almost as good as with the correlation measure qC. The correlation coefficients amount to r = -0.926 and rs = -0.904, and the standard deviation is slightly higher at SD=0.278. This may be caused by the fact that the data deviate a little more from a straight line than in Fig. 2.3, especially for the MNRU data at the low end of the speech quality scale.

[Figure 2.7: ETSI test data base: Speech quality prediction by the squared difference measure qS with the standard auditory processing model. Axes: MOS (1 to 5) versus objective speech quality qS (1000 to 9000). Legend: rs: -0.901, r: -0.927, SD: 0.276. Symbols: ‘1’ to ‘6’ (codecs), ‘m’ (MNRU).]

For the ITU 8kbit test, qC and qS (Fig. 2.8) perform equally well. For the squared difference measure qS, the linear correlation is r = -0.893, the rank correlation is rs = -0.873 and the standard deviation is SD=0.35.

For the two ADPCM cascading tests, the results with qS are plotted in Figs. 2.9 and 2.10. In these figures, the performance of the speech quality prediction is slightly lower than with the correlation measure qC. For the simulated-net connection, the correlation is r = -0.897, rs = -0.9351 and the standard deviation is SD=0.417. For the real-net connection test, r = -0.821, rs = -0.873 and SD=0.501 are observed. Again, the performance is poor for the real-net connection test, as was observed for the measure qC. Especially at the low end of the quality scale, an intolerable spread in the objective data is found for different codecs with approximately equal subjective quality rating.

[Figure 2.8: ITU 8kbit test data base: Speech quality prediction by the squared difference measure qS with the standard auditory processing model. Axes: MOS (1 to 5) versus objective speech quality qS (2000 to 10000). Legend: rs: -0.875, r: -0.893, SD: 0.350. Symbols: ‘1a’ to ‘3p’ (codecs and tandem conditions), ‘m’ (MNRU).]

[Figure 2.9: ADPCM cascading simulated-net connection test data base: Speech quality prediction by the squared difference measure qS with the standard auditory processing model. Axes: MOS (1 to 5) versus objective speech quality qS (1000 to 8000). Legend: rs: -0.933, r: -0.895, SD: 0.420. Symbols: ‘a’ to ‘k’ (ADPCM combinations), ‘m’ (MNRU).]

[Figure 2.10: ADPCM cascading real-net connection test data base: Speech quality prediction by the squared difference measure qS with the standard auditory processing model. Axes: MOS (1 to 5) versus objective speech quality qS (1000 to 9000). Legend: rs: -0.896, r: -0.842, SD: 0.474. Symbols: ‘a’ to ‘k’ (ADPCM combinations), ‘m’ (MNRU).]

In summary, the results show that the standard auditory processing model with the correlation measure qC yields a very good performance in the ETSI test, a good performance in the ITU 8 kbit test and the simulated-net ADPCM cascading test, and still a fair performance in the real-net ADPCM cascading test.

It is noteworthy that none of the parameters of the auditory processing derived from psychoacoustical masking experiments was changed. The only new feature is the introduction of the band weighting directly prior to the calculation of qC or qS.

2.5 Discussion

2.5.1 Effect of band importance weighting

A band importance weighting function was introduced within the final quality measure. Its influence on the performance of qC has been investigated further.

A band importance weighting might be necessary because the task of rating the quality of a clearly supra-threshold distortion in a test signal is completely different from the task of detecting a just noticeable change in a test signal. The need for a band weighting is also suggested by the weighting functions that were introduced within the AI and STI. In these two measures, a weighting is imposed on the signal-to-noise ratios in different bands of the signals. These signal-to-noise ratios are measured on a logarithmic (dB) scale. These weightings have the highest weight on the middle bands and lower weights at the high- and low-frequency ends.

In our case, the internal representation, i.e., the output of the auditory processing model, is compressed with an approximately logarithmic characteristic for stationary input signals (see section 2.3.3). The optimal weighting for speech quality prediction, however, has increasing weights with increasing center frequency, with the highest weight on the highest band.

The weighting function presented in Fig. 2.2 is the outcome of an optimization with the aim of maximizing the correlations r and rs and minimizing the standard deviation between the objective and subjective quality measures. A number of different types of weighting characteristics were investigated and compared with respect to their effect on the performance of the resulting objective speech quality measure. The different band weighting functions were chosen to represent stylized forms of high-pass, low-pass, band-pass functions, etc., with an edge frequency of 1 kHz, or 15.6 ERB. In most cases, the inverse weights 1/wj were also investigated as another weighting function. The set of functions under investigation is shown in Fig. 2.11.

Each of these weighting functions was individually applied to the internal representations, and the correlation speech quality measure qC was calculated for the ETSI and the ITU 8kbit test data bases. For each weighting function, the correlations r and rs and the standard deviation SD were calculated as a measure of the performance of the weighting function. The results of this calculation are shown in Table 2.1 for the ETSI test and in Table 2.2 for the ITU 8kbit test.

No weighting at all, i.e., constant weights wj = 1, j = 1 . . . 19 (“const” in Fig. 2.11), clearly shows a poor performance. The results in Tables 2.1 and 2.2 systematically indicate the necessity for a weighting characteristic that increases with band center frequency. Note the performance with the weighting “ch19w5”: If all weights are kept constant and only the weight for the highest band is increased to w19 = 5, the performance of the speech quality measure is already raised considerably. For weightings with the opposite characteristic, i.e., relatively low weights wj for the bands j with higher center frequencies, a poorer performance is observed than for no weighting at all. From a comparison of the different functions, it can be concluded that the weighting function “low1k-inv”, which is the inverse of the stylized characteristic of a low-pass at 1 kHz, performs best for the two


[Figure 2.11 (plots not reproduced): 18 panels, one per weighting function, each showing the weight as a function of frequency [ERB]. Panel titles: high1k, low1k, band1k, high1k-inv, low1k-inv, band1k-inv, const, step, triup, triup-inv, tridown, tridown-inv, exp1, exp2, exp3, iso40, iso40-inv, ch19w5.]

Figure 2.11: Different frequency characteristics for the weighting of the internal representation. The weight is plotted as a function of the center frequency on the ERB scale. The functions with suffix “-inv” are the inverse of the respective weights with the same prefix name. All characteristics were investigated with respect to the resulting performance of the measure qC.


weight        r      rs     SD
high1k        0.803  0.810  0.439
low1k         0.662  0.729  0.552
band1k        0.734  0.724  0.501
high1k-inv    0.539  0.638  0.620
low1k-inv     0.937  0.917  0.257
band1k-inv    0.818  0.857  0.424
const         0.739  0.778  0.496
step          0.829  0.836  0.412
triup         0.844  0.856  0.395
triup-inv     0.528  0.635  0.625
tridown       0.607  0.689  0.586
tridown-inv   0.935  0.915  0.262
exp1          0.906  0.907  0.312
exp2          0.878  0.883  0.352
exp3          0.926  0.924  0.278
iso40         0.714  0.734  0.516
iso40-inv     0.883  0.898  0.346
ch19w5        0.918  0.901  0.291

Table 2.1: Results for the ETSI test data base with the standard auditory processing model. Correlations and standard deviations for the performance of different weighting functions within the qC quality measure are given. The weighting characteristics appear in the same order as in Fig. 2.11.


weight        r      rs     SD
high1k        0.803  0.823  0.463
band1k        0.763  0.817  0.503
low1k         0.701  0.757  0.554
high1k-inv    0.574  0.643  0.637
low1k-inv     0.878  0.875  0.371
band1k-inv    0.788  0.805  0.479
const         0.762  0.807  0.503
step          0.808  0.812  0.458
tridown       0.650  0.710  0.591
tridown-inv   0.878  0.873  0.374
triup         0.821  0.815  0.444
triup-inv     0.562  0.633  0.643
exp1          0.853  0.842  0.405
exp2          0.837  0.828  0.425
exp3          0.869  0.856  0.385
iso40         0.744  0.792  0.519
iso40-inv     0.843  0.839  0.418
ch19w5        0.861  0.856  0.396

Table 2.2: Results for the ITU 8kbit test data base with the standard processing model. Correlations and standard deviations for the performance of different weighting functions within the qC quality measure are given.


test data bases. This function was already shown in Fig. 2.2. It was used to calculate the results for qC and qS shown in Fig. 2.3–2.6 and in Fig. 2.7–2.10.
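The role of the band weights can be illustrated with a small sketch. Only the “const” and “ch19w5” characteristics and the “-inv” operation are fully specified in the text, so only those are built here. The function names are hypothetical, and the way the weights enter the correlation (scaling each channel of both internal representations before one overall correlation coefficient is taken) is an assumption, since the definition of qC is given in an earlier section.

```python
import numpy as np

N_CHANNELS = 19  # number of filter channels in the standard model

def const_weights(n=N_CHANNELS):
    """'const': no weighting, w_j = 1 for all channels."""
    return np.ones(n)

def ch19w5_weights(n=N_CHANNELS):
    """'ch19w5': all weights 1 except the highest channel, set to 5."""
    w = np.ones(n)
    w[-1] = 5.0
    return w

def inverse_weights(w):
    """'-inv' variants: element-wise inverse 1 / w_j."""
    return 1.0 / np.asarray(w, dtype=float)

def weighted_correlation(ref, test, w):
    """Correlation of two internal representations after channel weighting.

    ref, test: arrays of shape (channels, time frames).
    ASSUMPTION: w_j scales channel j of both representations before a
    single overall correlation coefficient is computed.
    """
    w = np.asarray(w, dtype=float)[:, None]
    a = (w * np.asarray(ref, dtype=float)).ravel()
    b = (w * np.asarray(test, dtype=float)).ravel()
    return np.corrcoef(a, b)[0, 1]
```

With a distortion confined to the highest channel, the “ch19w5” weights lower the correlation more than constant weights do, which is the qualitative behavior the tables reward.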

2.5.2 Optimal band weighting: Are high weights necessary because of frequency band limitations?

It may be argued that the necessity of considerably high weights for the high frequency bands results merely from the fact that the auditory filter bands at the first stage of the auditory processing model do not cover the entire audible frequency range. The filters were chosen to cover the range of the telephone transmission band from 300 to 3400 Hz. A typical telephone transmission channel acts as a relatively steep filter with these edge frequencies. In the case of the present model, the filter center frequencies were chosen to range from 340 Hz to 3790 Hz. This means that the center frequency of the highest filter is above the highest signal frequencies with relevant spectrum level.

However, it still might be possible that high-frequency energy falls into even higher auditory filters due to upward spread of excitation. This spectral region at the upper edge of the telephone band may convey important information about the speech signal, such as the speech quality. One might therefore argue that a high weight on the highest filters is needed for optimal speech quality prediction, because perceptually important information about the excitation of the basilar membrane at even higher frequencies is only represented in the highest filter(s) within the present filter bank.

The effect of spectral coverage at higher bands of the signal representation was therefore investigated in a further set of model calculations. The auditory processing model was modified to contain a gammatone filterbank constructed in the same way as before, except that two extra high-frequency channels were added with center frequencies of 4250 Hz and 4780 Hz, respectively. With this processing model, calculations of the objective speech quality measure qC were carried out and the performance in terms of r, rs, and SD was tested for the ETSI Halfrate Selection Test and the ITU 8kbit test material.
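The channel layout can be sketched as follows, under two assumptions that are not stated in this section: the ERB scale follows the Glasberg and Moore form 21.4 log10(4.37 f/1000 + 1), which reproduces the quoted 15.6 ERB at 1 kHz, and the 19 center frequencies are equally spaced on that scale between 340 Hz and 3790 Hz. The function names are hypothetical.

```python
import numpy as np

def hz_to_erb(f_hz):
    """ERB-rate for a frequency in Hz (Glasberg-Moore form, assumed here)."""
    return 21.4 * np.log10(4.37 * np.asarray(f_hz, dtype=float) / 1000.0 + 1.0)

def erb_to_hz(erb):
    """Inverse of hz_to_erb."""
    return (10.0 ** (np.asarray(erb, dtype=float) / 21.4) - 1.0) * 1000.0 / 4.37

def center_frequencies(f_low=340.0, f_high=3790.0, n=19, extra=0):
    """Center frequencies equally spaced on the ERB scale.

    ASSUMPTION: equal ERB spacing between f_low and f_high; `extra`
    channels continue the same spacing above f_high.
    """
    erb = np.linspace(hz_to_erb(f_low), hz_to_erb(f_high), n)
    step = erb[1] - erb[0]
    erb = np.concatenate([erb, erb[-1] + step * np.arange(1, extra + 1)])
    return erb_to_hz(erb)
```

Under these assumptions, `center_frequencies(extra=2)` continues the grid with two channels that come out close to the 4250 Hz and 4780 Hz quoted above, which suggests a spacing of roughly one ERB per channel.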

Two different weighting functions were investigated. The first one was the constant weighting w1 = w2 = . . . = w21 = 1. If the above argument holds true, this weighting should result in a clearly improved performance of the speech quality prediction because extra information on high frequency excitation is now contained in the signal representations. The second weighting function was identical to the weighting “low1k-inv” for the first 19 weights w1 . . . w19, but the two highest weights were set to w20 = w21 = 1. This weighting makes the signal representation contain the same information on


the two highest bands as in the constant weighting. The two weighting functions are plotted in Fig. 2.12.

[Figure 2.12 (plots not reproduced): two panels showing the weight as a function of frequency [ERB]. Panel titles: const + 2 ch., low1k-inv + 2 ch.]

Figure 2.12: Frequency characteristics of the two weighting functions covering two bands at higher frequencies. The weight is plotted as a function of the center frequency on the ERB scale.

The results of an objective speech quality measurement for these two weighting functions are given in Table 2.3. Only a marginal improvement is found for the constant weighting when the two additional high-frequency channels are introduced. The correlation coefficient found for qC with constant weighting and extra high-frequency channels is still clearly smaller than for the increasing weighting “low1k-inv” (with or without two extra high-frequency bands). This shows that the assumption on the necessity of extra high-frequency information does not hold.

test data base   weighting                   r      rs     SD
ETSI test        const, w20 = w21 = 1        0.759  0.780  0.479
ETSI test        low1k-inv, w20 = w21 = 1    0.933  0.914  0.265
ETSI test        low1k-inv (19 channels)     0.937  0.917  0.257
ITU 8kbit        const, w20 = w21 = 1        0.778  0.808  0.488
ITU 8kbit        low1k-inv, w20 = w21 = 1    0.878  0.866  0.372
ITU 8kbit        low1k-inv (19 channels)     0.878  0.875  0.371

Table 2.3: Correlations and standard deviations for the performance of qC calculated from an internal representation which was extended to 21 channels instead of the standard 19 channels. Two different weighting functions were investigated with two test data bases.


2.5.3 Alternative justification of the optimal band weighting characteristic

It would be desirable to find an independent psychoacoustical justification for the optimality of the low1k-inv weighting. Experiments designed for this purpose will be presented in chapter 3 of this thesis. However, the need for high weights at high center frequencies of the internal representation can be explained by a phenomenological argument.

In Fig. 2.13 the subjective MOS as a function of the qC data for the ETSI test data base is shown for the constant and for the low1k-inv weighting, both based on 19 channels of the internal representation. Obviously, the

[Figure 2.13 (plots not reproduced): two panels showing MOS (1 to 5) as a function of qC (0.82 to 1). Data points are labeled 1 to 6 and m.]

Figure 2.13: Results of qC with constant band weighting (left panel) and with low1k-inv weighting (right panel) for the ETSI Halfrate Selection Test.

low1k-inv weighting shifts the data towards lower values of qC. Also, the low1k-inv weighting leads to a larger decrease of qC for the MNRU sentences than for the codec-processed sentences. It appears that qC with constant weighting overestimates the speech quality of the MNRU-processed sentences compared to the codec-processed sentences with respect to the subjective MOS quality ratings.

This constitutes the main reason for the larger correlation coefficient between qC and MOS if the “low1k-inv” weighting is used rather than the constant weighting. Another reason is that the correlation coefficient for the data subset of the codec-processed sentences is larger with the low1k-inv weighting than with the constant weighting. However, the major part of the improvement in objective speech quality prediction by qC with low1k-inv weighting obviously stems from the fact that the MNRU data group at medium and low MOS values is shifted to lower values of qC, compared to constant weighting.

The reason why the low1k-inv weighting shows a larger effect for the


MNRU-processed sentences might be explained by the following argument (Houtgast, 1997): The speech codecs generally aim at producing a distortion noise that is masked by the speech signal as much as possible. This will in particular lead to a spectral shape of the distortion noise that mainly follows the long-term speech spectrum of the input speech signal. The MNRU, however, produces a distortion which is mainly spectrally flat, since it results from the convolution of the speech spectrum with the wide-band white noise modulator of the MNRU.
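The flat distortion spectrum follows from how modulated noise is generated: the speech is multiplied by white noise in the time domain, which corresponds to a convolution of the two spectra. A minimal sketch is given below. It assumes the commonly used form y[i] = x[i] (1 + 10^(-Q/20) n[i]) with unit-variance white Gaussian noise n; the function name is hypothetical, and the band limiting applied in practical MNRU implementations is omitted.

```python
import numpy as np

def mnru(x, q_db, rng=None):
    """Add speech-modulated white noise to x at a ratio of Q dB.

    Sketch of y[i] = x[i] * (1 + 10**(-Q/20) * n[i]), where n is white
    Gaussian noise of unit variance. ASSUMPTION: no band limiting of the
    noise or of the output is applied here.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    noise = rng.standard_normal(x.shape)
    return x * (1.0 + 10.0 ** (-q_db / 20.0) * noise)
```

Because the noise amplitude tracks the instantaneous speech amplitude, the ratio of speech power to distortion power is approximately Q dB, while the distortion spectrum stays broad regardless of the speech spectrum.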

[Figure 2.14 (plots not reproduced): three panels showing level [dB] as a function of frequency [Hz] from 0 to 8000 Hz. Panel titles: input speech signal to codec, typical codec, MNRU at Q=0 dB.]

Figure 2.14: Long-term spectral energy densities of reference speech signals (upper left panel) and resulting codec-processed sentence (upper right panel), and MNRU-processed sentences at Q=0 dB (lower right panel).

The typical long-term speech spectra of the reference sentences, a typical codec-processed sentence, and an MNRU-processed sentence at Q=0 dB are shown in Fig. 2.14. The three spectra in this figure stem from sentences of the ETSI test data base. From this figure it is obvious that the MNRU-processed and the codec-processed sentence differ mainly in the spectral region above 2.5 kHz. Consequently, the MNRU-processed sentences exhibit larger differences relative to the corresponding reference sentence in this spectral region than the codec-processed sentences. Note that this spectral region of the internal representation is emphasized more strongly by the low1k-inv weighting and similar weighting characteristics. This leads to a shift of the speech quality measure (qC or qS) towards poorer quality which is larger for the MNRU sentences than for the codec-processed sentences. This shift is


necessary to increase the correlation of the MOS versus qC (or qS) data, as observed from Fig. 2.13. This might be the explanation why the weightings with larger weights on the highest center frequencies of the internal representation (such as the “low1k-inv” weighting) yield a higher performance of the speech quality measure in terms of a larger correlation and smaller standard deviation.

2.5.4 Optimal choice of parameters in adaptation loops

One important stage of the processing model is the chain of feedback loops that account for the compression and nonlinear adaptation of the incoming envelope signal. The feedback loops were originally developed to model forward masking effects (Püschel, 1988). Dau (1992), Dau and Püschel (1993) and Dau et al. (1996b) adjusted the time constants slightly to fit the threshold predictions of the complete model to a variety of psychoacoustical experiments. The time constants that resulted from this fit were τ1 = 5, τ2 = 50, τ3 = 129, τ4 = 253, and τ5 = 500 ms. Note that, compared to a linear distribution of the τi between the smallest and the largest time constant, the 50 ms value was added and the time constant of 376 ms was discarded. The final first-order low-pass filter at the output of the n loops was set to τlp = 20 ms. Keeping this set of parameters constant, detection thresholds could be modeled for different kinds of simultaneous and non-simultaneous masking experiments. However, the perception of speech quality is a supra-threshold phenomenon that does not necessarily require the same processing parameters as found for threshold data. It was therefore investigated whether a different set of time constants or even a different number n of feedback loops would result in a better performance of the objective measure qC (or qS).
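How such a chain produces both compression and adaptation can be sketched as follows. The structure used here is an assumption, based on the general description of these loops in the cited literature rather than on anything specified in this section: each loop divides its input by a first-order low-pass filtered copy of its own output. The function name, the initial state and the lower limit are hypothetical, and the final 20 ms low-pass filter is omitted.

```python
import math

def adaptation_loops(envelope, fs, taus=(0.005, 0.050, 0.129, 0.253, 0.500)):
    """Pass a non-negative envelope through a chain of divisive feedback loops.

    ASSUMPTION: each loop divides its input by a first-order low-pass
    filtered copy of its own output (time constant tau in seconds).
    For a stationary input x, a single loop then settles at sqrt(x).
    """
    floor = 1e-5  # hypothetical lower limit to avoid division by ~0
    coeffs = [math.exp(-1.0 / (tau * fs)) for tau in taus]
    states = [1.0] * len(taus)  # low-pass states, used as divisors
    out = []
    for sample in envelope:
        value = max(sample, floor)
        for k, a in enumerate(coeffs):
            value = value / states[k]
            states[k] = a * states[k] + (1.0 - a) * value
        out.append(value)
    return out
```

With this structure, a stationary input is compressed to its 32nd root after five loops, which is close to logarithmic compression, whereas fast changes pass through almost linearly before the divisors have adapted. The loop time constants determine how quickly that transition happens.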

For a number of modified parameter sets of the feedback loops, qC was calculated for the ETSI test data base. In the version labeled “0” only τ1 was changed compared to the original version. In all other versions, labeled “1” to “29”, the n time constants were distributed linearly between τ1 and τn according to τi = τ1 + (i − 1)/(n − 1) · (τn − τ1). The correlation between qC and the subjective MOS data was calculated for all modified versions. The results of this performance analysis are summarized in Table 2.4. From this table it is obvious that only version “0” shows a slightly better performance in the speech quality prediction compared to the original version. All other versions of the parameter set of the feedback loops result in a decreased performance. It is observed that the parameter τlp has the highest influence on the performance of qC with this test data base. All versions with τlp = 0.2 s result in a poorer performance than with τlp = 0.002 s, while the best


version#   n  τ1 [s]  τn [s]  τlp [s]  r      rs     SD
original   5  0.005*  0.5*    0.020    0.937  0.917  0.257
0          5  0.001*  0.5*    0.020    0.940  0.915  0.251
1          5  0.005   0.5     0.020    0.934  0.896  0.264
2          2  0.001   0.5     0.200    0.837  0.759  0.403
3          3  0.010   0.2     0.200    0.753  0.727  0.485
4          2  0.005   0.1     0.200    0.803  0.711  0.439
5          5  0.020   0.5     0.200    0.832  0.787  0.408
6          6  0.005   0.5     0.200    0.843  0.785  0.396
7          7  0.005   0.5     0.200    0.855  0.800  0.382
8          6  0.005   1.0     0.200    0.803  0.748  0.439
9          6  0.001   1.0     0.200    0.811  0.751  0.431
10         5  0.005   0.5     0.002    0.916  0.909  0.295
11         5  0.005   0.5     1.000    0.449  0.611  0.658
12         5  0.005   0.5     2.E-6    0.906  0.898  0.311
13         5  0.001   0.1     2.E-4    0.783  0.789  0.458
14         2  0.010   0.1     0.002    0.925  0.919  0.280
15         3  0.010   0.2     0.002    0.911  0.902  0.304
16         2  0.005   0.1     0.002    0.929  0.923  0.272
17         5  0.020   0.5     0.002    0.890  0.894  0.335
18         6  0.005   0.5     0.002    0.909  0.907  0.307
19         7  0.005   0.5     0.002    0.885  0.893  0.343
20         6  0.005   1.0     0.002    0.905  0.893  0.314
21         6  0.001   1.0     0.002    0.918  0.897  0.293
22         2  0.010   0.1     0.020    0.920  0.898  0.288
23         3  0.010   0.2     0.020    0.934  0.890  0.263
24         2  0.005   0.1     0.020    0.917  0.897  0.295
25         5  0.020   0.5     0.020    0.932  0.908  0.267
26         6  0.005   0.5     0.020    0.936  0.910  0.258
27         7  0.005   0.5     0.020    0.933  0.903  0.266
28         6  0.005   1.0     0.020    0.919  0.880  0.290
29         6  0.001   1.0     0.020    0.924  0.882  0.282

Table 2.4: Performance of qC for different parameter sets of the adaptation stage within the auditory processing model. The performance is given by the correlation coefficient r, rank correlation coefficient rs, and standard deviation SD for the ETSI test data base. The feedback loops were varied with respect to the number n of loops, different loop time constants τ1 and τn, and different low-pass filter time constant τlp at the output. The loop time constants were distributed linearly between the first time constant τ1 and the last time constant τn, except for the original version and version “0” (denoted by *, see text).


performance is found in the versions with τlp = 0.02 s, which was the standard value from the original version.
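The linear distribution τi = τ1 + (i − 1)/(n − 1) · (τn − τ1) used for versions “1” to “29” can be written out directly. The function name below is hypothetical; time constants are in seconds, as in Table 2.4.

```python
def linear_time_constants(n, tau_1, tau_n):
    """Return n loop time constants spaced linearly from tau_1 to tau_n (seconds)."""
    if n < 2:
        raise ValueError("at least two loops are needed for a linear spacing")
    return [tau_1 + (i - 1) / (n - 1) * (tau_n - tau_1) for i in range(1, n + 1)]
```

For version “1” (n = 5, τ1 = 0.005 s, τn = 0.5 s) this gives 5, 128.75, 252.5, 376.25 and 500 ms. The original parameter set uses the rounded values 129 and 253 ms from this grid, drops the 376 ms value, and adds 50 ms instead.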

For a fixed τlp = 0.02 s it is observed that the correlation coefficient r decreases for variations of the parameters n, τ1, and τn. However, the dependence of r and SD on the parameter set is relatively weak, i.e., the maximum of r (and the minimum of SD) is rather broad. Note that a very good performance of qC is also observed for n = 3 (in version #23), n = 6 (#26) and n = 7 (#27). In the case of psychoacoustical masking experiments, the same observation was made, i.e., the exact choice of the parameter set of the feedback loops does not have a very critical influence on the high performance of the model in threshold prediction (Verhey, 1998). This comparatively weak dependence of r on the exact parameters is a desirable property of qC: if, for example, a distinct, sharp maximum of r had been observed as a function of the parameter set of the feedback loops, the general applicability of qC as a speech quality measure for various test data bases would be doubtful.

Taken together, this investigation confirms that the original version of the adaptation stage with n = 5, loop time constants τ1 = 5, τ2 = 50, τ3 = 129, τ4 = 253, and τ5 = 500 ms, and final low-pass time constant τlp = 20 ms also serves as the optimal processing model for the present speech quality measure qC. The value of τlp is more critical for the performance of qC than the time constants τi of the feedback loops. However, n = 3 or n = 6 feedback loops can also result in a high performance of qC.

2.5.5 Comparison of the “auditory processing model” with other algorithms

In the development of qC as an objective speech quality measure the speech signals were transformed to their internal representation by the so-called “auditory processing model” (cf. section 2.3.3). In a set of earlier studies (Hansen and Kollmeier, 1996b), four alternative algorithms for this signal processing were investigated in addition to the perception model. The original paper is reprinted in Appendix A.

The resulting speech quality measures were compared with respect to their performance in the speech quality prediction of the ETSI test data base. The results of these studies showed that the perception model clearly yields the best performance compared to all other algorithms. Therefore, these alternative preprocessings were not investigated further. Also, additional test data bases were not available at that point in time. Nevertheless, the results confirmed the assumption that the adaptation stage


with the feedback loops accounts for the most realistic “effective” description of dynamic compression and temporal masking properties in the auditory system.

2.5.6 Comparison of qC and PSQM

In this section the speech quality prediction obtained with the measure qC is compared with the “perceptual speech quality measure” (PSQM) (Beerends and Stemerdink, 1994b) in the version that was standardized by the ITU as “P.861”.

The general idea for the speech quality measurement is the same for qC and PSQM. Both methods transform the test sentence and the reference sentence to a signal representation by means of an auditory processing model. A distance measure is calculated from a comparison of the signal representations. Some aspects of the PSQM were already mentioned in section 2.1.

The PSQM shows a high performance in speech quality prediction for different speech quality test data bases. It was selected from a group of several proposals as the candidate for an ITU recommendation on objective speech quality measurement. As a consequence of the further evaluation of PSQM in Study Group 12 of the ITU-T body (ITU-T, 1994), several simplifications of the signal preprocessing were introduced. The stages that accounted for spectral and temporal masking were discarded again. At the same time, the calculation of the final speech quality measure was refined considerably (Beerends and Stemerdink, 1994a; Beerends, 1995). For example, a so-called “asymmetry effect” was accounted for by this refined distance calculation algorithm. The result of the evaluation in Study Group 12 was the ITU-T recommendation “P.861”. Taken together, this measure exhibits a much simpler preprocessing and a more refined distance calculation than the measure qC presented before.

The P.861 speech quality measure was applied to the speech material of the same four test data bases used so far throughout this study. The program code that was available in this study is limited to a fixed sampling rate of the speech material of 8 kHz. Hence, the ETSI test data base was downsampled from 16 kHz to 8 kHz sampling frequency. This was done by employing a downsampling filter with an IRS filter characteristic as recommended for this purpose by the ITU (ITU-T, 1996c). Since the ITU 8kbit test and the ADPCM cascading test material was originally sampled at 8 kHz, it was also filtered with the IRS filter, without downsampling.

The results of P.861 for the four speech quality test data bases are shown in Fig. 2.15–2.18. In each figure, the linear correlation coefficient r of the data, the rank correlation coefficient rs, and the standard deviation SD are


given in the upper right corner.

For the ETSI test shown in Fig. 2.15 the speech quality prediction obtained from P.861 is not satisfactory. The correlations are r = -0.825 and rs = -0.756. The standard deviation is SD = 0.416. The reason for this relatively poor performance is the low correlation in the subset of data belonging to the codec-processed sentences. Within this subset, comparatively large deviations of P.861 are found for sentences with almost equal MOS. Additionally, at the high end of the subjective quality an unsatisfactory spread of P.861 between the MNRU-processed and codec-processed sentences is observed. For this data base, qC performs clearly better (cf. Fig. 2.3) than P.861.

[Figure 2.15 (plot not reproduced): MOS (1 to 5) as a function of the objective speech quality P.861 (0 to 0.2). Data points are labeled 1 to 6 and m. Annotated values: rs: -0.756, r: -0.825, SD: 0.416.]

Figure 2.15: ETSI test data base: Speech quality prediction with the ITU-recommended measure P.861 (compare with qC shown in Fig. 2.3).

For the ITU 8kbit test shown in Fig. 2.16 a high correlation (r = -0.917, rs = -0.907) is observed. Especially the subgroup of codec-processed sentences shows a relatively small deviation from a straight line (SD = 0.239 MOS units). The group of MNRU-processed sentences is found at the lower edge of the data for the codec-processed sentences. The performance for the ITU 8kbit test is better than observed for qC, where a larger deviation occurred for the sentences at the low end of the subjective quality (cf. Fig. 2.4).

The P.861 results for the simulated-net ADPCM cascading test are shown in Fig. 2.17. A high performance (r = -0.926, rs = -0.941, and SD = 0.349) is observed. For sentences at the high-quality end a larger spread of P.861 is observed, especially between the MNRU-processed and the codec-processed


[Figure 2.16 (plot not reproduced): MOS (1 to 5) as a function of the objective speech quality P.861 (0 to 0.18). Data points are labeled 1a to 1j, 2a to 2j, 3a to 3p, and m. Annotated values: rs: -0.907, r: -0.917, SD: 0.311.]

Figure 2.16: ITU 8kbit test data base: Speech quality prediction with the ITU-recommended measure P.861 (compare with qC shown in Fig. 2.4).

sentences. However, an excellent prediction is reached for the subgroup of the codec-processed sentences (r = -0.957, rs = -0.948, SD = 0.238). This is not found for qC with the same data base (cf. Fig. 2.5). In sum, qC and P.861 perform almost equally well for the complete set of sentences.

The results of P.861 with the real-net ADPCM cascading test are shown in Fig. 2.18. The performance is slightly worse than for the simulated-net test material, an effect which was observed previously for qC, too. The performance of P.861 reaches r = -0.901, rs = -0.883, and SD = 0.381, which is clearly better than the performance of qC for this test data base. However, as for the simulated-net test, a larger spread between MNRU-processed and codec-processed sentences is observed at higher values of the MOS. The performance of P.861 for the subset of codec-processed sentences is relatively high (r = -0.945, rs = -0.898, SD = 0.251).

Taken together, the P.861 results presented in this section show that qC and P.861 perform about equally well on average for the four test data bases under investigation. It should be noted, however, that the ADPCM cascading tests are regarded as containing signal distortions that are easier to “catch” from the signal. For these data bases, qC shows a worse performance than P.861 does. The ETSI test data base is generally regarded as containing sentences with signal distortions that are harder to describe, because most codecs in this data base operate at lower bit rates than in the other data bases. Consequently, these codecs produce “more nonlinear” signal distortions than the systems


[Figure 2.17 (plot not reproduced): MOS (1 to 5) as a function of the objective speech quality P.861 (0 to 0.14). Data points are labeled a to k and m. Annotated values: rs: -0.941, r: -0.926, SD: 0.349.]

Figure 2.17: ADPCM cascading simulated-net connection test data base: Speech quality prediction with the ITU-recommended measure P.861 (compare with qC in Fig. 2.5).

[Figure 2.18 (plot not reproduced): MOS (1 to 5) as a function of the objective speech quality P.861 (0 to 0.14). Data points are labeled a to k and m. Annotated values: rs: -0.883, r: -0.901, SD: 0.381.]

Figure 2.18: ADPCM cascading real-net connection test data base: Speech quality prediction with the ITU-recommended measure P.861 (compare with qC in Fig. 2.6).


in the other data bases. For the ETSI data base qC shows a clearly better performance than P.861 does.

2.6 Summary and Conclusion

In this chapter, a new objective speech quality measure was described. The measure was based on a psychoacoustically validated auditory processing model that was employed to transform the speech test signal and a corresponding reference signal to the so-called internal representation. The processing was previously employed to model detection thresholds in a variety of simultaneous and non-simultaneous psychoacoustical masking experiments (Dau et al., 1996a; 1996b). These studies resulted in a set of optimal model parameters which was used for the prediction of speech quality in this study.

The overall correlation coefficient calculated from the band-specifically weighted internal representation of the two signals resulted in the objective speech quality measure qC. Another speech quality measure, qS, was obtained by calculating the mean squared difference instead of the correlation coefficient, averaged over time and frequency.
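A difference measure of the qS type can be sketched in a few lines. The function name is hypothetical, and the way the band weights enter (scaling each channel before the squared difference is averaged) is an assumption, since the exact definition is given in an earlier section.

```python
import numpy as np

def q_s(ref, test, w):
    """Weighted mean squared difference between two internal representations.

    ref, test: arrays of shape (channels, time frames); w: one weight per channel.
    ASSUMPTION: weights scale each channel before the squared difference
    is averaged over all channels and frames.
    """
    w = np.asarray(w, dtype=float)[:, None]
    diff = w * (np.asarray(ref, dtype=float) - np.asarray(test, dtype=float))
    return float(np.mean(diff ** 2))
```

Unlike a correlation measure, this value is a distance: it is zero for identical representations and grows with increasing distortion, so it correlates negatively with MOS.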

Both measures were applied to four different speech quality test data bases which contained low bit-rate coded speech material and the results of its subjective speech quality evaluation in terms of Mean Opinion Scores. The performance of the objective speech quality measurement was expressed by the correlation and standard deviation observed between subjective and objective data. Excellent performance of the new speech quality measures was found in one test data base and a generally high performance was found in the other three data bases. For all four data bases, the performance of the correlation-based measure qC was found to be slightly higher compared to the difference measure qS.

To improve the performance of the new speech quality measure a band-specific weighting had to be applied to the individual peripherally filtered channels of the internal representation. The optimal weighting characteristic showed constant weights for center frequencies below 1 kHz and increasing weights for increasing frequencies which strongly emphasized the highest channels. A phenomenological explanation for this unusual characteristic was given. Without the weighting, the objective quality of the group of MNRU-processed sentences is generally overestimated. The performance of the measure is increased because weightings emphasizing the critical bands at higher frequencies shift the objective measure of the MNRU-processed sentences towards lower quality. This effect results from the flat spectrum of the modulated noise distortion that is not present in codec-processed sentences


in the high frequency region.It was shown, that the optimal parameter set resulting from psychoacous-

tical modeling yields the highest performance of the resulting speech qualitymeasure. The nonlinear adaptation stage within the preprocessing model, in-cluding the feedback loops, were identified as the salient feature of the modelto account for the realistic modeling of the dynamic compression propertiesas well as temporal adaptation and masking effects in the auditory system.However, the choice of the exact parameter values has a smaller influence onthe performance of the measure qC .

The comparison of the new speech quality measure with the Perceptual Speech Quality Measure (PSQM) in the version standardized as P.861 by the ITU showed that both measures are comparable in their performance. P.861 uses a simpler signal preprocessing algorithm but, on the other hand, a more refined distance calculation algorithm than the new measure presented here.

It was the purpose of this study to investigate whether the previously developed preprocessing model could be applied to obtain a perceptually adequate signal representation for the objective measurement of speech quality. The results showed that it was indeed possible to measure speech quality objectively with a high performance.

In conclusion, the results of this study show that the “perception model” serves as a general model for auditory signal detection and speech perception with a wide applicability. Effects at the threshold of detection can be modeled as well as phenomena of speech perception that are related to clearly audible supra-threshold signal properties.


Chapter 3

Perception of band-specific speech quality distortions: detection and pairwise comparison ¹

Abstract

The relative importance of different critical bands for the perception of speech transmission quality is investigated. Two algorithms are introduced that generate a band-specific modulated-noise distortion in the speech signal. In a first experiment, detection thresholds were measured as a function of the center frequency of the band used for generating the distortion. In the second experiment, the pairwise speech quality preference was assessed at 4 different center frequencies for 3 different levels of the modulation depth. These modulation depths were selected relative to the respective detection thresholds obtained from the first experiment. The subjective results were compared with predictions obtained from the objective speech quality measure qC described in the previous chapter.

The detection thresholds were modeled by assuming a constant value of qC at threshold. With a constant weighting of the filter channels of the internal representation, a constant value qC = 0.9975 was found to predict subjects’ thresholds with only small deviations. A weighting with increasing weights for increasing center frequencies of the internal representation yielded optimal performance of qC for the speech quality prediction of various codec selection test data bases in previous studies. However, qC with this weighting cannot account for the relative importance of individual bands for the perception of speech quality found in this study. Pairwise speech quality preference ratings were modeled by the difference ∆qC of the measure qC with constant spectral weighting, but not with a spectral weighting increasing with frequency. The accuracy of the prediction was found to be higher for the broadband than for the narrowband distortion algorithm.

In conclusion, a constant spectral weighting appears appropriate for detection and quality comparison tasks, whereas the absolute assessment of speech quality distortions appears to put more emphasis on high-frequency bands.

¹ A modified version of this chapter has been submitted for publication in ACUSTICA united with acta acustica: Hansen and Kollmeier (1998), “Perception of band-specific speech quality distortions: detection and pairwise comparison”.


3.1 Introduction

Several objective speech transmission quality measures have been described in the literature (Mermelstein, 1979; Wang et al., 1991; Herre et al., 1992; Baillard et al., 1992; Beerends and Stemerdink, 1994b; Hansen and Kollmeier, 1997b; see chapter 2 of this thesis). They usually analyze the transmitted speech signal in different bands with respect to the nonlinear distortions introduced by the transmission channel (e.g., involving low bit-rate speech coding). In order to derive a prediction of the perceived overall speech transmission quality, the contributions from these different frequency bands have to be weighted and summed up. This study attempts to derive appropriate frequency weighting functions by considering the frequency-dependent perception of speech quality distortions in humans.

In the experiments described in chapter 2, several frequency-dependent weightings for the different peripherally filtered and nonlinearly adapted subbands of the speech signal were tested. The optimal choice (with respect to optimizing the performance of the objective speech quality measure qC) proved to be a characteristic with increasing weights for increasing center frequency above 1 kHz and constant weights below 1 kHz. The performance criterion was a high correlation coefficient and rank correlation coefficient as well as a small standard deviation between the subjective MOS speech quality data and the objective measure qC. However, this choice lacks a psychophysical verification of the “optimal” frequency weighting for speech quality assessment.

The existence of a non-uniform weighting for individual auditory bands is a concept well-known from classical speech intelligibility measures such as the articulation index (AI) (Kryter, 1962), the speech transmission index (STI) (Steeneken and Houtgast, 1980; Houtgast and Steeneken, 1985) and related measures like the speech intelligibility index (SII, ANSI, S 3.79), and the modulation-based speech intelligibility index (MTI, Koch (1992)). These measures usually incorporate a band-weighting for the 7 octave bands with center frequencies from 125 Hz to 8 kHz. The general shape of the weightings employed puts an emphasis on the middle frequencies and employs lower weights at the low and high frequency end. In contrast, the optimal weights found in chapter 2 increase for frequencies above 1 kHz and reach the maximum for the highest filter channel at 3.8 kHz (cf. the weighting labeled “low1k-inv” in chapter 2, section 2.5.1). In the following, this increasing weighting characteristic is denoted by the symbol Winc.

In this context it is interesting to test the assumption that the perception of speech quality involves a certain weighting of the peripheral filter outputs that constitute the “internal representation” of a speech signal. In addition, the perceptual relevance of a weighting characteristic that increases with frequency should be evaluated. In this chapter, experiments are described that were aimed at measuring the importance of individual frequency bands for speech quality perception. To do so, speech signal distortions were generated that were band-specific in some sense, and that were comparable to the distortions of the low-bit rate coded signals contained in the test data bases of the previous chapter. The rationale behind this was to investigate the influence of the spectral region of a signal distortion on the detection and on the subjective quality assessment of the speech stimuli.

Detection thresholds of various band-limited modulated noise distortions were measured in a 2-AFC experiment. In a further experiment, the subjective quality of supra-threshold distortions was assessed. The experimental data were compared with objective model predictions, obtained with the measure qC for different spectral weightings.

3.2 Experimental setup

3.2.1 Speech stimulus material and test signal generation

The speech material was selected from the reference material of the German language ETSI Halfrate Selection Test. These sentences are part of the corpus “Sätze für Sprachgütemessungen mit deutscher Sprache” (Sotscheck, 1984), provided by the Deutsche Telekom AG. The material was sampled at 16 kHz and subsequently telephone-band-pass filtered by the “modified Intermediate Reference System” (IRS) filter (ITU-T, 1996c). This filter is specifically designed and standardized to have an average spectral telephone transmission characteristic. All sentences were calibrated to an equal Active Speech Level (ASL) (ITU-T, 1996c) of -30 dBov.

In order to cover a maximum range of assessment variability across speech stimuli with a minimum set of test sentences, a pre-experiment was performed: From the reference material, five sentences of each of the two male (m1, m2) and two female speakers (f1, f2) were initially selected. The broadband detection thresholds were obtained with 3 subjects for the 20 sentences presented in an interleaved mode (Hansen and Kollmeier, 1997a). On the basis of the most dissimilar thresholds across these sentences, one sentence of speaker m1 and one sentence of speaker m2 were chosen as reference input stimuli for the following experiments.

All test stimuli were digitally generated at a sampling frequency of 16 kHz. Signal generation and presentation were controlled with an SGI workstation using the signal processing software SI developed at the Drittes Physikalisches Institut in Göttingen. The 16 bit digitized signals were D/A converted using the built-in board and amplified with an external computer-controlled amplifier. The speech stimuli were presented to the subjects via Sennheiser HDA 200 headphones in a sound-insulated chamber. The listening level was adjusted to an active speech level of 75 dB SPL.

A controlled speech quality degradation was generated by means of a Modulated Noise Reference Unit (MNRU) (CCITT, 1989). This device is commonly used as a reference system for speech quality assessment (cf. chapter 2, section 2.2). The amount of distortion introduced by this MNRU is controlled by a single parameter. The percept associated with the distortion is a good approximation to that of typical low-bit rate speech coding devices. The principle of the MNRU is shown in Fig. 3.1. The MNRU modulates the input speech signal s(t) according to x(t) = s(t) · (1 + m · n(t)), where n(t) is a white noise with unity variance and m is the modulation depth. For the narrowband (telephone-band) MNRU, the modulated signal x(t) is subsequently band-pass filtered to telephone bandwidth from 100 to 3400 Hz. In the ITU standard (CCITT, 1989), the parameter Q controlling the MNRU is defined as the negative of the modulation depth expressed in dB, Q = −20 log10 m. This leads to values of Q with opposite sign compared to standard modulation detection experiments where 20 log m < 0.

[Figure 3.1 (block diagram): input, 0−20 kHz noise source, attenuator or amplifier (−Q dB), attenuator/amplifier, filter 100−3400 Hz, output.]

Figure 3.1: Block diagram of the Modulated Noise Reference Unit (MNRU), (CCITT, 1989). The speech input signal is modulated by a wide-band white noise. According to the standardization, the modulation depth m and the MNRU-parameter Q are related by Q = −20 log10 m.
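The modulation rule x(t) = s(t) · (1 + m · n(t)) and the relation Q = −20 log10 m can be sketched in a few lines. This is a minimal illustration, not the standardized device: the noise is assumed to be Gaussian, the telephone band-pass filter is omitted, and the function names are hypothetical.

```python
import math
import random


def q_to_modulation_depth(q_db):
    """Convert the MNRU parameter Q (in dB) to the modulation depth m via Q = -20 log10(m)."""
    return 10.0 ** (-q_db / 20.0)


def mnru(signal, q_db, rng=None):
    """Return x[k] = s[k] * (1 + m * n[k]) with unit-variance white noise n.

    Assumption: n is Gaussian; the text only requires white noise with unity variance.
    The 100-3400 Hz band-pass of the narrowband MNRU is not applied here.
    """
    rng = rng or random.Random()
    m = q_to_modulation_depth(q_db)
    return [s * (1.0 + m * rng.gauss(0.0, 1.0)) for s in signal]


# Example: a 100 ms, 440 Hz tone at the 16 kHz sampling rate used for the stimuli.
fs = 16000
tone = [math.sin(2.0 * math.pi * 440.0 * k / fs) for k in range(fs // 10)]
distorted = mnru(tone, q_db=20.0, rng=random.Random(0))
```

Because the noise multiplies the signal, the distortion vanishes wherever the speech signal is zero, and larger values of Q correspond to a weaker distortion.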

3.2.2 Generation of band-specific distortions

Narrowband distortion

For the generation of a narrowband distortion the following two steps are performed:

1. The input speech signal is modulated by the MNRU, with a modulation depth of Q dB, and subsequently band-pass filtered at a given center frequency fc between 0.9 · fc and 1.1 · fc, i.e., approximately within a critical bandwidth.

2. The input signal is notch filtered, with corresponding center frequency and bandwidth of the notch as in 1., and added to the band-pass filtered signal (see Fig. 3.2).

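[Figure 3.2 (block diagram): input speech signal → MNRU (Q dB) → band-pass filter, in parallel with input speech signal → notch filter; both paths are added to form the output.]

Figure 3.2: Schematic signal generation for narrowband distortion.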

The distortion is thus spectrally limited to only one critical band (“narrowband distortion”). It carries the information of the full broadband speech signal. For higher modulation depths this narrowband distortion subjectively exhibits a tonal percept with a pitch corresponding to the center frequency fc. However, this type of signal distortion perceptually differs from those of low-bit rate speech coders, since it sounds similar to an additive narrowband noise signal centered at fc. Pre-investigations showed that it is hardly possible to generate a parameterizable distortion that subjectively sounds similar to that of typical low-bit rate speech coders and that is band-limited to a predefined region at the same time.


Broadband distortion

For the generation of a broadband distortion the following two steps are performed:

1. The input speech signal is band-pass filtered at a given center frequency fc between 0.9 · fc and 1.1 · fc, i.e., approximately within a critical bandwidth, and subsequently modulated by the MNRU, with a modulation depth of Q dB.

2. The input signal is notch filtered, with corresponding center frequency and bandwidth as in 1., and added to the modulated signal (Fig. 3.3).

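[Figure 3.3 (block diagram): input speech signal → band-pass filter → MNRU (Q dB), in parallel with input speech signal → notch filter; both paths are added to form the output.]

Figure 3.3: Schematic signal generation for broadband distortion.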

The distortion is thus broadband, carrying information from only one critical band of the speech signal. This kind of signal distortion does not produce a tonal percept but sounds very similar to the standard MNRU. The difference to the standard MNRU is that the audibility of the modulated noise depends on the presence of speech energy in the critical band at fc and its level relative to the energies in the adjacent bands. In this respect, the distortion can be viewed as band-specific.
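The two generation schemes differ only in the order of modulation and band-pass filtering, which the following sketch makes explicit. It is an illustration under stated assumptions: the band-pass and notch filters are idealized as complementary brick-wall filters implemented with a naive DFT (suitable for short signals only), the noise is Gaussian, and all function names are hypothetical. The filters actually used in the experiments are not specified here.

```python
import cmath
import math
import random


def dft(x):
    """Naive discrete Fourier transform, O(n^2); adequate for short demo signals."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * math.pi * f * k / n) for k in range(n))
            for f in range(n)]


def idft(spec):
    """Inverse of dft(); returns the real part, valid for conjugate-symmetric spectra."""
    n = len(spec)
    return [sum(spec[f] * cmath.exp(2j * math.pi * f * k / n) for f in range(n)).real / n
            for k in range(n)]


def split_band(x, fs, fc):
    """Split x into (band, rest): band keeps 0.9*fc..1.1*fc, rest is the notch-filtered signal."""
    n = len(x)
    spec = dft(x)
    band_spec, rest_spec = [], []
    for f in range(n):
        freq = min(f, n - f) * fs / n  # fold negative-frequency bins onto positive ones
        inside = 0.9 * fc <= freq <= 1.1 * fc
        band_spec.append(spec[f] if inside else 0.0)
        rest_spec.append(0.0 if inside else spec[f])
    return idft(band_spec), idft(rest_spec)


def modulate(x, q_db, rng):
    """MNRU-style modulation x * (1 + m * n) with m = 10^(-Q/20); Gaussian n is an assumption."""
    m = 10.0 ** (-q_db / 20.0)
    return [s * (1.0 + m * rng.gauss(0.0, 1.0)) for s in x]


def narrowband_distortion(s, fs, fc, q_db, rng):
    """Modulate first, then band-pass; add the notch-filtered clean input."""
    band_of_modulated, _ = split_band(modulate(s, q_db, rng), fs, fc)
    _, notched_input = split_band(s, fs, fc)
    return [b + r for b, r in zip(band_of_modulated, notched_input)]


def broadband_distortion(s, fs, fc, q_db, rng):
    """Band-pass first, then modulate; add the notch-filtered clean input."""
    band, notched_input = split_band(s, fs, fc)
    return [b + r for b, r in zip(modulate(band, q_db, rng), notched_input)]
```

With these idealized filters, the error signal of the narrowband scheme is confined to the band around fc, whereas the error of the broadband scheme spreads over the whole spectrum but is driven only by the speech energy inside that band.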

3.2.3 Detection of band-specific distortions

In two sub-experiments for the two different test signal generation schemes described in subsection 3.2.2 (narrowband and broadband distortion), the modulation depth m at threshold was measured as a function of the center frequency fc.

An adaptive 2-AFC 1-up 2-down paradigm was used to measure the 70.7% point of the psychometric function for correct responses (Levitt, 1971). In each trial, one randomly chosen interval contained the reference signal and the other interval the distorted test signal. The two intervals were separated by a silent interval of 500 ms. The subjects’ task was to specify the interval containing the test signal. At the beginning of each run, the modulation depth was set to Q = 0 dB. The step size was 4 dB at the beginning and was halved after every reversal until the minimum of 1 dB was reached. With this step size, four reversals were obtained and their median value was computed as a threshold estimate. The subjects received visual feedback after each trial.
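The adaptive track described above can be sketched as follows. The sketch assumes that two consecutive correct responses raise Q (weaker distortion, harder trial) and one incorrect response lowers it, and it replaces the listener by a caller-supplied function; `staircase_1up_2down` and `is_correct` are hypothetical names.

```python
import statistics


def staircase_1up_2down(is_correct, q_start=0.0, step=4.0, min_step=1.0, n_reversals=4):
    """Run an adaptive 1-up 2-down track on the MNRU parameter Q (in dB).

    is_correct(q) -> bool returns the outcome of one 2-AFC trial at level q.
    The step size is halved at every reversal until min_step is reached; the
    median of n_reversals reversals obtained at min_step is the threshold estimate.
    """
    q = q_start
    correct_in_a_row = 0
    last_direction = 0       # +1: Q was raised (harder), -1: Q was lowered (easier)
    final_reversals = []     # Q values at reversals measured with the minimum step size
    while len(final_reversals) < n_reversals:
        if is_correct(q):
            correct_in_a_row += 1
            if correct_in_a_row < 2:
                continue     # a second correct response at this level is required
            direction = +1
        else:
            direction = -1
        correct_in_a_row = 0
        if last_direction and direction != last_direction:   # track reversal
            if step <= min_step:
                final_reversals.append(q)
            else:
                step = max(step / 2.0, min_step)
        last_direction = direction
        q += direction * step
    return statistics.median(final_reversals)


# Idealized deterministic listener who detects the distortion whenever Q <= 20 dB.
estimate = staircase_1up_2down(lambda q: q <= 20.0)
```

For a real listener the responses are probabilistic, and the track then converges on the 70.7% point of the psychometric function rather than on a sharp boundary.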

The threshold was measured individually using each of the two reference sentences (m1 and m2) at the frequencies fc = 300, 500, 700, 1000, 1500, 2000, 2500, 3000, and 3500 Hz. In one run, one reference sentence (m1 or m2) was measured at a time.

3.2.4 Assessment of band-specific distortions by paired comparison

The purpose of this experiment was to identify the influence of level and frequency of supra-threshold band-specific modulated noise distortions on the perceived speech quality. In two sub-experiments for the narrowband and the broadband modulation schemes (both described in subsection 3.2.2), a speech quality assessment was performed by means of a paired comparison method.

The results from the previous threshold experiments (subsection 3.2.3) were used to determine the mean value of the modulation depth parameter at threshold mthres(fc) at the frequencies fc = 500, 1500, 2500, and 3500 Hz. This was done separately using the two reference sentences (m1 and m2).

For each of the four center frequencies, three test signals were generated from the reference sentence by employing the band-specific noise modulation at the supra-threshold levels Q = −(mthres(fc) + 5), −(mthres(fc) + 10), and −(mthres(fc) + 15) dB. In this way, 12 test sentences containing audible supra-threshold signal distortions were generated from each of the two reference sentences.

A complete paired comparison method was applied to assess the subjective speech quality preference of the 12 stimuli. During one run of the experiment, all $\binom{12}{2} = 66$ possible pairs of test stimuli were presented to the subjects in random order. In each trial, the test stimuli were separated by a silent interval of 300 ms. The subject’s task was divided into two steps: First, the subject had to specify the interval which was preferred with respect to the (overall) speech quality. Second, the subject had to rate the “confidence” of his/her preference on a numerical integer scale from 0 (indicating undecided, i.e., stimuli were perceived as being almost equal in speech quality) to 3 (indicating a clear decision for the distinct preference for one of the stimuli). The interpretation of the “strength” of the numbers 1, 2, and 3 was left open to the subjects.
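The trial list of one run of the complete paired comparison can be sketched as follows. The 12 stimuli are identified by center frequency and level above threshold; randomizing which stimulus of a pair is played first is an assumption of this sketch, and `build_trials` is a hypothetical name.

```python
import itertools
import random

CENTER_FREQUENCIES_HZ = (500, 1500, 2500, 3500)
LEVELS_ABOVE_THRESHOLD_DB = (5, 10, 15)


def build_trials(rng):
    """Return the 12 stimuli and all 66 unordered pairs in random presentation order."""
    stimuli = list(itertools.product(CENTER_FREQUENCIES_HZ, LEVELS_ABOVE_THRESHOLD_DB))
    pairs = [list(pair) for pair in itertools.combinations(stimuli, 2)]
    rng.shuffle(pairs)           # random order of the trials within one run
    for pair in pairs:
        rng.shuffle(pair)        # assumption: random order of the two intervals
    return stimuli, [tuple(pair) for pair in pairs]


stimuli, trials = build_trials(random.Random(0))
```

Each stimulus appears in 11 of the 66 trials, so every condition is compared once with every other condition.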

For each of the two signal generation schemes, each subject participated in one run for each of the two reference sentences.

3.2.5 Subjects

All subjects participated voluntarily in the experiments. They had some prior experience with psychoacoustical experiments. They had clinically normal hearing as defined by a hearing loss < 15 dB at all audiological standard frequencies and no history of hearing problems.

Five male subjects aged between 25 and 34 years participated in the detection experiments. All five subjects participated in the narrowband sub-experiment, and three of the subjects participated in the broadband sub-experiment.

Two female and nine male subjects aged between 24 and 34 years participated in the quality assessment experiment by pairwise comparison. All eleven subjects participated in the narrowband sub-experiment and ten participated in the broadband sub-experiment.

3.3 Results

3.3.1 Detection thresholds

The results of the detection threshold experiments are displayed in Fig. 3.4 and 3.6 for the narrowband and the broadband experiments, respectively. Median and interquartile values of the modulation depth at threshold are plotted individually for each subject.

Narrowband distortion

For the narrowband distortion scheme, a maximum threshold at 500 Hz is observed for both sentences (cf. Fig. 3.4). For sentence m1 (upper panel), three subjects exhibit significantly higher thresholds at 3500 Hz than the other two subjects. For sentence m2 (lower panel), one subject shows an elevated threshold at 3500 Hz compared to the other three subjects. With the exception of this large inter-individual variability at the highest frequency considered, the threshold data show a common characteristic across subjects for each of the sentences. For comparison, the average power spectra of the two reference sentences used in the experiments are plotted in Fig. 3.5.


[Figure 3.4 (two panels, “narrow band distortion, sentence 1” and “narrow band distortion, sentence 2”): −Qthres = 20 log mthres [dB] (−30 to 0) versus center frequency [Hz] (0 to 3500); legend: subjects MH, TD, JA, JV, RB.]

Figure 3.4: Detection thresholds for the narrowband band-specific distortion. The median and interquartile values of the modulation depth at threshold are displayed for five subjects. Upper panel: reference sentence 1, lower panel: reference sentence 2.


Obviously, the threshold data for both sentences in Fig. 3.4 correspond well with the respective average spectra in Fig. 3.5 for frequencies below 3 kHz.

[Figure 3.5 (two panels, “spectral power density, sentence 1” and “spectral power density, sentence 2”): rel. level [dB] (−30 to 40) versus frequency [Hz] (0 to 4000).]

Figure 3.5: Average spectral energy density of telephone band-pass filtered speech, calculated for the two reference sentences used in the experiments. Left panel: sentence m1, right panel: sentence m2.

As noted above (see 3.2.2), the narrowband noise distortion is quite similar to an additive noise centered at the respective frequency fc. Hence, the observed data can be interpreted in terms of typical masking experiments: While the speech signal takes the role of the masker, the modulated band-pass filtered noise is the signal to be detected. Thus, the spectral shape of the masking speech sample determines the frequency-dependent masking pattern of the narrowband test signal.

Broadband distortion

For the broadband distortion scheme, the threshold data show a different characteristic for the two reference sentences (cf. Fig. 3.6). For sentence m1 (upper panel), a very low threshold is observed at fc = 500 Hz. The threshold increases up to fc = 2 kHz, drops at 2.5 kHz and is close to 0 dB at fc = 3.5 kHz. For sentence m2 (lower panel), the threshold at fc = 500 Hz is not as low as for sentence m1. The thresholds show a slight drop at 1.5 kHz. They are maximal at fc = 2.5 kHz and drop considerably for the two highest test frequencies. In fact, the data shown here appear to partly “mirror” the data of the narrowband experiment (cf. Fig. 3.4) and the average speech spectra (cf. Fig. 3.5). One explanation is that the broadband distortion controlled by a spectral region with high energy is detectable at any frequency with little spectral energy of the speech signal. Therefore, spectral peaks in the speech waveform will produce minima in the observed threshold data and vice versa. It is noteworthy that the distortion may be detectable at different temporal positions and with different duration for each test center frequency and for each individual reference sentence. The results will be explained further using a modeling approach described in section 3.4.1.

[Figure 3.6 (two panels, “broad band distortion, sentence 1” and “broad band distortion, sentence 2”): −Qthres = 20 log mthres [dB] (−25 to 0) versus center frequency [Hz] (0 to 3500); legend: subjects MH, RK, VK.]

Figure 3.6: Detection thresholds for the broadband band-specific distortion. The median and interquartile values of the modulation depth at threshold are displayed for three subjects. Upper panel: sentence m1, lower panel: sentence m2.

3.3.2 Speech quality assessed by paired comparison

For the description of the preference data, the following notation is introduced:

Let $R_{i,j}$ denote the preference rating defined as

$$R_{i,j} = \begin{cases} 1, & \text{sentence } i \text{ is preferred over } j \\ -1, & \text{sentence } j \text{ is preferred over } i \end{cases} \quad \text{for } i \neq j. \tag{3.1}$$

Let $C_{i,j}$ denote the confidence rating given for the sentence pair $(i, j)$. Since either the pair $(i, j)$ or the pair $(j, i)$ was presented to the subjects, it is assumed that

$$C_{i,j} = C_{j,i}, \qquad R_{i,j} = -R_{j,i}. \tag{3.2}$$

Using (3.2), the preference data of a subject can be represented in a 12×12 triangular matrix $P_{i,j}$,

$$P_{i,j} = \begin{cases} R_{i,j} \cdot C_{i,j}, & i < j \\ 0, & i \geq j \end{cases} \tag{3.3}$$

The matrix $P$ can then be transformed into a speech quality rank order for the respective 12 test signals:

$$\mathrm{rank}_i = \sum_{i<j,\; C_{i,j} \neq 0} R_{i,j} + \sum_{i<j,\; C_{i,j} = 0} \frac{1}{2}, \qquad i, j = 1 \ldots 12. \tag{3.4}$$

Note that in this definition of the rank score, the confidence rating is only considered when $C_{i,j} = 0$. In this case an equal weight of $\frac{1}{2}$ is assigned to the rank score of either sentence. This definition leads to a normalized rank score with

$$\sum_{i=1}^{12} \mathrm{rank}_i = \binom{12}{2} = 66, \tag{3.5}$$

i.e., the sum of all rank scores is equal to the number of preference ratings assessed by the subject in one run.
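A rank score with the stated normalization can be computed as in the following sketch. It rests on one reading of the definition, which is an assumption: each decided pair (confidence greater than 0) contributes 1 to the preferred sentence, and each undecided pair (confidence 0) contributes 1/2 to both sentences, so that every pair adds exactly 1 to the total. Indices run from 0 to 11 here, and `rank_scores` is a hypothetical name.

```python
def rank_scores(preference, n=12):
    """Compute rank scores from paired-comparison data.

    preference maps a pair (i, j) with i < j to (R, C):
      R = +1 if sentence i was preferred, -1 if sentence j was preferred,
      C = confidence rating from 0 (undecided) to 3.
    Assumed reading: a decided pair gives 1 point to the preferred sentence,
    an undecided pair gives 1/2 point to each sentence.
    """
    scores = [0.0] * n
    for (i, j), (r, c) in preference.items():
        if c == 0:
            scores[i] += 0.5
            scores[j] += 0.5
        elif r > 0:
            scores[i] += 1.0
        else:
            scores[j] += 1.0
    return scores


# Example: the lower-numbered sentence is always preferred, except for one undecided pair.
example = {(i, j): (1, 2) for i in range(12) for j in range(i + 1, 12)}
example[(0, 1)] = (1, 0)
scores = rank_scores(example)
```

Under this reading the maximum possible rank score of a sentence is 11, which is reached when it is preferred in all of its comparisons.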

For an overview of the subjective speech quality results, the rank scores for each set of the 12 test signals are shown in Fig. 3.7 for the narrowband and in Fig. 3.8 for the broadband distortion scheme.


With one exception (i.e., the rank order for sentence m2 for broadband distortion at fc = 3500 Hz), the results are very consistent: The condition with the respective least distortion is assigned the highest rank score and vice versa. In addition, there appears to be no systematic variation of the preference rank data across frequency. Note that the modulation depth parameters Q were selected to be equally spaced above the respective frequency-dependent threshold.

[Figure 3.7 (two panels, “sentence m1” and “sentence m2”): rank score (0 to 10) for the center frequencies cf = 500, 1500, 2500, and 3500 Hz, with three columns each labeled 5, 10, and 15.]

Figure 3.7: Rank scores of the 12 test sentences for narrowband modulated noise distortion. For each center frequency, the rank scores for the sentences with a relative increase of Q of +5 dB (left column), +10 dB (middle), and +15 dB (right column) are displayed. Left panel: sentence m1, right panel: sentence m2.

[Figure 3.8 (two panels, “sentence m1” and “sentence m2”): rank score (0 to 10) for the center frequencies cf = 500, 1500, 2500, and 3500 Hz, with three columns each labeled 5, 10, and 15.]

Figure 3.8: Rank scores of the 12 test sentences for broadband modulated noise distortion, displayed in the same way as in Fig. 3.7. Left panel: sentence m1, right panel: sentence m2.


3.4 Modeling predictions

In this section, model predictions on the basis of the objective speech quality measure qC developed in chapter 2 of this thesis are compared with the experimental data.

To quantitatively predict psychoacoustical detection and masking experiments, Dau et al. employed a likelihood ratio by calculating the cross correlation between an expected (supra-threshold) signal (“template”) and the received signal at the level of the “internal representation” of the signals (Dau, 1996; Dau et al., 1996a). If the likelihood ratio l exceeds a certain criterion c, the received signal is decided to contain the expected test signal in addition to the background masker signal.

On the basis of this decision rule, Dau’s model accounts for a human observer’s performance by running through the same adaptive 3-AFC (or 2-AFC) 1-up 2-down procedure to converge to the test parameter at threshold.

3.4.1 Modeling distortion detection thresholds

To understand the behaviour of qC as a function of fc and Q, a slightly different approach was introduced compared to Dau et al. (1996b). For each of the two signal distortion algorithms (narrow/broad) and the two reference sentences, the measure qC was computed at values of the test parameter Q spaced equidistantly with intervals of 2 dB within the range from Q = 0 dB to 28 dB.

Two different weighting characteristics used for the calculation of qC (cf. chapter 2, section 2.3.4 of this thesis) were tested: a constant weighting, and a weighting with increasing weights for frequencies f > 1 kHz. The latter resulted in the optimal prediction of the absolute speech quality for the test data bases investigated in chapter 2. In the following, these two weightings are denoted by Wconst and Winc, respectively. The result of this computation of qC as a function of Q and fc is illustrated in Fig. 3.9. Additionally, iso-qC-contours were interpolated from the data. They were plotted both into the qC data and also projected onto the horizontal Q-fc plane in the same figure.

The best prediction of the subjects’ threshold data plotted in Fig. 3.4 and Fig. 3.6 is obtained by computing the iso-qC-curves for a value of qC ≈ 0.9975, using the constant weighting Wconst.

Therefore, the threshold data and the interpolated functions Q(fc)|qC=0.9975 and Q(fc)|qC=0.997 were plotted into one graph. In Fig. 3.10 and Fig. 3.11 the results are shown for the narrowband and the broadband detection experiments, respectively.
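The interpolation of an iso-qC contour can be sketched for a single center frequency as follows. Linear interpolation between neighboring grid points is an assumption, since the interpolation method is not specified; `iso_contour_q` is a hypothetical name, and `toy_qc` is a made-up monotonic stand-in for the real measure, used only to make the example runnable.

```python
def iso_contour_q(q_grid, qc_values, target):
    """Return the Q at which qc first crosses target, by linear interpolation.

    q_grid and qc_values are equally long sequences (qc evaluated on the Q grid).
    Returns None if the target value is not crossed on the grid.
    """
    points = list(zip(q_grid, qc_values))
    for (q0, c0), (q1, c1) in zip(points, points[1:]):
        if c0 != c1 and (c0 - target) * (c1 - target) <= 0:
            return q0 + (target - c0) * (q1 - q0) / (c1 - c0)
    return None


def toy_qc(q_db):
    """Hypothetical stand-in for qC: approaches 1 as the distortion gets weaker."""
    return 1.0 - 0.02 * 10.0 ** (-q_db / 10.0)


q_grid = list(range(0, 29, 2))                     # Q = 0, 2, ..., 28 dB as in the text
q_at_threshold = iso_contour_q(q_grid, [toy_qc(q) for q in q_grid], target=0.9975)
```

Repeating this for every center frequency fc yields one point of the function Q(fc) per frequency, which can then be plotted together with the measured thresholds.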


[Figure 3.9 (surface plot, “narrowband detection exp., sentence 1, const. weighting”): qC (0.98 to 1) versus mod. depth [dB] (0 to 25) and center freq. [kHz] (0 to 3.5); iso-qC contours at 0.999, 0.998, 0.997, 0.996, and 0.995.]

Figure 3.9: Objective speech quality measure qC (with the constant weighting Wconst) calculated for all center frequencies fc and all modulation depths Q on a regular grid for the narrowband distortion for sentence m1. Additionally, the iso-qC contour levels are projected onto the frequency-Q plane.

A good prediction of the experimental thresholds by the function Q(fc)|qC=0.9975 is observed for all four combinations of the narrowband and broadband distortion with the two reference sentences. The assumption that, for various signal detection tasks, qC takes a constant value at the detection threshold is in line with Dau et al.’s approach to model threshold data (see Appendix B).

In Fig. 3.12 and 3.13 the functions Q(fc)|qC=const. with the weighting Winc are shown for the levels qC = 0.99, 0.98, . . . , 0.95. It is obvious from the graphs that the Winc weighting is very inappropriate for modeling the threshold data. Note that for the same test signals as before, the range covered by qC is stretched out by a factor of ≈ 10 compared to Fig. 3.10 and 3.11 for constant weighting.

It can be concluded that the standard preprocessing model with constant spectral weighting across filter channels provides a good fit to the detection threshold data obtained in this study while the weighting Winc fails to do so. Note that a constant weighting was also employed by Dau et al. in their approach to model thresholds in a variety of different detection tasks.

The increasing weighting Winc, on the other hand, was successfully introduced in order to obtain a better prediction of subjective speech transmission quality data (Hansen and Kollmeier (1997b), see chapter 2).


[Figure 3.10, two panels: "narrow band detection exp., sentence 1" and "narrow band detection exp., sentence 2". Axes: modulation depth [dB] (−25 to 0) versus center frequency [Hz] (0 to 3500). Curves: qC = 0.997, qC = 0.9975, and the experimental thresholds.]

Figure 3.10: Threshold data and interpolated iso-qC contours Q(fc)|qC=0.997 and Q(fc)|qC=0.9975 for the narrowband detection experiment. Left panel: sentence m1, right panel: sentence m2.

[Figure 3.11, two panels: "broad band detection exp., sentence 1" and "broad band detection exp., sentence 2". Axes: modulation depth [dB] (−25 to 0) versus center frequency [Hz] (0 to 3500). Curves: qC = 0.997, qC = 0.9975, and the experimental thresholds.]

Figure 3.11: Threshold data and interpolated iso-qC contours Q(fc)|qC=0.997 and Q(fc)|qC=0.9975 for the broadband detection experiment. Left panel: sentence m1, right panel: sentence m2.


The use of different weighting functions for different goals in modeling human perception will be discussed below (see section 3.5).

[Figure 3.12, two panels: "narrow band distortion, sentence 1, incr. weighting" and "narrow band distortion, sentence 2, incr. weighting". Axes: modulation depth [dB] (0 to −25) versus center freq. [kHz] (0 to 3.5).]

Figure 3.12: Iso-qC functions Q(fc)|qC=const. projected to the Q-fc plane for levels of qC = 0.99 (lowest curve), 0.98, 0.97, 0.96, and 0.95 (uppermost curve) for the narrowband distortion. Left panel: sentence m1, right panel: sentence m2.

[Figure 3.13, two panels: "broad band distortion, sentence 1, incr. weighting" and "broad band distortion, sentence 2, incr. weighting". Axes: modulation depth [dB] (0 to −25) versus center freq. [kHz] (0 to 3.5).]

Figure 3.13: Iso-qC functions Q(fc)|qC=const. projected to the Q-fc plane for levels of qC = 0.99, 0.98 (curves drawn for fc > 2.5 kHz), and 0.97, 0.96, and 0.95 (drawn for fc < 2.5 kHz) for the broadband distortion. Left panel: sentence m1, right panel: sentence m2.

3.4.2 Modeling speech quality by pairwise comparison

Similarly to the modeling of thresholds in the previous section, for each of the (fixed) test signals employed in the subjective experiments, the objective speech quality measure qC was calculated, once with constant weighting, and once with the weighting Winc. For each pair of sentences, the corresponding difference ∆qC was computed in order to compare the subjective preference data described in section 3.3.2 with model predictions.

The pairwise preference ratings averaged across subjects were plotted against the corresponding ∆qC for each pair of test sentences presented to the subjects. The data for ∆qC with constant weighting are plotted in Fig. 3.14 for the narrowband and in Fig. 3.15 for the broadband preference experiment.
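The comparison described above can be sketched in a few lines. The sketch assumes, based on the condition labels in the rank plots of this chapter, that the 12 test sentences per reference sentence are the combinations of four center frequencies and three modulation depths above threshold, which gives the 66 pairs shown in the preference figures. The sign convention of ∆qC (first minus second sentence of a pair) and all function names are assumptions of this illustration; the correlation functions are the textbook definitions of Pearson's r and Spearman's rank correlation rs.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient r of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation rs: Pearson's r of the ranks."""
    return pearson(ranks(x), ranks(y))


def delta_qc(qc, pairs):
    """Difference of qC for each pair (assumed sign: first minus second)."""
    return [qc[a] - qc[b] for a, b in pairs]


# Conditions: (center frequency in Hz, modulation depth above threshold in dB).
conditions = [(fc, dq) for fc in (500, 1500, 2500, 3500) for dq in (5, 10, 15)]
pairs = list(combinations(conditions, 2))
```

Given a dictionary `qc` keyed by condition and a list of averaged preference ratings in the same pair order, `pearson(delta_qc(qc, pairs), preferences)` and `spearman(delta_qc(qc, pairs), preferences)` would give r and rs for one panel.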

The comparison of the data in Fig. 3.14 and 3.15 shows that the inter-individual standard deviations for each preference rating are generally larger for the narrowband distortion than for the broadband distortion experiment. This corresponds to the reports of some subjects that the narrowband distortion was more difficult to assess. For the narrowband experiment, a reasonable correlation is found for the preference data of sentence m1 (upper panel). For sentence m2 (lower panel), a smaller correlation is observed and larger deviations of the subjective preferences occur for sentence pairs with almost equal distance ∆qC. For this condition (sentence m2 and narrowband distortion), subjects most often reported difficulties with the assessment task. The separate group of data with ∆qC > 0.009 on the right side of the lower panel of Fig. 3.14 belongs to conditions where fc = 3.5 kHz and Q = Qthres + 15 dB in one sentence of a test pair, i.e., the highest modulation depth at the highest test frequency. Together with a local maximum of the spectral energy of sentence m2 around 3.5 kHz (cf. Fig. 3.5), this leads to a partial spread of noise energy into adjacent bands of the internal representation which results in a reduction of qC, i.e., an increase in ∆qC relative to another test sentence. This observation of a separate group of data can be explained as an “edge-effect”: The subjective quality data are limited by a lower (and an upper) bound while the objective measure qC does not have the same limit. Hence, for two test sentences that are both subjectively assessed at the lower bound of the quality rating scale, qC may still take on considerably different (small) values.

In the broadband condition, a higher correlation between subjective preferences and ∆qC is observed, except for a separate group of data for sentence m1 where ∆qC < −0.01 (left side of the upper panel of Fig. 3.15). A further inspection shows that all of these cases belong to sentence pairs with the conditions fc = 500 Hz, Q = Qthres(fc) + 15 dB in one sentence and fc ≠ 500 Hz in the other sentence of a pair. This observation can again be explained as an “edge-effect” as above: Sentence m1 has a maximum of the spectral energy density around 500 Hz (approximately 20 dB above the rest of the speech spectrum, cf. Fig. 3.5), which is much more pronounced in sentence m1 than in sentence m2. With the broadband distortion algorithm this leads to a correspondingly strong difference in the internal representation compared to all other test sentences with frequencies fc ≠ 500 Hz. In these cases, ∆qC is subjected to a constant shift due to the relatively small value of qC for the sentence with fc = 500 Hz, Q = Qthres(fc) + 15 dB. However, within this separate group of sentences, again a high correlation between ∆qC and the subjective preference is observed. For sentence m2 (lower panel of Fig. 3.15), which exhibits a much flatter spectral energy distribution, a remarkably high correlation is found between ∆qC and the subjective preference data.

[Figure 3.14, two panels: "narrowband experiment, sentence m1" (rs: 0.741, r: 0.708, SD: 0.955) and "narrowband experiment, sentence m2" (rs: 0.586, r: 0.570, SD: 0.907). Axes: subjective preference (−4 to 4) versus delta qC, constant weighting (−0.015 to 0.015).]

Figure 3.14: Subjective quality preference rating for all 66 sentence pairs plotted against the corresponding difference ∆qC for the same pair of sentences for the narrowband experiment. Upper panel: sentence m1, lower panel: sentence m2. The correlation coefficient (r), Spearman’s rank correlation coefficient (rs), and the standard deviation (SD) of the data are given at the bottom of each panel.

[Figure 3.15, two panels: "broadband experiment, sentence m1" (rs: 0.683, r: 0.628, SD: 1.270) and "broadband experiment, sentence m2" (rs: 0.923, r: 0.901, SD: 0.775). Axes: subjective preference (−4 to 4) versus delta qC, constant weighting (−0.03 to 0.03).]

Figure 3.15: Quality preference rating for all 66 sentence pairs plotted against the corresponding difference ∆qC for the same pair of sentences for the broadband experiment. Upper panel: sentence m1, lower panel: sentence m2.

The modeling results confirm that ∆qC with a constant weighting is an appropriate measure for the quality preference ratings of two test sentences.

The averaged subjective preference data were also compared with the corresponding ∆qC with the weighting Winc, which especially emphasizes the highest filter channels of the signal’s internal representation. These data are not plotted here. The entire set of data for ∆qC versus subjective preference does not correlate well, i.e., the rank correlation coefficient rs is at unsatisfactorily low values (|rs| < 0.01) for the narrowband distortion experiment and at rs < 0.56 for the broadband distortion experiment.

It is observed that the subjective preference versus ∆qC data form clearly separated groups for the narrowband distortion experiment. This can be explained by the influence of Winc: The highest weight at the highest frequency f = 3.8 kHz is 10 times larger than the weight at f = 1 kHz. Within qC, this weighting strongly emphasizes the differences in the highest bands of two internal representations. It leads to a relatively low prediction of the speech quality by qC for a stimulus that exhibits large distortions in this spectral region. The test sentences of the narrowband experiment with fc = 3.5 kHz represent such stimuli. Depending on the relative level of Q of these sentences, qC, and thus also ∆qC, will produce values in separated intervals, as observed. Distinct groups of data are also observed for the broadband experiment. The effect is, however, not as strong as in the narrowband experiment. This is to be expected, because the modulated-noise distortion is spread over the full bandwidth of the speech signal. Hence, the spectral weighting has a reduced effect compared to the narrowband experiment, where the distortion is present only in one critical band.

Taken together, the prediction of the subjective preference by ∆qC is unsatisfactory when the weighting Winc is applied to the internal representation.

In addition to the prediction of the pairwise preference rating by ∆qC for the corresponding sentences, the quality rank order was also predicted. To do so, the rank score computed from the subjective results as described in section 3.3.2 was compared with the measure qC with constant weighting calculated for the respective test sentence. The result is shown in Fig. 3.16 for the narrowband and in Fig. 3.17 for the broadband experiment. At each data point plotted in these figures, the center frequency fc and the relative modulation depth above threshold are indicated. The prediction of the rank by qC with constant weighting shows a degree of accuracy comparable to that observed above in the prediction of the pairwise preference by ∆qC.

A reasonable prediction of the subjective rank scores by qC cannot be obtained when employing the weighting Winc within the calculation of qC. In this case, the sentences with the condition fc = 3.5 kHz can clearly be identified as outliers within the data set.

This result indicates again that the high weights of Winc are inappropriate for the prediction of the preference or rank score data obtained in the experiments. Employing Winc within the measure qC results in overemphasized differences in the internal representations of the test sentences and thus in an underestimated speech quality.

3.5 Discussion

The subjective preference data in the narrowband distortion experiment showed a comparatively large standard deviation across subjects for most of the assessed sentence pairs, especially for sentence m2 (cf. section 3.4.2). The reason for this effect may be that narrowband distortion stimuli with two different test frequencies fc exhibit a different tonal pitch percept and are therefore more difficult to compare than in the broadband experiment. A similar effect is known, e.g., in loudness assessment experiments, when the test tone and the reference tone have a large spectral distance from each other (Gabriel, 1996). In this case subjects also reported difficulties in their task and showed larger inter-individual standard deviations. The larger deviations observed in speech quality preference data are one cause for their lower correlation with ∆qC compared to the broadband distortion experiment. They indicate that subjects do not always answer consistently in the paired comparison task with narrowband distortion stimuli. The narrowband distortion algorithm is therefore not optimal for assessing the subjective band-specific speech quality.

The spectral weighting characteristic Winc was found to be optimal for the application of qC to predict the Absolute Category Rating MOS of data bases containing codecs of several different types. This might suggest a different sensitivity for speech signal distortions prevailing in different spectral regions. However, the current modeling results for the detection of the two types of band-specific modulated noise distortions indicate that Winc is not appropriate for describing such a frequency-dependent sensitivity. Also, at least for the distortions employed in the current experiments, the modeling results point to a constant weighting characteristic for the relative importance of critical bands for the perception of speech quality preference. However, the relative importance of different bands found in this study might only apply to the two specific methods of introducing a distortion.

[Figure 3.16, two panels: "narrowband experiment, sentence m1" and "narrowband experiment, sentence m2". Axes: rank (0 to 10) versus qC (0.98 to 1). Data points are labeled with center frequency in Hz and modulation depth above threshold in dB, from 500-5 to 3500-15.]

Figure 3.16: Quality ranks for all 12 test sentences plotted against the corresponding qC with constant weighting for the narrowband experiment. At each data point, the center frequency fc and the relative modulation depth above threshold are indicated. Left panel: sentence m1, right panel: sentence m2.

[Figure 3.17, two panels: "broadband experiment, sentence m1" and "broadband experiment, sentence m2". Axes: rank (0 to 10) versus qC (0.97 to 1). Data points are labeled with center frequency in Hz and modulation depth above threshold in dB, from 500-5 to 3500-15.]

Figure 3.17: Quality ranks for all 12 test sentences plotted against the corresponding qC with constant weighting for the broadband experiment. At each data point, the center frequency fc and the relative modulation depth above threshold are indicated. Left panel: sentence m1, right panel: sentence m2.

The design of the preference rating experiments is justified by the modeling results obtained by qC with constant weighting: The generation of the stimuli was performed for test parameters Q selected at equidistant values relative to the detection thresholds. If speech quality perception incorporated a non-uniform spectral sensitivity across different bands of the internal representation, ∆qC with constant weighting would not be able to predict the preference and rank score data as observed. Although Winc was used in chapter 2 for predicting the Absolute Category MOS data, this does not constitute a contradiction: Obviously, the increasing weights of Winc account for a perceptual/cognitive difference in the speech quality assessment of typical codecs versus MNRUs. The spectral energy at the highest frequencies is dominant in the MNRU-processed signals compared to the codec-processed signals. The band-specific distortions introduced by the broadband type of distortion have approximately the same spectral shape, which does not depend on the generating frequency band. Hence, no frequency-dependent weighting function of the internal representation can be derived from the comparison between the subjective data and the prediction results. The narrowband distortion algorithm is not optimal for the investigation of the speech quality of low-bit-rate speech codecs because the narrowband distortions do not resemble those of the codecs. A different weighting across frequency should have emerged from the narrowband preference experiments because supra-threshold distortions at different spectral regions were compared with each other. The fact that a constant weighting is optimal to model the narrowband preference data shows that the increasing weighting for modeling of the Absolute Category ratings in chapter 2 accounts for a cognitive effect rather than for a property of the internal representation.

3.6 Summary and conclusion

In this chapter, detection threshold experiments and speech quality assessment experiments were described for two types of band-specific modulated noise distortions.

For two input sentence stimuli, distortion thresholds were measured as a function of the center frequency of the modulated noise algorithm. The thresholds could be modeled by the objective speech quality measure qC presented earlier in this thesis, when a constant weighting of the critical bands of the internal representation of the test signals is employed. Then, qC has a constant absolute value at threshold, which can be used to predict thresholds.

Taking the thresholds from the first experiment, a set of test signals was generated that contained the same modulated noise distortions at supra-threshold levels. The modulation depth levels were chosen at constant relative levels above the individual thresholds to ensure equal audibility of the signal manipulations at different center frequencies. The pairwise speech quality preference rating obtained from the second experiment could be modeled by the difference ∆qC of the corresponding qC of the sentences of each pair. The speech quality rank order could be predicted by qC of the respective sentences. Also for the prediction of the speech quality, which is a clear supra-threshold phenomenon, it was necessary to use the constant weighting, i.e., no weighting at all, in order to be able to predict the speech quality with good accuracy, whereas the weighting Winc fails to account for a hypothetical relative importance of the filter channels of the signal’s internal representation.

The results of this study give another confirmation for the assumption that the internal representation of a signal contains the relevant information not only for signal detection experiments but also for supra-threshold phenomena like speech quality.


Chapter 4

Continuous assessment of time-varying speech quality: method and model prediction¹

Abstract

This paper addresses the question of whether subjects are able to assess the perceived time-varying quality of speech material continuously. A method motivated by de Ridder and Hamberg [J. Opt. Soc. Am. A, 12, 2573–2577 (1995)] is introduced which is characterized by a subjective continuous rating of the perceived speech quality by moving a slider along a graphical scale. The usability of this method is illustrated with an experiment in which different sequences of sentences were degraded in quality with a Modulated Noise Reference Unit. The modulation depth was varied with time and the subject’s task was to assess the perceived quality. The results indicate that subjects can monitor speech quality variations very accurately with a delay of approximately 1 s. An objective speech quality measure based on an auditory processing model was applied to predict the subjective speech quality results. The speech quality measure qC described in chapter 2 of this thesis was modified to allow for time-dependent objective measurement of the speech quality. The averaged subjective response data could be modeled by the scale-transformed and lowpass filtered measure qC(t) with a high degree of accuracy.

¹ A modified version of this chapter has been submitted for publication in the J. Acoust. Soc. Am.: Hansen and Kollmeier (1998) “Continuous Assessment of time-varying speech quality”


4.1 Introduction

In the development and optimization of speech codecs in mobile telephone networks, it is essential to assess the perceived speech transmission quality of the system under test. A number of methods are currently available to assess the (subjective) quality of a speech transmission channel. For more recent reviews on several aspects of speech quality assessment see, e.g., (Quackenbush et al., 1988; Kitawaki, 1990; Sotscheck, 1992; Dimolitsas, 1993; Jekosch, 1993; Kroon, 1995). The ITU recommendations P.800 and P.830 (ITU-T, 1996a; 1996d) describe assessment methods that are applicable for the quality assessment of digital nonlinear speech transmission systems. The most common methods present subjects with short sequences of speech material and ask the subjects for their quality impression on a given rating scale. In these measurements, one rating is requested from the subject for one typical test signal. It consists of two short sentences separated by speech pauses.

This limitation to relatively short test stimuli hinders the applicability of these methods for the assessment of realistic time-varying systems. In these systems, time-dependent transmission conditions are commonly experienced, e.g., due to fading radio transmission or due to variable rate speech coding caused by changing load of DCME (Digital Circuit Multiplication Equipment) devices. Under these conditions, several questions arise: If the phenomenon of a short-time instantaneous quality percept exists, how is it rated by subjects? How is the overall quality impression related to the time-varying course of the short-time instantaneous quality?

In order to overcome these limitations, the first necessary step to answer these questions is to investigate if and to which degree subjects are able to assess the time-varying speech quality instantaneously. Experiments on image quality assessment have successfully shown this ability in the visual sensory domain (Hamberg and de Ridder, 1995; de Ridder and Hamberg, 1997). Although their results seem promising, it is not clear if subjects show the same ability in the auditory modality. It is known that the auditory system is “faster” than the visual system in the sense that we are able to perceive temporal changes in the range of a few milliseconds. In the visual system temporal intensity variations above 60 Hz are unnoticeable. On the other hand, speech quality perception might be associated with much more central and probably “slower” cognitive processes. In addition, the motor action of the response task might limit the applicability of corresponding auditory quality experiments. Experiments on continuous loudness judgement have been performed by Namba et al. (1988), Kuwano and Fastl (1989), Fastl (1991), Weber (1991), and others. In these studies, an experimental setup similar to the one described in this study was used for loudness judgement of sound stimuli with a length of up to 17 minutes. These studies aimed to find a relation between the averaged instantaneous loudness and overall loudness judgement.

The current study introduces a method for continuously assessing time-varying speech quality perception. The validity of the method is shown with an experiment that makes it possible to relate subjective time-varying quality results to a given controlled time-varying target quality (Hansen and Kollmeier, 1998a). Secondly, the subjective results are related to model predictions obtained from a modified objective speech quality measure (Hansen and Kollmeier, 1997b) based on a perception model (Dau et al., 1996a; 1996b).

The current method is motivated by similar experiments by Hamberg and de Ridder (1995, 1997, 1997) in the field of video image quality assessment. In this study, the emphasis is put on the general design and the demonstration of the feasibility of the continuous assessment method for speech quality assessment purposes.

One important prerequisite for the continuous assessment method is the subjects’ ability to reliably assess the quality of instantaneous events in speech transmission. Only then can the continuous rating in response to ongoing speech be understood in a causal manner. Therefore, in the first experiment, short isolated speech elements (i.e., segmented words taken from complete sentences) were presented to subjects with the task to rate the perceived short-time speech quality.

In the second experiment the subjects were presented with a long sequence of sentences and their task was to continuously rate the perceived instantaneous quality as closely as possible.

4.2 Experimental Setup

4.2.1 Speech stimulus material

The speech material was selected from the reference material of the German-language ETSI Halfrate Selection Test (ETSI, 1991; 1992). These sentences are part of the corpus “Sätze für Sprachgütemessungen mit deutscher Sprache” (Sotscheck, 1984), recorded by the Deutsche Telekom AG. The material was digitized with 16 bit resolution and sampled at 16 kHz and subsequently telephone-band-pass filtered by the modified Intermediate Reference System (IRS) filter (ITU-T, 1996c). All sentences were calibrated to an equal Active Speech Level (ASL) (ITU-T, 1996c) of -30 dBov. From the reference material, five sentences of each of the two male (m1, m2) and two female speakers (f1, f2) were chosen randomly.

The speech stimuli were diotically presented to the subjects via Sennheiser HDA 200 headphones in a sound-insulated chamber. The listening level was adjusted to an active speech level of 75 dB SPL.

4.2.2 Generation of quality degradation

A controlled speech quality degradation was generated by means of a Modulated Noise Reference Unit (MNRU) (CCITT, 1989). This device is commonly used as a reference system for speech quality assessment. The principle of the MNRU is shown in Fig. 4.1.

[Figure 4.1, block diagram. Blocks: Input, attenuator/amplifier, 0−20 kHz noise source, attenuator or amplifier (−Q dB), filter 100−3400 Hz, Output.]

Figure 4.1: Block diagram of the Modulated Noise Reference Unit (MNRU). The speech input signal is modulated by a wide-band white noise. The modulation depth m and the MNRU-parameter Q are related by Q = −20 log10 m.

The MNRU modulates the input speech signal s(t) according to x(t) = s(t) · (1 + m · n(t)), where n(t) is a white noise with unity variance and m is the modulation depth. For the narrow-band MNRU, the modulated signal x(t) is subsequently band-pass filtered to telephone bandwidth from 100 to 3400 Hz.
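The modulation equation above can be sketched as follows. This is a minimal illustration under stated assumptions, not a standards-conformant MNRU: the noise is drawn from Python's Gaussian generator with unit variance, the subsequent 100 to 3400 Hz band-pass filter is omitted, and the names `q_to_depth` and `mnru` are chosen here for illustration.

```python
import random


def q_to_depth(q_db):
    """Modulation depth m for an MNRU parameter Q, from Q = -20*log10(m)."""
    return 10.0 ** (-q_db / 20.0)


def mnru(speech, q_db, rng=None):
    """Apply x[n] = s[n] * (1 + m * n[n]) to a list of speech samples.

    n[n] is white Gaussian noise with unit variance (an assumption of
    this sketch). The telephone band-pass filter is not applied here.
    """
    if rng is None:
        rng = random.Random(0)  # fixed seed so the sketch is repeatable
    m = q_to_depth(q_db)
    return [s * (1.0 + m * rng.gauss(0.0, 1.0)) for s in speech]
```

Because the noise multiplies the speech signal, the distortion vanishes in speech pauses and grows with the signal amplitude, which is the property that distinguishes modulated noise from additive noise.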

4.2.3 Experiment 1: Quality assessment of isolated words

The 20 sentences uttered by the different speakers were (hand-)segmented into their single word constituents. This led to a total number of 102 words, i.e., 26 words for speaker f1, 27 for f2, 23 for m1, and 26 for m2. The duration of the words varied from 0.135 s for the very short monosyllabic word “die” to 0.911 s for the four-syllabic word “gerettet”. Despite the large variance in the duration of the word stimuli, it was decided not to partition the sentences into segments of equal length, e.g., syllables. This would otherwise have led to partly unintelligible segments, at least if the overall quality was poor. Such a percept of a degraded intelligibility would dominate the assessment of transmission quality. This dominance effect was observed during the comparison of the subjective quality and intelligibility of speech processed by different hearing aid algorithms (Preminger and Van Tasell, 1995). There, speech quality was highly correlated with speech intelligibility as long as the intelligibility of different test stimuli was not high and constant.

For the generation of stimuli with degraded quality, the Modulated Noise Reference Unit (MNRU) (CCITT, 1989) was applied. Each word of each speaker was processed by the MNRU at seven different levels of the (negative, logarithmic) modulation depth parameter Q = (−1) · 20 log10 m = 0, 5, 10, 15, 20, 25, 35 dB. Note that the standardized MNRU-parameter Q is positive for modulation depths m < 1. The procedure yielded 714 words in total at various levels of speech quality degradation which were expected to range from near reference quality (Q = 35 dB) to very poor quality (Q = 0 dB).
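As a quick consistency check of the numbers above, the following sketch lists the modulation depth m that corresponds to each level of Q and counts the resulting stimuli. It uses only values stated in the text; the variable names are illustrative.

```python
# MNRU levels used for the isolated words, in dB.
q_levels_db = [0, 5, 10, 15, 20, 25, 35]

# Invert Q = -20*log10(m): larger Q means smaller modulation depth.
depths = {q: 10.0 ** (-q / 20.0) for q in q_levels_db}

# Number of segmented words per speaker, as given in the text.
words_per_speaker = {"f1": 26, "f2": 27, "m1": 23, "m2": 26}
n_words = sum(words_per_speaker.values())
n_stimuli = n_words * len(q_levels_db)
```

With these values, Q = 0 dB corresponds to m = 1 and every positive Q to a depth m < 1; the counts come out as 102 words and 714 stimuli.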

All words that belonged to one of the four speakers were arranged into a separate list of words. In one run, all words from one list were presented to the subject in random order. The task was to give a rating on a five-point Absolute Category Rating (ACR) scale, which is recommended by the ITU (ITU-T, 1996d). The categorical adjectives were “ausgezeichnet”, “gut”, “ordentlich”, “mäßig”, and “schlecht” ². The subjects had to select one of these categories, which were presented on a monitor in front of them.

All subjects were familiarized with the setup and had read a written explanation. Subsequently, they performed one run with each of the four lists. All runs of one subject took place on the same day, with pauses between different runs. The whole experiment was computer-controlled and self-paced by the subjects.

² These adjectives have been standardized for many different languages and are recommended by the ITU. The equivalent English translations are “excellent”, “good”, “fair”, “poor”, and “bad”.


4.2.4 Experiment 2: Continuous assessment of time-varying speech quality

All 20 sentences were concatenated into one long utterance. This resulted in a long speech flow with short pauses in between. The durations of the pauses ranged from 0.1 s to 0.39 s. The different sentences did not relate to each other with respect to their linguistic content. Two different orders of concatenation were used to produce the source stimuli named “A” and “B”. For source stimulus A the order of the speakers was f1-f2-m1-m2, while for B it was m1-m2-f1-f2, i.e., B was cyclically shifted relative to A by half its length.

A time-varying speech quality was generated by means of an MNRU that was modified to allow for a time-varying modulation depth parameter Q. Two different target quality profiles, Q1(t) and Q2(t), were generated by defining the MNRU parameter Q over time. The target profiles Q1 and Q2 are shown in Fig. 4.2.

[Figure 4.2: two panels, “profile 1” (Q1) and “profile 2” (Q2), each plotting Q [dB] (0 to 40) against time [s] (0 to 40).]

Figure 4.2: Target quality profiles Q1(t) and Q2(t) that were used to control the time-varying quality introduced by the modified MNRU. Increasing Q indicates increasing speech quality.


The target quality profiles were designed according to the results from experiment 1 and from results of pre-investigations, which showed that untrained subjects had difficulties in assessing a constantly changing quality. Therefore, the profiles Qi(t) were chosen to have constant sections, considerable instantaneous changes, and also linear decreases or increases over time intervals of 2-3 s. The target values for the different sections of constant Q were chosen at approximately equally spaced levels on the expected perceptual quality scale. Four different test stimuli were generated from the combination of two source stimuli and two modulation depth profiles.

In one run, two repetitions of each of the four test stimuli were presented to the subjects in random order. Subjects were instructed to assess the perceived instantaneous speech quality by positioning a slider as closely as possible on a continuous linear scale. The scale was labeled with the same five categories as in experiment 1, i.e., with the German equivalents of “excellent”, “good”, “fair”, “poor”, and “bad”. The slider is depicted schematically in Fig. 4.3.

[Figure 4.3: slider scale labeled from “schlecht” (lowest) through “mäßig”, “ordentlich”, and “gut” to “ausgezeichnet” (highest).]

Figure 4.3: Schematic representation of the slider used to record the subjective continuous quality rating.

In the lowest position, the marker at the midpoint of the slider points to the lowest of the horizontal lines, and correspondingly for the highest slider position. The position r(t) of the slider on the scale of length l = 100 mm was sampled at a rate of 8 kHz and subsequently downsampled to 8 Hz. These downsampled data are denoted by rj(t) for subject j, or by rij = rj(i∆t).


4.2.5 Test subjects

All test subjects in the experiments were unpaid volunteers. They had normal hearing, as defined by a hearing loss < 15 dB at all standard audiological frequencies, and no history of hearing problems. Five male subjects aged between 25 and 42 years participated in experiment 1. Two were native German speakers and three were non-native German speakers. All subjects were experts in speech processing technology, but were not familiar with this kind of experiment or with the test stimuli. One female and ten male subjects aged between 25 and 30 years participated in experiment 2. Two were non-native German speakers and nine were native speakers. All these subjects had experience in psychoacoustic listening tests, but only two were experts in speech processing technology. Two of the subjects also participated in experiment 1.

4.3 Results

4.3.1 Experiment 1: Quality assessment of isolated words

Most subjects reported that they could not fully understand a few of the very short test words in conditions of severely degraded quality. In these cases they assigned the worst category. These data were not excluded from further analysis, although intelligibility was obviously assessed rather than speech quality. However, the response behaviour can be regarded as consistent.

The inter-individual variability proved to be very small in comparison to the variability across words. For each of the four lists of words of one speaker, the quality ratings given to the test stimuli processed at the same level of Q were therefore averaged across subjects and across all words in the list. In addition, the average across all speakers, all words, and all subjects was calculated. The results are shown in Fig. 4.4 for the individual speakers and in Fig. 4.5 for the overall average (filled symbols). Fig. 4.5 also shows the subjective data for the assessment with the standard MOS method with sentence pairs (open squares), taken from the ETSI Halfrate Selection Test data base (ETSI, 1992). They result from 24 subjects and 4 different speakers for each level of Q.

The data for isolated words differ slightly between the four speakers. For speaker m1, the standard deviations are somewhat larger than for the other three speakers. For speaker f2, a non-monotonic result is observed due to a higher quality rating at Q = 20 dB than at Q = 25 dB. For speaker m1, a somewhat lower rating is observed at Q = 10 dB compared to the other three speakers. However, the MOS data for the stimuli uttered by the different speakers do not show any systematic dependency on the speaker. It is therefore reasonable to average across the different speakers. The standard deviations of the overall MOS data for isolated words range from 0.11 to 0.32 MOS units. This is in the same range as the standard deviation of the MOS in the well-established relation for standard sentence pairs, which ranges from 0.06 to 0.35 MOS units for the data given in Fig. 4.5.

[Figure 4.4: four panels, “speaker f1”, “speaker f2”, “speaker m1”, and “speaker m2”, each plotting quality [MOS] (1 to 5) against Q [dB] (0 to 40).]

Figure 4.4: Mean quality ratings obtained for isolated words as a function of the MNRU parameter Q. Each panel shows the average rating across all words and all subjects for the test stimuli of one speaker (mean value and standard deviation).

[Figure 4.5: “Average across all speakers”. Horizontal axis: Q [dB] (0 to 50); left vertical axis: quality [MOS] (1 to 5); right vertical axis: slider position [mm] (10 to 90). Legend: isolated words, sentence pairs, polynomial fit.]

Figure 4.5: Average quality rating for isolated words averaged across all speakers. For comparison, the MOS obtained with standard sentence pairs are shown, taken from the ETSI Halfrate Selection Test data base (ETSI, 1992). The dashed line shows the second-order polynomial fit used to transform Q to the slider position in the subsequent experiment.

4.3.2 Experiment 2: Continuous assessment of time-varying speech quality

The subjects had no difficulties with the continuous quality rating task, except for their reported inability to reduce a certain delay between the auditory percept and their slider movement.

In order to quantify this delay and to correlate the recorded slider position r(t) with the corresponding time-varying “target” quality profile function Qi(t), Qi(t) was transformed to a “target” slider profile function Ri(t). To do so, a second-order polynomial fit to the Q-MOS data obtained in experiment 1 with isolated words was employed ³ (cf. Fig. 4.5).
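The two-step mapping from Q to the target slider position (a second-order polynomial from Q to MOS, followed by the linear MOS-to-millimetre rule of footnote 3) can be sketched as follows. The function names are hypothetical, and the polynomial coefficients are not stated in this section, so they are left as a parameter that the caller must supply.

```python
def mos_to_slider_mm(mos):
    """Linear map from footnote 3: R = (-10 + 20 * MOS) mm."""
    return -10.0 + 20.0 * mos


def q_to_slider_mm(q_db, coeffs):
    """Map Q [dB] to a target slider position in mm.

    coeffs = (a0, a1, a2) are the coefficients of the second-order
    polynomial MOS = a0 + a1 * Q + a2 * Q**2. Their fitted values are not
    given in the text and must be provided by the caller.
    """
    a0, a1, a2 = coeffs
    mos = a0 + a1 * q_db + a2 * q_db ** 2
    return mos_to_slider_mm(mos)
```

With this rule the MOS range from 1 to 5 covers slider positions from 10 mm to 90 mm, and a MOS of 4.5 corresponds to 80 mm.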

For each individual subject, the slider response data to all stimuli that were MNRU-processed with the same target profile were averaged and cross-correlated with each of the two transformed “target” slider profile functions Ri(t). The cross correlation functions are shown in Fig. 4.6 for profile R1 and in Fig. 4.7 for profile R2. The position of the maximum was found at times ranging from 0.78 s to 1.09 s for profile R1 and from 0.90 s to 1.22 s for R2.
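The delay estimate from the position of the cross correlation maximum can be illustrated with a short sketch. It assumes a mean-removed, normalized correlation (Pearson coefficient per lag) evaluated on the overlapping part of the two sequences; the exact normalization used in the thesis is not stated here, and the function names are hypothetical.

```python
import math


def normalized_xcorr(target, response, lag):
    """Pearson correlation between target[i] and response[i + lag]."""
    pairs = [(target[i], response[i + lag])
             for i in range(len(target))
             if 0 <= i + lag < len(response)]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    if sxx == 0.0 or syy == 0.0:
        return 0.0  # correlation is undefined for a constant sequence
    return sxy / math.sqrt(sxx * syy)


def estimate_delay_s(target, response, fs_hz, max_lag_s):
    """Lag (in seconds) at which the response best matches the target."""
    max_lag = int(round(max_lag_s * fs_hz))
    best_lag = max(range(-max_lag, max_lag + 1),
                   key=lambda lag: normalized_xcorr(target, response, lag))
    return best_lag / fs_hz
```

A positive result means that the response lags behind the target. At the 8 Hz sampling rate of the slider data, the lag resolution of this sketch is 0.125 s.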

[Figure 4.6: “cross correlation functions, profile 1”, plotted against time [s] (−15 to 15); vertical axis from 0.4 to 1.]

Figure 4.6: Cross correlation function between averaged response data and the corresponding transformed “target” slider position R1(t) (derived by transforming the target quality profile Q1(t)), for the 11 different subjects.

[Figure 4.7: “cross correlation functions, profile 2”, plotted against time [s] (−15 to 15); vertical axis from 0.4 to 1.]

Figure 4.7: Cross correlation function between averaged response data and the corresponding “target” slider position R2(t), for the 11 different subjects.

³ The MOS data were then linearly transformed to the slider position R by defining: R = (−10 + 20 · MOS) mm.

The small spread of the maxima of the cross correlation functions across subjects justifies averaging the response data from all subjects. Two different averaging methods were used. In the direct averaging method, all data in response to the same test stimulus were averaged without any correction for different use of the response scale by the subjects. To account for this possible difference in the response data, in the second averaging method the data of individual subjects were linearly transformed prior to averaging. The linear transformation was performed in order to equalize the mean and standard deviation of the recorded response data across subjects ⁴.

The results for the z-score-transformation averaging are shown in Fig. 4.8. In the four panels, r(t) averaged across different subjects and repetitions is displayed with a thin dotted line for the four combinations of input stimulus and target quality profile. The grey-shaded area surrounding the average data indicates the standard deviation. The target slider profiles R1(t) and R2(t) are depicted with a solid continuous line. The averaged subjective response data follow the target quality very closely with a delay of about 1 s. The subjects are obviously able to distinguish the different constant quality levels within the target profiles.

[Figure 4.8: four panels, “transformed averages: stimulus A, profile 1”, “stimulus B, profile 1”, “stimulus A, profile 2”, and “stimulus B, profile 2”, each plotting slider position [mm] (0 to 100) against time [s] (0 to 40). Legend: subj., target.]

Figure 4.8: Averaged transformed response data from 11 subjects for the four different test stimuli. The thin dotted line shows the mean and the grey-shaded area indicates the standard deviation of the subjective responses.

Note that the first four samples of the response data, beginning at t = 0 s, have values close to r ≈ 0 mm. This is an artifact of the interpolation in the downsampling process from 8 kHz to 8 Hz sampling frequency. Consequently, the initial standard deviation is too small. Additionally, a larger standard deviation in the response data is observed at the beginning than at the end of the stimulus. For profile 2, a deviation of the mean response from the target slider position R2(t) is also observed at the beginning of the stimulus. This can be explained by the fact that the subjects were not instructed to move the slider back to a specified starting position before the presentation of the next stimulus began.

⁴ Let $r_{ij}$ denote the sampled response data of subject and/or repetition $j$, $j = 1 \ldots M$, at time $t = i\Delta t$, $i = 1 \ldots N$. For all subjects and repetitions, the mean across time $\mu_{r_j} = \frac{1}{N} \sum_{i=1}^{N} r_{ij}$ and the standard deviation $\sigma_{r_j} = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} (r_{ij} - \mu_{r_j})^2}$ are calculated. From the $\mu_{r_j}$ and $\sigma_{r_j}$, the pooled mean and standard deviation $\mu_y = \frac{1}{M} \sum_{j=1}^{M} \mu_{r_j}$ and $\sigma_y = \frac{1}{M} \sum_{j=1}^{M} \sigma_{r_j}$ are calculated, and finally the transformation is performed individually for each subject/repetition according to $y_{ij} = \frac{\sigma_y}{\sigma_{r_j}} (r_{ij} - \mu_{r_j}) + \mu_y$. In the literature this transformation is known under the name “z-score-transformation” (Chatfield, 1983). The term should not be confused with the z-transformation from signal processing.
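The z-score transformation of footnote 4 and the subsequent averaging across subjects can be written out as a small sketch. The function names are hypothetical, and the data layout (one list of time samples per subject or repetition) is an assumption made for illustration.

```python
import math


def zscore_transform(responses):
    """Equalize mean and standard deviation across subjects/repetitions.

    responses[j][i] is the slider position r_ij of subject/repetition j at
    time index i. Returns y_ij = (sigma_y / sigma_rj) * (r_ij - mu_rj) + mu_y.
    A response that is constant over time has sigma_rj = 0 and cannot be
    transformed; this sketch does not handle that case.
    """
    means = [sum(r) / len(r) for r in responses]
    stds = [math.sqrt(sum((x - mu) ** 2 for x in r) / (len(r) - 1))
            for r, mu in zip(responses, means)]
    mu_y = sum(means) / len(means)      # pooled mean
    sigma_y = sum(stds) / len(stds)     # pooled standard deviation
    return [[sigma_y / s * (x - mu) + mu_y for x in r]
            for r, mu, s in zip(responses, means, stds)]


def average_across_subjects(responses):
    """Mean over subjects/repetitions at each time index."""
    return [sum(column) / len(responses) for column in zip(*responses)]
```

Direct averaging corresponds to calling `average_across_subjects` on the raw data; transformation averaging corresponds to calling it on the output of `zscore_transform`. Two subjects who follow the same profile but use different portions of the scale are mapped onto identical curves by the transformation.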

A small difference in the data is encountered between the source stimuli A and B when processed with the same target quality profile (cf. Fig. 4.9). One reason for this deviation might be the fact that at one point in time a female speaker was present in stimulus A while a male speaker was present in stimulus B, and vice versa. This deviation, however, is negligible in comparison to the intra-individual and inter-individual variability of the subjects’ responses (shaded area in Fig. 4.8). A result similar to that shown for the direct averaging method in Fig. 4.9 is obtained for the transformation averaging method.

[Figure 4.9: two panels, “profile 1: direct response averaging” and “profile 2: direct response averaging”, each plotting slider position [mm] (0 to 100) against time [s] (0 to 40). Legend: A, B.]

Figure 4.9: Responses to source stimuli A (solid line) versus B (dashed line), directly averaged across subjects for each of the two target profiles.

Fig. 4.10 gives the standard deviations resulting from the two averaging methods (direct averaging and z-score-transformed averaging of the subjects’ responses). The standard deviation of the response data is smaller or at most equal for the transformation averaging compared to direct averaging. This rank order inverts only occasionally and only for very short intervals. The relative difference of the standard deviations is on average in the range of 25-50 %. An F-test was performed on the data across all subjects, repetitions, and across time for each of the four conditions (stimulus A-profile 1, B-1, A-2, and B-2). The difference in the standard deviation is not significant for the data in response to the combination of stimulus A and target profile 1, while for the other three combinations, the difference is highly significant ⁵.

If the response data are additionally averaged across the two stimuli, the difference of the standard deviation between direct averaging and transformation averaging is highly significant for both target profiles.

4.4 Discussion

4.4.1 Quality scores based on isolated words

Experiment 1 was designed to investigate whether subjects are in principle able to assess an instantaneous speech quality on the basis of short time intervals. This step was necessary in order to understand the response behaviour of the subjects in experiment 2, where the time-varying speech quality assessment method was investigated.

Hamberg and de Ridder (1995) introduced the continuous assessment method to video quality assessment. They degraded the quality of an image over time by blurring the picture with a time-varying filter, i.e., the scene of the picture remained the same, while the perceived quality changed with time. They found that subjects were able to monitor the perceived quality almost instantaneously (with respect to the data sampling rate of 1 Hz).

Due to the different nature and dimensionality of visual versus auditory stimuli, it is sometimes but not always possible to find equivalent paradigms for visual vs. auditory stimulation.

Experiment 1 was designed in order to investigate a comparable ability of subjects in the auditory domain: The concept of a still picture was transferred to the presentation of isolated short words. It was decided not to repeat the same word periodically. Instead of a time-varying image quality of a non-moving image, the isolated, short words were presented at different quality levels during one run of the experiment.

⁵ F-test: A-1: F = 1.021, P(F < f) = 0.196; B-1: F = 1.098, P(F < f) = 4 · 10^−5; A-2: F = 1.152, P(F < f) < 10^−6; B-2: F = 1.111, P(F < f) < 10^−6.


[Figure 4.10: four panels, “direct vs. transformed averaging”, for stimulus A, profile 1; stimulus B, profile 1; stimulus A, profile 2; and stimulus B, profile 2, each plotting standard deviation [mm] (0 to 30) against time [s] (0 to 40). Legend: lin. transf., direct.]

Figure 4.10: Standard deviations across 11 subjects resulting from the direct averaging (dashed line) and transformation averaging (solid line) of the response data.


The consistent results of experiment 1 indicate that subjects indeed are able to form a short-time quality percept on the basis of short isolated word stimuli. The instantaneous quality was found to be independent of the gender of the speaker and of the word length. However, there was a tendency for very short words at the very low quality end to be primarily assessed on the basis of the speech intelligibility, which in this case was consistent with poor speech quality.

It is noteworthy that the MOS results for isolated words were somewhat higher than the MOS obtained with the standard sentences as test stimuli for Q < 20 dB. For Q > 20 dB this tendency was reversed: With isolated short words the highest quality was rated at about 4 MOS units, while for sentences, the highest subjective quality rating reached 4.5 MOS (cf. Fig. 4.5). Both for the isolated words and for sentences, the transition from monotonically increasing MOS with increasing Q to the asymptotic MOS was found at a value of Q of about 30 dB. Both experiments showed the same standard deviation across subjects at different levels of Q.

This deviation in slope between single speech elements (e.g., words) and sentences is also found for the discrimination function in speech intelligibility tests at a less favourable SNR than the one employed here. The steeper transition of the sentences is probably due to a higher number of independent speech elements and the redundancy that reduces the variability connected to the quality assessment of a single word.

In summary, the results of experiment 1 support the hypothesis that subjects can reliably assess the short-term quality impression connected to brief auditory speech events. The results are in line with equivalent findings in the visual domain, and a high correspondence to MOS for sentence test material is found.

4.4.2 Continuous scaling method

Experiment 2 was a straightforward extension of experiment 1: it investigated continuous time-varying speech quality perception instead of short-time quality. It corresponded closely to the video quality experiments by de Ridder and Hamberg (1997) and Hamberg and de Ridder (1997).

All words used in experiment 1 were the constituents of the 20 sentences employed in experiment 2. The subjects’ task of assessing 40 s of ongoing speech closely resembled the listening-only situation of a realistic conversation. The quality degradation introduced in the experiment was chosen in order to provide parts with constant quality, abrupt changes, and linear decreases and increases of the MNRU quality parameter Q over time. The amount of quality variation over time was probably unrealistic, i.e., an exaggerated amount of change in a relatively short duration of 40 s. However, the results of Hamberg and de Ridder provide evidence that subjects might be able to assess time-varying speech quality even when less variability in quality is employed over a much longer duration of the stimulus material. In one experiment (de Ridder and Hamberg, 1997) they used a 50 s video section with a comparable amount of quality variability to that in experiment 2, while in another experiment (Hamberg and de Ridder, 1997) they presented material with a lower degree of variability in quality for a 300 s stimulus duration. Subjects were equally able to assess the time-varying quality in the two experiments. The delay between target slider position and subjective response corresponds very well to the results of de Ridder and Hamberg (1997), who also observed a delay of approximately 1 s with only a small deviation across subjects. However, in continuous loudness judgement experiments, Weber (1991), for example, found a larger spread of this delay across subjects, ranging from 0.5 s to 1.5 s, though a delay of up to 3 s was observed in some subjects.

The results of experiment 2 show that subjects can assess time-varying speech quality continuously with a high degree of reliability and reproducibility. It is remarkable that all subjects seem to make use of the response scale in a reproducible way. This is indicated by the low intra-individual and inter-individual deviation observed in the directly averaged data as shown in Fig. 4.9. This figure shows that subjects respond to different source stimuli processed with the same time-varying quality profile in an almost indistinguishable manner. The deviation between the directly averaged responses to stimuli A and B is much smaller than the standard deviation across subjects for either of the stimuli. The fact that this result is found without any individual linear response scale transformation strongly supports the applicability of the new method for time-varying speech quality assessment. Fig. 4.9 also shows that the averaged responses have an upper bound at about 80 mm on the quality scale. Subjects seem to take this 80 mm position as an anchor point corresponding to the highest perceived quality. It is interesting to note that exactly the same observation was made by Hamberg and de Ridder (1995, 1997) in several of their experiments. This may indicate a common strategy involved in the continuous assessment of visual or auditory quality perception. The value of 80 mm also corresponds well to the linearly transformed quality of 4.5 on the MOS scale, which is the highest rating typically found in standard quality assessment experiments with sentence pairs.


Despite the observation of the highly reproducible use of the response scale by the subjects, even without any linear correction, the transformation averaging method should be applied. As can be expected, the individual linear scale transformation prior to averaging reduces any individual bias or incomplete usage of the response scale. In our experiments, the standard deviation across subjects was reduced by 25 to 50 % on average, compared to the direct averaging method (cf. Fig. 4.10). This holds especially when untrained subjects participate in the experiments. The individual scale transformation enhances the consistency among subjects and thus limits the potential influence of poorly performing subjects.

4.5 Modeling continuous scaling responses by an objective speech quality measure

The results so far indicate that the continuous assessment of time-varying speech transmission quality appears to be a well-defined psychoacoustical task. Therefore, an attempt was made to model the subjects’ performance in this task based on a psychoacoustical preprocessing model.

In previous studies (cf. chapters 2 and 3 of this thesis) an objective speech quality measure based on a model of the “effective” signal processing of the ear was introduced and tested. Its application to speech quality prediction closely follows the design of the underlying experiments:

The speech quality measure qC for stationary transmission conditions (cf. chapter 2) is based on the cross correlation (or mean squared difference) of the internal representation of the test signal and the reference signal, averaged across the whole utterance of two or more sentences. However, for the application to objective speech quality measurement, the internal representation is sampled with a period of 20 ms. These samples can be considered as corresponding to the temporal states of excitation in response to an input signal. It is therefore straightforward to employ this auditory processing to model the continuous speech quality as assessed by the subjects in the experiment described in the previous sections.

Note that, in doing so, we did not aim to model the behavioural aspect of the transformation of the quality percept to a motor action, or the delay that is connected with it. Only a transformation of the objective quality measure to the rating scale (with the mm division) is considered here. It can simply be modeled by a fitted (nonlinear) scale transformation. Hence, the subjective averaged response data will show a certain delay relative to the modeled objective instantaneous speech quality measure. This delay is expected to be of the same order as the one observed between the target slider profile and the subjective slider response, i.e., approximately 1 s.

For the objective modeling of the continuous speech quality, the procedure is the same as described in chapter 2, section 2.3. The source stimuli A and B were taken as reference signals; the MNRU-processed signals with time-varying speech quality according to the target profiles Q1 and Q2 were taken as test signals. Reference and test signal were transformed to their internal representation exactly as described in chapter 2, section 2.3.3, up to and including the last stage, where downsampling of the internal representation takes place by averaging consecutive non-overlapping frames across a duration τ = 20 ms.

In order to investigate the influence of this parameter τ on the performance of the continuous speech quality measure, the internal representation was averaged over frames of 20 ms duration (as before), and, additionally, over 100 ms.
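The downsampling step, averaging consecutive non-overlapping frames of duration τ, can be illustrated for a single channel of the internal representation. This is a sketch under assumptions: the function name is hypothetical, the sampling rate of the representation before averaging is not given in this section and is therefore a parameter, and an incomplete final frame is simply dropped.

```python
def frame_average(samples, fs_hz, tau_s):
    """Average consecutive non-overlapping frames of duration tau_s.

    samples: one channel of the internal representation, sampled at fs_hz.
    Returns one value per complete frame; a trailing incomplete frame is
    discarded (an assumption of this sketch).
    """
    frame_len = int(round(tau_s * fs_hz))
    if frame_len < 1:
        raise ValueError("frame duration is shorter than one sample")
    n_frames = len(samples) // frame_len
    return [sum(samples[k * frame_len:(k + 1) * frame_len]) / frame_len
            for k in range(n_frames)]
```

Calling the function with `tau_s=0.02` and with `tau_s=0.1` on the same channel gives the two frame durations compared in the text; the longer frame yields one fifth as many output values.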

The formal description of the resulting time-dependent speech quality measures qC(t) (correlation measure) and qS(t) (distance measure) is given in Appendix C. Both time-dependent measures, qS(l) and qC(l), were mapped to the r(t) scale by means of a simple exponential transformation (cf. Appendix C) and subsequently low-pass filtered at 0.5 Hz. These transformed speech quality measures are denoted rqC(t) and rqS(t), respectively.

The results of the time-dependent quality prediction by the objective measure rqC(t) for the frame duration τ = 20 ms are shown in Fig. 4.11. The directly averaged subjective response data (solid line), the model prediction by rqC(t) (dashed line), and the target slider profile (thin dashed grey line) are plotted for each of the four different stimulus conditions. The delay in the subjective response data of approximately 1 s relative to the target profile has been compensated for in this figure in order to facilitate a comparison. The results for rqS(t) for frame duration τ = 20 ms are shown in Fig. 4.12 in the same style.

Regarding the choice of the parameter τ, no significant difference is found for the performance of either rqC(t) or rqS(t) in the prediction of the subjective speech quality data. The model predictions by rqC(t) and rqS(t) for a frame duration of τ = 100 ms are shown in Appendix C in Fig. C.1 and Fig. C.2, respectively.

A general comparison of the figures (4.11 and 4.12) shows that the speech quality prediction by rqC(t) and rqS(t) coincides highly with the subjective response data and the target slider profile. After shifting the subjective response data by the average delay of 1 s, the model predictions, the target slider profile, and the averaged subjective response data match each other accurately, especially at temporal positions of large jumps of the speech quality, e.g., at t ≈ 2, 5, 8, 12, 15, 26, and 38 s for profile 1 and at t ≈ 3, 11, 24, 32, 36, and 38 s for profile 2. A further inspection of the data shows that the deviation of the different model predictions from the subjective response data is smaller than the inter-individual standard deviation across subjects for all stimulus conditions most of the time.

[Figure 4.11, four panels: stimulus A, profile 1; stimulus B, profile 1; stimulus A, profile 2; stimulus B, profile 2 (all tau = 20 ms). Axes: slider position [mm], 0 to 100, versus time [s], 0 to 40. Curves: subj., rqC(t), target.]

Figure 4.11: Continuous-time speech quality prediction by the correlation speech quality measure rqC(t) (dashed line) for a frame duration of 20 ms. The averaged subjective data (solid line) and the target slider profile (thin dotted line) are plotted for comparison.

[Figure 4.12, four panels: stimulus A, profile 1; stimulus B, profile 1; stimulus A, profile 2; stimulus B, profile 2 (all tau = 20 ms). Axes: slider position [mm], 0 to 100, versus time [s], 0 to 40. Curves: subj., rqS(t), target.]

Figure 4.12: Continuous-time speech quality prediction by the mean squared difference speech quality measure rqS(t) (dashed line) for a frame duration of 20 ms. The averaged subjective data (solid line) and the target slider profile (thin dotted line) are plotted for comparison.

The rippled structure observed in the rqC(t) and rqS(t) data does not result from the frame length τ used for temporal averaging. The MNRU that was used for the generation of the test stimuli introduces speech quality degradations that are comodulated with the input signals’ energy. The signal degradations that are detected by the objective measures are therefore also comodulated with the speech activity. This is reflected in the fluctuations of rqC(t) and rqS(t).

At intervals of low speech quality (i.e., r(t) < 20 mm) within a target profile, the absolute values of the model prediction, target profile, and the averaged subjective response match very closely. During intervals of high speech quality (i.e., r(t) > 60 mm) a slight deviation sometimes occurs between the predicted speech quality and either the averaged subjective speech quality or the target profile. This effect depends on the choice of the transformation from qC(t) or qS(t) to the mm-scale of r(t). The simple exponential transformation employed here (cf. Appendix C) might not be optimal for the purpose of transforming the complete range of qC (or qS) into the corresponding target slider positions. A two-parameter exponential fit results in smaller deviations between the transformed objective speech quality measure and the target value of the slider position.

For profile 2 a larger deviation of up to 20 mm between the correlation speech quality measure rqC(t) and the target profile R2(t) is observed in the intervals 8 s ≤ t ≤ 11 s and 27 s ≤ t ≤ 32 s. However, it is interesting to note that the averaged subjective response data deviate from the target profile at the same time, so that rqC(t) matches the subjective data very accurately. On the other hand, intervals are also observed where rqC(t) matches the target profile very accurately but deviates from the subjective response, e.g., at 2 s ≤ t ≤ 5 s for profile 1 and 1 s ≤ t ≤ 3 s for profile 2. In this case, the deviation occurs shortly after the beginning of the stimulus, which might be due to an uncertainty of the subjects’ responses. The opposite is found for the mean squared difference measure rqS(t) in the above-mentioned intervals: rqS(t) matches the subjective response at the beginning of both profiles, and it matches the target profile rather than the subjective response in the interval 27 s ≤ t ≤ 32 s of profile 2. This effect may partly be explained by the nonlinear transformation of the two measures to the slider scale, which might not be optimal (see above).


In summary, a highly satisfactory prediction of the time-varying speech quality is obtained from either of the measures rqC(t) and rqS(t). This supports the claim that the model can predict perceivable differences between the original and degraded signal that vary in time based on their “internal representations”.

4.6 Summary and conclusion

In this chapter, a method for continuously assessing time-varying speech quality was described and tested.

One experiment was concerned with the assessment of speech quality of isolated short speech elements and the other with the continuous assessment of time-varying speech quality of ongoing speech stimuli.

The results of the first experiment showed that subjects are able to assess the speech quality of short speech elements, i.e., words, consistently in isolation. The subjective quality rating found for words in this experiment was compared with well-established results from the literature on the quality of sentence pairs as test stimuli, and a good agreement was observed.

The second experiment showed that subjects are able to assess the time-varying quality of continuous speech stimuli accurately and reproducibly. The subjective results show a delay of approximately 1 s relative to the expected target quality profile. This observed delay was constant within ±0.15 s across all subjects and for all four different stimulus conditions, which demonstrates the consistency of the subjects’ results regarding their temporal behaviour.

An objective speech quality measure based on a psychoacoustic preprocessing model was applied to predict the subjective speech quality results. The speech quality measure qC (and qS) described in chapter 2 of this thesis was modified to allow for time-dependent objective measurement of the speech quality. The frame duration τ used for temporal averaging of qC(t) did not result in a significant difference in the performance of qC(t) in predicting the subjective data for τ = 20 ms and τ = 100 ms. The averaged subjective response data could be modeled by the scale-transformed measure qC(t) (and qS(t)) with a high degree of accuracy.

This provides indirect support for the hypothesis that the psychoacoustic processing model extracts the relevant information about the perceivable differences of the two input signals.

The new continuous assessment method provides the system developer with much more information about the performance/misperformance of an individual combination of coding and transmission devices. With continuous assessment it is possible to identify the point in time when an instantaneous quality degradation occurs, which makes it possible to track the reason for its origin. Also, the threshold for the strength or for the duration of a certain degradation might be measured more easily in this way than with standard MOS test methods. The continuous method can be expected to be much more sensitive to degradations that are of short duration.

In further studies on the continuous assessment method another aspect of quality assessment should be studied: It is often the case that the evaluator does not try to gain as much detailed information as possible but rather wants an answer to the question “Which system has the better performance with respect to a planned application?” This question leads to the investigation of the relation of the continuous scaling response r(t) to the “overall quality”, R, of the whole sequence. In recent studies on continuous video quality assessment Hamberg and de Ridder (1997) investigated the relationship between these two entities. They found that the perceived instantaneous quality degradations should be summed up by raising the degradations to the power p = 3 (instead of the typical p = 2 in squared difference measures), in combination with a decaying exponential weighting over time. In experiments on continuous loudness judgement a similar effect was observed by, e.g., Fastl (1991) and Weber (1991). The average overall loudness was found to be systematically larger than the arithmetic mean of the instantaneous loudness judgements, which meant that the loud parts of a noise were more strongly weighted in determining the overall loudness than the soft parts. A similar study should be carried out for the case of continuous speech quality assessment.
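The kind of integration described for the video studies can be sketched as a weighted power mean. The code below is a loose illustration of that idea, not the formula of Hamberg and de Ridder: the definition of the degradation as `r_max - r(t)`, the recency weighting, the time constant `tau_s`, and the example profile are all assumptions. Only the exponent p = 3 and the use of an exponentially decaying weight come from the text.

```python
import math

def overall_quality(r, sample_period_s, p=3.0, tau_s=20.0, r_max=100.0):
    """Combine instantaneous ratings r (mm, 0..r_max) into one overall value.
    Degradations (r_max - r) are raised to the power p and averaged with a
    weight that decays exponentially with the age of each sample."""
    n = len(r)
    weight_sum = 0.0
    weighted_power_sum = 0.0
    for i, value in enumerate(r):
        age_s = (n - 1 - i) * sample_period_s   # time before the end of the sequence
        w = math.exp(-age_s / tau_s)
        weight_sum += w
        weighted_power_sum += w * (r_max - value) ** p
    return r_max - (weighted_power_sum / weight_sum) ** (1.0 / p)

# Hypothetical 50 s profile sampled every 0.1 s: good quality with one 5 s dip.
profile = [90.0] * 300 + [20.0] * 50 + [90.0] * 150
overall = overall_quality(profile, sample_period_s=0.1)
arithmetic_mean = sum(profile) / len(profile)
```

With p = 3 the brief dip pulls the overall value well below the arithmetic mean of the profile, which mirrors the loudness findings quoted above, where the strong segments dominate the overall judgement.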


Chapter 5

Summary and Conclusion

In this thesis, a new objective speech quality measure is introduced and evaluated. The measure is based on a psychoacoustically validated auditory processing model that is employed to transform the speech test signal and a corresponding reference signal to the so-called internal representation.

The auditory processing was employed in previous studies (Dau et al., 1996a; 1996b) to model detection thresholds in a variety of simultaneous and non-simultaneous psychoacoustical masking experiments. These studies resulted in a set of optimal model parameters. For the application of the auditory processing within the objective speech quality measure, the same parameters are employed.

The objective speech quality measure qC is obtained by computing the overall correlation coefficient between the band-specifically weighted internal representation of the test signal and the reference signal. The weighting function is optimized with the aim of maximizing the correlation between the resulting qC and the subjective MOS data for four different test data bases of low-bit-rate coded speech signals. The optimal weighting function exhibits constant weights for center frequencies below 1 kHz and increasing weights for increasing frequencies, which strongly emphasizes the highest channels. With this weighting, excellent performance of the new speech quality measure is found in one test data base, and a generally high performance is found in the three other data bases.

Apart from the band-specific weighting of the internal representation, a considerable downsampling of the signal representation to a sampling period of 20 ms is introduced. The success of the model predictions indicates that a temporal resolution of 20 ms seems to be sufficient for the perception of speech quality. However, the downsampling can be considered as a minor alteration of the original processing compared to the effect of the weighting function. It is also shown that the parameter set optimized for psychoacoustical modeling leads, at the same time, also to the highest performance of the corresponding speech quality measure. The nonlinear adaptation stage within the auditory processing, including the feedback loops, appears to be the salient feature for the realistic modeling of the dynamic compression properties and temporal adaptation and masking effects in the auditory system.
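The behaviour of such an adaptation stage can be illustrated with a chain of divisive feedback loops. The sketch below follows the structure commonly described for this type of model (each loop divides its input by a low-pass filtered copy of its own output); it is not the implementation used in this thesis. The five time constants, the input floor, and the first-order low-pass are assumptions, labeled as such in the code.

```python
import math

def adaptation_loops(x, fs, taus=(0.005, 0.05, 0.129, 0.253, 0.5), floor=1e-5):
    """Pass a non-negative envelope x (sampled at fs Hz) through a chain of
    divisive feedback loops. Time constants (s) and floor are assumed values."""
    # Start every loop in its steady state for an input equal to the floor.
    states = [floor ** (1.0 / 2 ** (k + 1)) for k in range(len(taus))]
    coeffs = [math.exp(-1.0 / (fs * tau)) for tau in taus]
    out = []
    for sample in x:
        v = max(sample, floor)
        for k in range(len(taus)):
            v = v / states[k]                                   # divide by loop state
            states[k] = coeffs[k] * states[k] + (1.0 - coeffs[k]) * v  # low-pass feedback
        out.append(v)
    return out

# Hypothetical input: a unit step held for 20 s at a 1 kHz envelope rate.
step_response = adaptation_loops([1.0] * 20000, fs=1000.0)
onset_value = step_response[0]
steady_value = step_response[-1]
```

For a constant input each loop settles to the square root of its input, so five loops compress a stationary input I to I^(1/32), while a sudden onset passes almost linearly and produces a large overshoot. This combination of near-logarithmic compression for stationary signals and emphasis of fast changes is the property referred to above.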

The comparison of the new speech quality measure qC with the Perceptual Speech Quality Measure (PSQM) in the version standardized as P.861 by the ITU shows that both measures are on average comparable in their performance. P.861 exhibits a simpler signal processing algorithm but a more refined distance calculation algorithm than the new measure presented here.

The introduction of the band-specific weighting within qC leads to the assumption of a non-uniform spectral sensitivity in the perception of speech quality, which presumably is located beyond the peripheral signal transformation on the auditory pathway. This assumption is tested in two experiments. Two types of a parameterizable band-specific modulated noise distortion are introduced and applied to two reference speech stimuli.

Distortion thresholds are measured as a function of the center frequency of the modulated noise algorithms. The detection thresholds are modeled with only small deviations by assuming a constant value of qC at threshold when a constant weighting of the filter channels of the internal representation is employed, but not for a spectral weighting increasing with frequency. This finding is in line with the model simulations of Dau et al. (1996b): Using the same processing model, no weighting of individual bands of a multi-channel internal representation was necessary to model detection thresholds in a variety of masking experiments.

In a second experiment, pairwise speech quality preferences of the band-specific distortions are assessed for a set of test sentences at levels of the modulation depth that were equispaced relative to the respective detection thresholds. The subjective speech quality preference data do not show a systematic spectral dependency. They can be modeled by the respective differences ∆qC for the two sentences of a test pair, if a constant weighting is applied. Similarly to the modeling of the detection thresholds, ∆qC with increasing spectral weighting does not yield correct model predictions for the subjective quality preference data. Although a different spectral weighting is used in chapter 2 for predicting the Absolute Category MOS data, this does not constitute a contradiction: Obviously, the increasing weighting accounts for a perceptual/cognitive difference in the speech quality assessment of typical codecs versus MNRUs. The spectral energy at the highest frequencies is dominant in the MNRU-processed signals compared to the codec-processed signals. The band-specific distortions introduced by the broadband type of distortion have approximately the same spectral shape, which does not depend on the generating frequency band. Hence no frequency-dependent weighting function of the internal representation can be derived from the comparison between the subjective data and the prediction results. The narrowband distortion algorithm is not optimal for the investigation of the speech quality of low-bit-rate speech codecs because the narrowband distortions do not resemble those of the codecs. A different weighting across frequency should have emerged from the narrowband preference experiments because supra-threshold distortions at different spectral regions were compared with each other. The fact that a constant weighting is optimal to model the narrowband preference data shows that the increasing weighting for modeling of the Absolute Category ratings in chapter 2 accounts for a cognitive effect rather than for a property of the internal representation.

To extend the current speech quality assessment methods limited to stationary transmission conditions, a new method is introduced for continuously assessing the time-varying speech quality. The question is addressed whether subjects are able to assess the perceived time-varying quality of speech material continuously. Two experiments are concerned with the assessment of speech quality of isolated short speech elements and with the continuous assessment of time-varying speech quality of ongoing speech stimuli. Different sequences of sentences are degraded in quality by a modified Modulated Noise Reference Unit with a modulation depth that was varied with time. Subjects can monitor the speech quality variations very accurately by moving a slider along a graphical scale. The subjective response data exhibit a delay of approximately 1 s relative to the expected target slider position. Subjects perform the assessment task in a highly consistent way across subjects with respect to the delay and the use of the slider rating scale. The objective speech quality measure qC is modified to allow for a time-dependent quality prediction. The averaged subjective response data can be modeled by the scale-transformed measure qC(t) with a high degree of accuracy. The same delay of approximately 1 s is observed between the subjective assessment results and the time-dependent measure. A low-pass filter at 0.5 Hz should be applied to qC(t) in order to reduce its short-time variability. The frame durations τ = 20 ms and τ = 100 ms used for temporal averaging of the internal representation prior to the calculation of qC(t) do not result in a significant difference in the performance of qC(t) in predicting the subjective data.

The new continuous assessment method provides much more information for the telecommunication system developer about the performance/misperformance of an individual combination of coding and transmission devices. With continuous assessment it is possible to identify the cause and the point in time when an instantaneous quality degradation occurs. Also, the threshold for the strength or for the duration of a certain degradation might be measured much more easily in this way than with standard MOS test methods. The continuous method can be expected to be much more discriminative for degradations that are instantaneous or of short duration. Further studies on the continuous assessment method should investigate how the continuous scaling response is related to the “overall quality” as an integrative entity.

The present speech quality measure qC can also serve as a performance measure in the evaluation of a hardware implementation of the psychoacoustic processing model. A numerical analysis of the wordlengths within the nonlinear feedback loops of the processing model (containing a division element) is very difficult, because standard SNR considerations for quantization effects cannot be applied. The VLSI-chip implementation which is currently being developed can be considered as a modification of the processing model. It can be evaluated, in terms of the optimality of the resulting signal representation, by applying the processing algorithm to the speech quality prediction in the way described in this thesis. A precise model of the hardware algorithm with limited fixed-point precision was implemented and employed for the calculation of the internal representation within a modified version of the objective speech quality measure, qC^fix. It is used to calculate the objective speech quality for the ETSI Halfrate and the ITU 8kbit test data bases. In this way, the necessary quantization can be obtained by iteratively selecting the wordlengths of the internal number representation until a satisfactory performance of the resulting speech quality measure qC^fix is reached regarding the prediction of the speech quality data bases.

In conclusion, the results of this study show that the “perception model” serves as a general model for auditory signal detection and speech perception with a wide applicability. Effects at the threshold of detection can be modeled as well as phenomena of speech perception that are related to clearly audible supra-threshold signal properties, such as nonlinear distortions produced by digital telecommunication devices.


Appendix A

Reprint of “Prediction of Speech Quality based on Psychoacoustical Preprocessing Models”

This appendix contains an original reproduction of a paper that was presented in March 1996 at the “Workshop on Quality Assessment in Speech, Audio and Image Communication” in Darmstadt, Germany. It is reprinted on the following pages.

The paper contains the results of earlier investigations on the influence of different preprocessing models in the application to objective speech quality measurement (cf. 2.5.5).


Prediction of Speech Quality based on Psychoacoustical Preprocessing Models

M. Hansen, B. Kollmeier, AG Medizinische Physik, Uni Oldenburg, 26111 Oldenburg

Abstract

This study investigates the implementation of five different psychoacoustical preprocessing models for measuring the speech quality of low-bit-rate codecs.

The principal method used for measuring the speech quality is the same for each of these five preprocessing models: The preprocessing models are applied to transform the input and output signal of a speech coding device to a so-called “internal representation” of the sound. Differences in this internal representation of input and output signal are expected to correspond to a decreased speech quality of the output signal.

At present, the most successful objective speech quality prediction for the ETSI Halfrate Selection test was obtained by using a psychoacoustic preprocessing model which can also simulate psychoacoustical threshold data in various conditions.

1 Introduction

In the development of objective speech quality prediction, psychoacoustically motivated preprocessing models have gained increasing importance. The reason is that the “conventional” signal-to-noise ratio measures and their derivatives clearly fail to describe the transmission quality of nonlinear time-variant systems like low-bit-rate speech codecs in a satisfactory way.

The goal in objective speech quality measurement is to quantify the quality degradation of a speech sample relative to an undegraded reference situation. The application of psychoacoustical preprocessing models is motivated by the assumption that the signal is transformed to an “internal representation” of the sound that is reached after auditory preprocessing. This representation is accessible to higher neuronal stages of perception. It should contain the perceptually relevant features of the incoming sound. Differences in this “internal representation” of input and output signal are expected to correspond to perceivable differences of the two signals and thus indicate a decreased speech quality of the output signal.

Many alternatives have been proposed to incorporate elements of human perception [1, 2, 3, 4, 5, 6, 7, 8]. In most psychoacoustically motivated speech quality measures adjustable parameters can be used to maximize the correlation between the objective and subjective speech quality measure. However, this may result in difficulties in handling every arbitrary kind of signal degradation or in a restricted applicability of the preprocessing model for psychoacoustical purposes.

The aim of this study is to investigate the necessary properties of a “functional auditory preprocessing model” capable of measuring the speech quality of arbitrary speech coding systems. Therefore five psychoacoustical preprocessing models of different complexity have been applied to the measurement of the speech quality of low-bit-rate coded speech sounds. Here the name “preprocessing model” refers to an algorithm that transforms the input of the human auditory system to an internal representation, while the “method” used for calculating the objective speech quality measure is the same with each of the five preprocessing models. The method assumes that subjects are able to compare the quality of a test speech signal with that of an internally stored reference. From the frequency-weighted internal representations of the original and the distorted signal a physical measure of similarity is calculated as the objective speech quality measure.


2 Method for calculating speech quality

The input and output signal of the device under test are time-aligned and equalized to have the same overall rms value prior to the computation of the “internal representations” of both signals.

The first stage of all five preprocessing models then consists of a gammatone band-pass filterbank [9]. The signal is split up into 19 band-pass signals with center frequencies from 300 to 4000 Hz, equally distributed with 1 filter per ERB (equivalent rectangular bandwidth). The five preprocessing models differ in the subsequent nonlinear compression algorithm and are described in the following section. The general method for calculating the speech quality is depicted in figure 1.
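The placement of the filter center frequencies can be sketched as follows. The ERB-rate formula used here (21.4 log10(4.37 f/1000 + 1), after Glasberg and Moore) and the choice to place exactly 19 filters at equal ERB-rate steps between 300 and 4000 Hz are assumptions made for illustration; the paper does not state which ERB-rate formula was used.

```python
import math

def hz_to_erb_rate(f_hz):
    """Frequency in Hz to ERB-rate units (assumed Glasberg and Moore formula)."""
    return 21.4 * math.log10(4.37e-3 * f_hz + 1.0)

def erb_rate_to_hz(erb_rate):
    """Inverse of hz_to_erb_rate."""
    return (10.0 ** (erb_rate / 21.4) - 1.0) / 4.37e-3

def center_frequencies(f_low=300.0, f_high=4000.0, n_bands=19):
    """Center frequencies equally spaced on the ERB-rate scale."""
    e_low = hz_to_erb_rate(f_low)
    e_high = hz_to_erb_rate(f_high)
    step = (e_high - e_low) / (n_bands - 1)
    return [erb_rate_to_hz(e_low + k * step) for k in range(n_bands)]

cfs = center_frequencies()
erb_step = (hz_to_erb_rate(cfs[-1]) - hz_to_erb_rate(cfs[0])) / (len(cfs) - 1)
```

Under these assumptions the spacing comes out slightly above one ERB-rate unit per filter, which is consistent with the stated density of roughly one filter per ERB.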

[Figure 1, block diagram: the original signal and the distorted signal each pass through basilar-membrane filtering, a compression/adaptation unit, block averaging/low-pass filtering, and frequency band weighting, which yields the “internal representation”. The quality measure is the linear correlation coefficient, followed by a least-squares fit transformation to MOS. Optional stage: modulation filterbank. The compression/adaptation unit is one of:
1) BSD: absolute square, 20 ms average, power law x^0.23
2) logarithmic compression: half-wave rectification, logarithm, low-pass t = 20 ms, 20 ms average
3) hair cell model: hair cell model (Fischer & Verhey), 20 ms average
4) perception model: half-wave rectification, low-pass 1 kHz, adaptation (Pueschel), low-pass t = 20 ms, 20 ms average
5) modulation analysis: perception model (4), modulation filterbank, 20 ms average]

Figure 1: Speech quality calculation method (except for preprocessing model with modulation filterbank analysis)


In most speech quality measures in the literature that use a transformation of the input and output signal to a time-frequency representation, the actual speech quality measure is calculated by means of a subtraction of the two representations and an integration or averaging along the time and frequency axes.

In contrast to those methods, in this approach the quality measure Q is calculated as the linear correlation coefficient of corresponding samples of the two representations. The correlation coefficient is a measure for the similarity of the representations with a maximum value of 1.0 that corresponds to a transparent codec device.
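The calculation of Q can be sketched as a correlation over all time-frequency samples. This is a minimal illustration under stated assumptions: the internal representations are given as lists of frames with one value per band, the optional band weights are simple multiplicative factors, and the small example matrices are invented for demonstration.

```python
import math

def quality_measure(ref, test, band_weights=None):
    """Linear correlation coefficient between two internal representations.
    ref, test: lists of frames, each frame a list with one value per band.
    band_weights: optional per-band factors (assumed multiplicative)."""
    n_bands = len(ref[0])
    weights = band_weights if band_weights is not None else [1.0] * n_bands
    a = [weights[b] * frame[b] for frame in ref for b in range(n_bands)]
    c = [weights[b] * frame[b] for frame in test for b in range(n_bands)]
    n = len(a)
    mean_a, mean_c = sum(a) / n, sum(c) / n
    cov = sum((x - mean_a) * (y - mean_c) for x, y in zip(a, c))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_c = sum((y - mean_c) ** 2 for y in c)
    return cov / math.sqrt(var_a * var_c)

# Hypothetical internal representations: 3 frames, 3 bands.
reference = [[1.0, 2.0, 3.0], [2.0, 1.0, 4.0], [0.5, 3.0, 2.5]]
distorted = [[1.2, 1.8, 3.0], [2.0, 1.5, 3.5], [0.9, 3.0, 2.0]]
q_transparent = quality_measure(reference, reference)
q_distorted = quality_measure(reference, distorted)
```

An identical output signal gives Q = 1 up to rounding, and any distortion that is not a pure scaling or offset of the representation lowers Q.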

3 Implemented preprocessing models

In all five preprocessing models the stages subsequent to the initial band-pass filterbank model features of the amplitude compression that takes place in the hair cells and the auditory nerve fibers. In the following, the preprocessing models are referred to by capital letters as described below:

B: adapted Bark Spectral Distortion (BSD) [8]. This model calculates the mean energy in 20 ms frames. The energy is compressed by a power law y = x^0.23 in order to account for loudness perception of steady-state sounds.

L: instantaneous Logarithmic compression. The samples are half-wave rectified (with a threshold of 10^-5) and the logarithm of the instantaneous value is calculated, y = log(max(10^-5, x)). A low-pass filter at 8 Hz limits the temporal resolution. The average in 20 ms frames is calculated.

H: Hair cell population model. Verhey and Fischer's hair cell model [10] is used to calculate the firing rate of a hair cell population. The firing rate is averaged in 20 ms frames.

P: “Perception model”. This model has also been applied successfully to simulate psychoacoustical threshold data of a variety of experiments [11]. The samples are half-wave rectified and low-pass filtered at 1 kHz. Adaptation loops [12] model temporal masking effects and nonlinear amplitude compression. A low-pass filter at 8 Hz limits the temporal resolution of the output. The final stage is an averaging in 20 ms frames.

M: “Modulation analysis model”. This model is an extension of the perception model above, designed to simulate also modulation threshold data [13]. Instead of using the 8 Hz low-pass filter as before, the Hilbert envelope is calculated and analyzed in 10 bands by a modulation filterbank with center frequencies from 5 to 1000 Hz. The output is averaged in 20 ms frames.

Finally, subsequent to each preprocessing model a frequency band weighting is applied. The frequency bands are weighted individually according to the 40 phon isophone of the equal loudness functions, as described in ISO 226. The weighting can be considered as a part of the calculation of the quality measure Q.
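The two simplest compression stages listed above, models B and L, can be sketched for a single band-pass signal as follows. This is an illustration under stated assumptions: the logarithm is taken to base 10, the 8 Hz low-pass of model L is omitted for brevity, and the 1 kHz test tone at a sampling rate of 8 kHz is a hypothetical input.

```python
import math

def frame_means(values, frame_len):
    """Average consecutive non-overlapping frames of frame_len samples."""
    return [sum(values[i:i + frame_len]) / frame_len
            for i in range(0, len(values) - frame_len + 1, frame_len)]

def model_b(band_signal, fs, frame_s=0.02):
    """Model B: mean energy per 20 ms frame, then power law y = x^0.23."""
    frame_len = int(round(fs * frame_s))
    energy = frame_means([s * s for s in band_signal], frame_len)
    return [e ** 0.23 for e in energy]

def model_l(band_signal, fs, frame_s=0.02, floor=1e-5):
    """Model L: half-wave rectification with a floor, logarithm (base 10
    assumed), then 20 ms frame average. The 8 Hz low-pass is omitted here."""
    frame_len = int(round(fs * frame_s))
    logs = [math.log10(max(floor, s)) for s in band_signal]
    return frame_means(logs, frame_len)

# Hypothetical band signal: 0.1 s of a 1 kHz tone with unit amplitude at fs = 8 kHz.
fs = 8000
tone = [math.sin(2.0 * math.pi * 1000.0 * n / fs) for n in range(800)]
rep_b = model_b(tone, fs)
rep_l = model_l(tone, fs)
```

Both functions return one value per 20 ms frame, which is the common output format that the correlation-based quality measure operates on.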

All processing algorithms are implemented using the signal processing software “si++” running on a Unix workstation. The calculation of the quality measure QP for one sentence pair of 8 s duration takes 22 times real time on an SGI Indy.


4 Results with the ETSI Halfrate Selection Test

Figures 2 to 6 give the results of the speech quality measurement for the five different preprocessing models.

The subjectively measured speech quality (MOS) of four test sentences processed under the same codec conditions is plotted as a function of the objective speech quality measure QX. In the upper left corner of each panel the Spearman rank correlation coefficient rs, the linear correlation coefficient r and the standard deviation SD of a second-order polynomial least-squares fit are shown. The numbers 1 to 6 at the data points refer to the codec numbers in the test. The data for the Modulated Noise Reference Unit (MNRU) are plotted by an m.

The correlation coefficients and standard deviations of the estimate in MOS for a polynomial least-squares fit are given in the table in figure 7.

[Figure 2, scatter plot “adapted Bark Spectral Distortion”: MOS (1 to 5) versus QB (0.92 to 1). Data points are labeled 1 to 6 (codecs) and m (MNRU). rs: 0.842, r: 0.845, SD: 0.370.]

Figure 2: MOS versus quality measure QB

[Figure 3, scatter plot “Instantaneous logarithmic compression”: MOS (1 to 5) versus QL (0.998 to 1). Data points are labeled 1 to 6 (codecs) and m (MNRU). rs: 0.866, r: 0.777, SD: 0.463.]

Figure 3: MOS versus quality measure QL

[Figure 4, scatter plot “Hair cell model”: MOS (1 to 5) versus QH (0.95 to 1). Data points are labeled 1 to 6 (codecs) and m (MNRU). rs: 0.759, r: 0.728, SD: 0.505.]

Figure 4: MOS versus quality measure QH

Figure 5: MOS versus quality measure QP (perception model). Axes: QP from 0.86 to 1 (horizontal), MOS from 1 to 5 (vertical). Data points are labeled 1 - 6 for the codecs and m for the MNRU conditions. Panel statistics: rs = 0.924, r = 0.920, SD = 0.279.


Figure 6: MOS versus quality measure QM (modulation analysis model). Axes: QM from 0.88 to 1 (horizontal), MOS from 1 to 5 (vertical). Data points are labeled 1 - 6 for the codecs and m for the MNRU conditions. Panel statistics: rs = 0.815, r = 0.811, SD = 0.422.

preprocessing   r       rs      SD
BSD             0.845   0.842   0.370
Logarithm       0.777   0.866   0.463
Hair cell       0.728   0.759   0.505
Perception      0.920   0.924   0.279
Modulation      0.811   0.815   0.422

Figure 7: Correlation coefficient r, rank correlation coefficient rs and standard deviation SD of a second-order polynomial least-squares fit for objective quality measure and subjective MOS.

5 Discussion

The definition of Q as the correlation coefficient implies a nonlinear relation to MOS data. Thus, the Spearman rank correlation coefficient rs is the best measure of the predictive ability of the objective quality measure Q.

A comparison across the five versions shows the following results: The objective quality measure Q covers only a small range of the “objective scale” from 0 to 1 for all models. With certain models (BSD and hair cell), the objective quality measures for the MNRUs deviate substantially from those of the other codecs in the test. The perception model without modulation filterbank yields the best results.

Obviously, the types of distortion introduced by different codec schemes differ considerably, leading to different “clusters” within the MOS / Q data for different codecs for certain preprocessing models. Only if all differences between input and output signal are evaluated in the perceptually correct way should no deviation between “clusters” occur.

The five preprocessing models differ in the way amplitude compression is performed. Two reasons can be considered why the perception model yields the best results in measuring speech quality compared to the other models: The adaptation loops in the perception model account for a realistic dynamic compression $y = \sqrt[32]{x} \approx \log(x)$ for static signals. They also account for realistic temporal properties, including forward/backward/simultaneous masking and gap detection.

The modulation filterbank model would be expected to provide even better results than the perception model because the former models test tone integration and amplitude modulation perception in a more realistic way. However, the definition of Q as the correlation coefficient between corresponding samples of the two representations may not yet be optimal in the modulation frequency domain, as the modulation frequency channels have not yet been weighted according to perceptive relevance. This might have caused poorer performance.
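The static compression attributed to the adaptation loops can be illustrated numerically. The sketch below rests on an assumption that is not stated in this section: the exponent 1/32 is taken to arise from five stages in series, each of which reduces to a square root for a stationary input (a loop that divides its input by its own output settles at y = x / y, i.e. y = √x). The function name `cascaded_sqrt` and the chosen 100 dB amplitude range are illustrative only.

```python
import numpy as np

def cascaded_sqrt(x, stages=5):
    """Apply `stages` square-root stages in series: x -> x ** (1 / 2**stages)."""
    y = np.asarray(x, dtype=float)
    for _ in range(stages):
        y = np.sqrt(y)
    return y

# Amplitudes spanning 100 dB, spaced uniformly on a logarithmic axis.
x = np.logspace(0, 5, 200)
y = cascaded_sqrt(x)

# Linear correlation between the 32nd root and log10(x) over this range.
similarity = np.corrcoef(y, np.log10(x))[0, 1]
print(f"correlation with log10(x): {similarity:.4f}")
```

Over such a range the 32nd root is close to an affine function of log(x), which is the sense in which the compression is approximately logarithmic.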

References

[1] M. Hansen and B. Kollmeier. “Anwendbarkeit eines psychoakustisch motivierten Sprachvorverarbeitungsmodells für die Sprachqualitätsmodellierung”. In Elektronische Sprachsignalverarbeitung, pages 34–39. ITA Dresden, 1995.

[2] J. G. Beerends and J. A. Stemerdink. “A Perceptual Speech Quality Measure based on a Psychoacoustic Sound Perception”. J. Audio Eng. Soc., 42(3):115–123, 1994.

[3] J. G. Beerends and J. A. Stemerdink. “Modelling a Cognitive Aspect in the Measurement of the Quality of Music Codecs”. In 96th AES Convention Amsterdam, New York, 1994. (Preprint).

[4] J. Berger and A. Merkel. “Psychoakustisch motivierte Einzelmaße als Ansatz zur objektiven Qualitätsbestimmung von ausgewählten Sprachcodiersystemen”. In Elektronische Sprachsignalverarbeitung, Proceedings, TU-Berlin, 1994.

[5] ETSI Technical Report. Voice Transmission Quality from Mouth to Ear of 3.1 kHz handset telephony across networks. ETSI Secretariat, Sophia Antipolis, France, 1994. Preliminary version 0.3.

[6] S. R. Quackenbush, T. P. Barnwell, and M. A. Clements. Objective Measures for Speech Quality. Prentice Hall Inc., Englewood Cliffs, USA, 1988.

[7] M. R. Schroeder, B. S. Atal, and J. L. Hall. “Objective Measure of Certain Speech Signal Degradations Based on Masking Properties of Human Auditory Perception”. In Frontiers of Speech Communication Research, pages 217–229, London, 1979. Academic Press. Edited by B. Lindblom and S. Öhman.

[8] S. Wang, A. Sekey, and A. Gersho. “Auditory Distortion Measure for Speech Coding”. In IEEE Proc. Int. Conf. Acoust., Speech Signal Processing, pages 493–496, 1991.

[9] R. Patterson, I. Nimmo-Smith, J. Holdsworth, and P. Rice. “An efficient auditory filterbank based on the gammatone function”. 1987. Paper presented at a meeting of the IOC Speech Group on Auditory Modelling at RSRE.

[10] J. Verhey and K. A. Fischer. “Ein einfaches Dynamikkompressionsmodell, das durch zwei Zeitkonstanten charakterisiert ist”. In Fortschritte der Akustik DAGA '94, pages 1077–1080, Bad Honnef, 1994. DPG-Kongress GmbH.

[11] T. Dau. “Der optimale Detektor in einem Computermodell zur Simulation von psychoakustischen Experimenten”. Master's thesis, Universität Göttingen, 1992.

[12] D. Püschel. “Prinzipien der zeitlichen Analyse beim Hören”. PhD thesis, Universität Göttingen, 1988.

[13] T. Dau. “Modeling auditory processing of amplitude modulation”. PhD thesis, Universität Oldenburg, 1996.

This work has been supported by the research center of Deutsche Telekom AG.


Appendix B

The relation of qC and the likelihood ratio l

In this appendix, the threshold modeling approach described in chapter 3, section 3.4.1 is compared to the approach presented by Dau (1996), Dau and Püschel (1993), and Dau et al. (1996a).

The first difference between the two approaches is related to the internal representations of the signals. In both cases the representation is obtained with the same preprocessing model. However, in the modeling approach based on qC, the representation is downsampled by a factor of 320 prior to the calculation of the correlation coefficient between two corresponding representations. The downsampling is performed by temporal averaging across 20 ms segments of the representation (cf. chapter 2, section 2.3.4). In Dau’s simulations the representation is not downsampled at all.

Additionally, in many psychoacoustical experiments, the test signal to be detected in the presence of a masker signal does not necessarily have the same duration as the masker. The two stimuli do not even need to be present simultaneously within the test interval. In contrast, in the experiments on speech quality assessment and detection of speech modulated noise distortions described in this study, the distortion was always simultaneous and had the same duration as the speech stimulus. This difference may have an influence on the choice of the target template (see below).

The approach to model thresholds based on a constant value of qC was illustrated by the computation of qC at equidistantly spaced values of Q for each value of the parameter fc. Intermediate values of qC(Q, fc) were interpolated. This method can be directly replaced by Dau’s approach of running an adaptive procedure that converges to the test parameter at threshold.
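The grid-and-interpolate procedure described above can be sketched as follows. The function name `threshold_from_grid`, the grid spacing, the qC values and the criterion are all invented for illustration, and Q is assumed to be expressed in dB; the sketch also assumes that qC increases monotonically with Q and uses linear interpolation, which the text does not specify.

```python
import numpy as np

def threshold_from_grid(q_grid, qc_values, qc_criterion):
    """Interpolate the test parameter Q at which qC reaches the criterion.

    Assumes qC increases strictly with Q across the grid.
    """
    q_grid = np.asarray(q_grid, dtype=float)
    qc_values = np.asarray(qc_values, dtype=float)
    if np.any(np.diff(qc_values) <= 0):
        raise ValueError("qC must increase strictly with Q for this sketch")
    if not qc_values[0] <= qc_criterion <= qc_values[-1]:
        raise ValueError("criterion lies outside the sampled qC range")
    # np.interp needs increasing x-coordinates, here the qC values.
    return float(np.interp(qc_criterion, qc_values, q_grid))

# Hypothetical grid: Q in dB, with invented qC values for one center frequency fc.
q_grid = np.arange(0.0, 35.0, 5.0)  # 0, 5, ..., 30
qc_values = np.array([0.90, 0.94, 0.97, 0.985, 0.993, 0.997, 0.999])
q_threshold = threshold_from_grid(q_grid, qc_values, qc_criterion=0.99)
print(q_threshold)
```

An adaptive procedure would instead evaluate qC only at the parameter values visited by the track, which saves computation when the grid would otherwise have to be fine.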

Concerning the modeling of the decision problem, the following reasoning is mainly reproduced from Dau (1996) and Dau et al. (1996a), with only minor alterations that apply to the terminology of the current experiments. It can also be found in standard signal detection theory (Green and Swets, 1966).

In an n-interval forced-choice (n-IFC) method, a decision has to be made as to which of the n intervals contains the signal with the distortion noise. For simplicity, n = 2 is assumed here. The two intervals presented to the subject produce the auditory events e1(t) and e2(t), which are the internal representations of the incoming signal waveforms x1(t) and x2(t). One of them contains the reference situation (speech stimulus alone, s(t)), the other the test situation (speech stimulus + noise distortion, s(t) + n(t)). Without loss of generality it can be assumed that x1(t) = s(t) and x2(t) = s(t) + n(t).

x1 and x2 are known and fixed. However, e1 and e2 will be affected by a certain variability due to internal noise, which is always present in the auditory system. The assumption of an internal noise is necessary for the formal modeling of the decision problem. It was not included in the original development of qC as a speech quality measure for supra-threshold distortions. The variance of this noise is subject to a calibration of the model from experimental data.

To solve the decision problem, the two received auditory events, e1 and e2, are compared with a known, fixed target situation. In our case, the representation of the speech signal alone, eref(t), is taken as this known target situation. It is often also called a “template”.

The comparison is carried out by calculating the correlation coefficient r ≡ qC, which is the normalized cross correlation of the two representations, in this case qC(e1, eref) and qC(e2, eref).

Because the preprocessing algorithm does not actually add an internal noise, interval 1, which contains the reference signal alone, will always produce qC(e1, eref) ≡ 1, since e1 and eref are identical. The other interval, x2(t), will produce qC(e2, eref) < 1, because the additional distortion noise reduces the correlation.

Of course, not every deviation of qC(e2, eref) from 1 will correspond to a detectable difference between x1(t) and x2(t). Only if the deviation exceeds a certain criterion, 1 − qC(e2, eref) > c, will the decision be made that the interval x2 contains a distortion noise n(t) that is audible in the presence of s(t). The value of the criterion c depends on the variance of the internal noise that was assumed to affect the auditory events, i.e., the internal representations.
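The decision rule 1 − qC(e2, eref) > c can be written down in a few lines. The following is a sketch under stated assumptions: random arrays stand in for the internal representations, the function names `q_c` and `distortion_detected` are hypothetical, and the criterion value is arbitrary rather than calibrated from experimental data.

```python
import numpy as np

def q_c(rep_a, rep_b):
    """Correlation coefficient between two internal representations."""
    a = np.ravel(np.asarray(rep_a, dtype=float))
    b = np.ravel(np.asarray(rep_b, dtype=float))
    a = a - a.mean()
    b = b - b.mean()
    return float(np.dot(a, b) / np.sqrt(np.dot(a, a) * np.dot(b, b)))

def distortion_detected(e_test, e_ref, criterion):
    """Decision rule: report a distortion if 1 - qC exceeds the criterion c."""
    return (1.0 - q_c(e_test, e_ref)) > criterion

rng = np.random.default_rng(0)
e_ref = rng.standard_normal((50, 19))                 # stand-in for the template
e_1 = e_ref.copy()                                    # reference interval
e_2 = e_ref + 0.3 * rng.standard_normal(e_ref.shape)  # interval with distortion

c = 0.01  # hypothetical criterion, not a calibrated value
print(distortion_detected(e_1, e_ref, c), distortion_detected(e_2, e_ref, c))
```

Since no internal noise is added, the reference interval always yields qC = 1 up to rounding, so only the interval containing the distortion can exceed the criterion.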

On the assumption that the auditory events, e1 and e2, have a Gaussian distribution with identical variance at each point of time and a difference in the mean that corresponds to the extra noise signal, an analytical expression for the decision rule can be deduced that is based on a likelihood argument.


Apart from a constant additive term, the log of the decision variable is the cross correlation between the received signal and the expected signal. This cross correlation is the mathematically optimal decision variable under the above assumptions.

In their studies, Dau et al. compute the cross correlation between the normalized template and the received signal. If the cross correlation exceeds a certain criterion, it is decided that the actual signal contains the additional test signal. This criterion had a fixed numerical value for all experiments modeled so far. Similarly, qC is normalized by the definition of the correlation coefficient. Here too, qC has a fixed value at threshold. The difference between the two approaches merely corresponds to different calibrations of the decision criteria.

Another difference concerns the choice of the template. As already stated above, in the current approach based on qC, the representation of the reference situation is chosen as the template. It is correlated with the representation of the actual test sentence. In contrast to this, Dau chooses a supra-threshold realization of the test signal situation as the template. This realization of the test signal is generated indirectly by subtracting the representation of the reference interval from the representation of a supra-threshold test signal situation. This template is normalized and correlated with the representation of the actual test interval.

In sum, the two approaches are structurally very similar to each other except for the choice of the template and the different normalization procedures, which lead to an absolute shift of the decision criterion.


Appendix C

Calculation of the time-dependent speech quality measures

This appendix gives the formal description of how the time-dependent speech quality measures qC(t) (based on the correlation coefficient) and qS(t) (based on the mean squared difference) are calculated from the internal representations of the reference signal and test signal.

Prior to the downsampling, the internal representation $X_{i,j}$ is sampled at times $t = i\Delta t$, $i = 1 \ldots N$, and frequencies $f = (f_0 + j\Delta f)$, $j = 1 \ldots M$. By averaging across frames of duration $\tau = L\Delta t$, $X_{i,j}$ is transformed to $X'_{l,j}$ according to

$$X'_{l,j} = \frac{1}{L} \sum_{i=1}^{L} X_{(l-1)\cdot L+i,\,j}, \qquad l = 1 \ldots N/L. \tag{C.1}$$
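The frame averaging of eq. (C.1) amounts to a block mean along the time axis. A minimal sketch follows, with a hypothetical function name `frame_average` and a toy input; the handling of trailing samples is an assumption, since the text takes N to be a multiple of L.

```python
import numpy as np

def frame_average(x, frame_len):
    """Average an (N, M) representation over non-overlapping frames of L samples.

    Returns an (N // L, M) array; trailing samples that do not fill a frame
    are dropped (an assumption of this sketch).
    """
    x = np.asarray(x, dtype=float)
    n_frames = x.shape[0] // frame_len
    trimmed = x[: n_frames * frame_len]
    return trimmed.reshape(n_frames, frame_len, x.shape[1]).mean(axis=1)

# Toy input: N = 6 time samples, M = 2 channels, L = 3.
x = np.arange(12, dtype=float).reshape(6, 2)
x_ds = frame_average(x, frame_len=3)
print(x_ds)
```

With the downsampling factor of 320 quoted in Appendix B for 20 ms frames, `frame_len` would be 320.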

From the representations $X'_{l,j}$ and $Y'_{l,j}$ of the reference and test signal, respectively, the continuous-time speech quality measure was calculated in two different ways that correspond to the two definitions of the mean squared difference measure qS and the correlation qC described in chapter 2, sections 2.3.4 and 2.3.4. However, no band-weighting was applied to the representations for the continuous-time speech quality measure.

At times t = lτ , the mean squared difference measure qS(l) is defined as

$$q_S(l) = \frac{1}{M} \sum_{j=1}^{M} \left( X'_{l,j} - Y'_{l,j} \right)^2 \tag{C.2}$$


and the correlation measure qC(l) by

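$$q_C(l) = \frac{\sum_{j=1}^{M} \left( X'_{l,j} - X'_{l} \right) \cdot \left( Y'_{l,j} - Y'_{l} \right)}{\sqrt{\sum_{j} \left( X'_{l,j} - X'_{l} \right)^2}\, \sqrt{\sum_{j} \left( Y'_{l,j} - Y'_{l} \right)^2}}, \tag{C.3}$$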

where $X'_{l}$ and $Y'_{l}$ denote the means of the $X'_{l,j}$ and $Y'_{l,j}$ across the frequency index j. The above equations are similar to the definitions of qS and qC. The difference lies in the missing summation along the time index i. Note that both the mean-square summation and the correlation operation are performed across the two corresponding sets of M = 19 spectral values at one point of time.
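Eqs. (C.2) and (C.3) operate on one frame at a time, across the M channels. The sketch below uses hypothetical function names and random arrays in place of real internal representations; a frame whose M values are all identical would give a zero denominator in the correlation, a case this sketch does not handle.

```python
import numpy as np

def q_s_per_frame(x_ds, y_ds):
    """Eq. (C.2): mean squared difference across the M channels of each frame."""
    x_ds = np.asarray(x_ds, dtype=float)
    y_ds = np.asarray(y_ds, dtype=float)
    return np.mean((x_ds - y_ds) ** 2, axis=1)

def q_c_per_frame(x_ds, y_ds):
    """Eq. (C.3): correlation coefficient across the M channels of each frame."""
    x_ds = np.asarray(x_ds, dtype=float)
    y_ds = np.asarray(y_ds, dtype=float)
    xc = x_ds - x_ds.mean(axis=1, keepdims=True)
    yc = y_ds - y_ds.mean(axis=1, keepdims=True)
    num = np.sum(xc * yc, axis=1)
    den = np.sqrt(np.sum(xc ** 2, axis=1)) * np.sqrt(np.sum(yc ** 2, axis=1))
    return num / den

rng = np.random.default_rng(1)
x_ds = rng.standard_normal((10, 19))                 # 10 frames, M = 19 channels
y_ds = x_ds + 0.5 * rng.standard_normal(x_ds.shape)  # distorted copy
print(q_s_per_frame(x_ds, y_ds).shape, q_c_per_frame(x_ds, y_ds).shape)
```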

Yet another way to implement a time-dependent frame-wise objective speech quality measure is to calculate the measure not only across the center frequency dimension, but across a local two-dimensional window of the two internal representations $X'_{l,j}$ and $Y'_{l,j}$. Instead of, e.g., applying a temporal window of τ = 100 ms duration for downsampling by averaging, five consecutive frames with τ = 20 ms length each may be combined to yield a short-time two-dimensional representation. In general, this (rectangular) window will have a length of K (temporal) samples.

This results in the two respective measures

$$q^{*}_{S}(l) = \frac{1}{M \cdot K} \sum_{j=1}^{M} \sum_{k=0}^{K-1} \left( X'_{l+k,j} - Y'_{l+k,j} \right)^2, \qquad l = 1 \ldots \frac{N}{KL} \tag{C.4}$$

and

$$q^{*}_{C}(l) = \frac{\sum_{j=1}^{M} \sum_{k=0}^{K-1} \left( X'_{l+k,j} - X^{\prime *}_{l} \right) \cdot \left( Y'_{l+k,j} - Y^{\prime *}_{l} \right)}{\sqrt{\sum_{j=1}^{M} \sum_{k=0}^{K-1} \left( X'_{l+k,j} - X^{\prime *}_{l} \right)^2}\, \sqrt{\sum_{j=1}^{M} \sum_{k=0}^{K-1} \left( Y'_{l+k,j} - Y^{\prime *}_{l} \right)^2}}, \qquad l = 1 \ldots \frac{N}{KL}, \tag{C.5}$$

where $X^{\prime *}_{l}$ and $Y^{\prime *}_{l}$ are correspondingly defined as the means of the two-dimensional windowed representation across the frequency index j and frame-number index k. Note that for these time-varying speech quality measures the averaging and correlation operations are performed across the corresponding K · 19 values, and at times t = (l · K)τ.
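The two-dimensional variants of eqs. (C.4) and (C.5) can be sketched by flattening each block of K frames and M channels into one vector. The function name `windowed_measures` is hypothetical, and the blocks are taken here as non-overlapping, which is an interpretation suggested by the evaluation times t = (l · K)τ rather than something the equations state explicitly.

```python
import numpy as np

def windowed_measures(x_ds, y_ds, k):
    """Eqs. (C.4) and (C.5) over blocks of K consecutive frames times M channels.

    Blocks are taken as non-overlapping (an assumption of this sketch).
    """
    x_ds = np.asarray(x_ds, dtype=float)
    y_ds = np.asarray(y_ds, dtype=float)
    n_blocks = x_ds.shape[0] // k
    # Flatten each block of K frames and M channels into one row of K*M values.
    xb = x_ds[: n_blocks * k].reshape(n_blocks, -1)
    yb = y_ds[: n_blocks * k].reshape(n_blocks, -1)
    q_s = np.mean((xb - yb) ** 2, axis=1)
    xc = xb - xb.mean(axis=1, keepdims=True)
    yc = yb - yb.mean(axis=1, keepdims=True)
    q_c = np.sum(xc * yc, axis=1) / np.sqrt(
        np.sum(xc ** 2, axis=1) * np.sum(yc ** 2, axis=1)
    )
    return q_s, q_c

rng = np.random.default_rng(3)
x_ds = rng.standard_normal((20, 19))                 # 20 frames of 20 ms, M = 19
y_ds = x_ds + 0.5 * rng.standard_normal(x_ds.shape)  # distorted copy
q_s_star, q_c_star = windowed_measures(x_ds, y_ds, k=5)
print(q_s_star.shape, q_c_star.shape)
```

With K = 5 and 20 ms frames, each value summarizes 5 · 19 = 95 samples and covers 100 ms.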

Both time-dependent distance measures were mapped to the r(t) scale by means of a simple transformation and subsequently low-pass filtered at 0.5 Hz. For qS(l), an exponential transformation to the r(t) quality scale of the form

$$r_S(q_S(t)) = r_0\, e^{\lambda_S \cdot q_S(t)} \tag{C.6}$$

was fitted. For qC(t), an exponential transformation of the form

$$r_C(q_C(t)) = 10 + 80 \cdot e^{\lambda_C \cdot (q_C(t) - q_0)} \tag{C.7}$$

was also fitted. The parameters resulting from a Nelder-Mead type simplex search optimization were r0 = 100 and λS = −0.078, and q0 = 1 and λC = 16.30, in the case of τ = 20 ms.
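As a quick check of the two mappings, the sketch below evaluates eqs. (C.6) and (C.7) with the reported parameters for τ = 20 ms. The function names are hypothetical, and the subsequent 0.5 Hz low-pass filtering is deliberately left out because its filter type is not specified here.

```python
import numpy as np

R0, LAMBDA_S = 100.0, -0.078  # parameters reported for eq. (C.6)
Q0, LAMBDA_C = 1.0, 16.30     # parameters reported for eq. (C.7)

def r_from_qs(q_s):
    """Eq. (C.6): map the mean squared difference measure to the r scale."""
    return R0 * np.exp(LAMBDA_S * np.asarray(q_s, dtype=float))

def r_from_qc(q_c):
    """Eq. (C.7): map the correlation measure to the r scale."""
    return 10.0 + 80.0 * np.exp(LAMBDA_C * (np.asarray(q_c, dtype=float) - Q0))

# Identical representations give qS = 0 and qC = 1.
print(float(r_from_qs(0.0)), float(r_from_qc(1.0)))  # prints 100.0 90.0
```

The mapping for qS decreases from 100 as the distance grows, while the mapping for qC ranges from 90 at perfect correlation down towards a floor of 10.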

The model predictions by rqC(t) and rqS(t) for a frame duration of τ = 100 ms are shown in Fig. C.1 and Fig. C.2, respectively.

The results with the short-time two-dimensionally calculated measures q∗C(t) and q∗S(t) over 5 frames of 20 ms duration are displayed in Fig. C.3 and Fig. C.4.


Figure C.1: Continuous-time speech quality prediction by the objective correlation speech quality measure rqC(t) for a frame duration of 100 ms. Four panels (stimulus A, profile 1; stimulus B, profile 1; stimulus A, profile 2; stimulus B, profile 2; tau = 100 ms) show slider position [mm] (0 to 100) versus time [s] (0 to 40) for the curves subj., rqC(t) and target.


Figure C.2: Continuous-time speech quality prediction by the objective mean squared difference speech quality measure rqS(t) for a frame duration of 100 ms. Four panels (stimulus A, profile 1; stimulus B, profile 1; stimulus A, profile 2; stimulus B, profile 2; tau = 100 ms) show slider position [mm] (0 to 100) versus time [s] (0 to 40) for the curves subj., rqS(t) and target.


Figure C.3: Continuous-time speech quality prediction by the objective correlation speech quality measure qC(t) for a frame duration of 5 times 20 ms. Four panels (stimulus A, profile 1; stimulus B, profile 1; stimulus A, profile 2; stimulus B, profile 2; tau = 5*20 ms) show slider position [mm] (0 to 100) versus time [s] (0 to 40) for the curves subj., rqC(t) and target.


Figure C.4: Continuous-time speech quality prediction by the objective mean squared difference speech quality measure qS(t) for a frame duration of 5 times 20 ms. Four panels (stimulus A, profile 1; stimulus B, profile 1; stimulus A, profile 2; stimulus B, profile 2; tau = 5*20 ms) show slider position [mm] (0 to 100) versus time [s] (0 to 40) for the curves subj., rqS(t) and target.


Appendix D

Application of qC for evaluating a hardware implementation of the preprocessing model

The results of this thesis show that the preprocessing model employed in the measure qC yields a representation of the perceptually relevant information contained in a sound signal. The preprocessing accounts for a realistic modeling of the signal transformation in the hearing process, which has been demonstrated by the successful prediction of speech quality MOS data of various test data bases. This feature can be exploited in the reverse fashion:

A modification of the preprocessing model can be evaluated in terms of the optimality of the resulting signal representation by applying the modified preprocessing algorithm to the speech quality prediction in the way described in chapter 2. The hardware implementation of an auditory signal preprocessing represents such a modification of the original preprocessing algorithm, because a fixed-point implementation alters the numerical precision of the computation. For the Oldenburg preprocessing model, a VLSI chip is currently being developed. In order to reduce the chip surface (i.e., the production costs), the word lengths of the internal number quantization have to be reduced, which decreases the calculation precision. In particular, a numerical analysis of the word lengths within the nonlinear feedback loops of the preprocessing model (containing a division element) is a crucial point here, because standard SNR considerations for quantization effects cannot be applied.

Therefore, a precise model of the hardware algorithm with limited fixed-point precision was implemented. It was employed for the calculation of the internal representation within the objective speech quality measure qC. The resulting measure, $q_C^{\mathrm{fix}}$, was used to calculate the objective speech quality for the ETSI Halfrate and the ITU 8 kbit/s test data bases. In this way, the necessary quantization widths could be obtained by iteratively selecting the word lengths of the internal number representation until a satisfactory performance of the resulting speech quality measure $q_C^{\mathrm{fix}}$ was reached regarding the prediction of the speech quality data bases (Brucke et al., 1998).
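The iterative word-length selection can be pictured with a toy example. Everything in the sketch below is hypothetical: the helper names, the rounding-based quantizer, the target value, and above all the score, which here is simply the correlation between a random signal and its quantized version, whereas the procedure described above scores the prediction of the speech quality data bases with the fixed-point preprocessing model.

```python
import numpy as np

def quantize(x, frac_bits):
    """Round to a fixed-point grid with `frac_bits` fractional bits."""
    scale = 2.0 ** frac_bits
    return np.round(np.asarray(x, dtype=float) * scale) / scale

def smallest_wordlength(evaluate, candidates, target):
    """Return the smallest candidate whose score reaches `target`, else None."""
    for bits in sorted(candidates):
        if evaluate(bits) >= target:
            return bits
    return None

# Toy stand-in for the evaluation: correlation between a full-precision signal
# and its quantized version. The thesis instead scores the MOS prediction.
rng = np.random.default_rng(2)
signal = rng.standard_normal(2000)

def toy_score(bits):
    return float(np.corrcoef(signal, quantize(signal, bits))[0, 1])

chosen = smallest_wordlength(toy_score, range(0, 13), target=0.999)
print(chosen)
```

Searching from the shortest word length upwards returns the cheapest setting that still meets the target, which mirrors the aim of minimizing chip surface without degrading the quality prediction.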


Bibliography

ANSI. Speech Intelligibility Index (SII). S 3.79, American National Standards Institute.

Baillard, P., B. Mabilleau, S. Morisette and J. Soumagne (1992). PERCEVAL: Perceptual Evaluation of the Quality of Speech Coders. J. Audio Eng. Soc., 40(1):21–31.

Beerends, J. G. (1995), Measuring the Quality of Speech and Music Codecs, an Integrated Psychoacoustic Approach. In 98th AES Convention Paris, New York. (Preprint).

Beerends, J. G. and J. A. Stemerdink (1994a), Modeling a Cognitive Aspect in the Measurement of the Quality of Music Codecs. In 96th AES Convention Amsterdam, New York. (Preprint).

Beerends, J. G. and J. A. Stemerdink (1994b). A Perceptual Speech Quality Measure based on a Psychoacoustic Sound Perception. J. Audio Eng. Soc., 42(3):115–123.

Berger, J. (1996), Ein Ansatz zur instrumentellen Sprachqualitätsabschätzung in Telefonverbindungen im Festnetz der Deutschen Telekom. In Workshop on Quality Assessment in Speech, Audio and Image Communication, Proceedings, 17–24, Darmstadt. ITG/EURASIP.

Berger, J. and A. Merkel (1994a), An experimental system for objective speech quality measurement. In Proc. Workshop “Speech Quality Assessment”, 12–14, Ruhr-Uni Bochum.

Berger, J. and A. Merkel (1994b), Psychoakustisch motivierte Einzelmaße als Ansatz zur objektiven Qualitätsbestimmung von ausgewählten Sprachcodiersystemen. In Proc. Elektronische Sprachsignalverarbeitung, TU-Berlin.


Brucke, M., W. Nebel, A. Schwarz, B. Mertsching, M. Hansen and B. Kollmeier (1998), Digital VLSI-implementation of a Psychoacoustically and Physiologically Motivated Speech Preprocessor. In S. Greenberg (Ed.), Proc. Computational Hearing, Il Ciocco. ASI (Advanced Study Institute). (in print).

CCITT (1989). Modulated Noise Reference Unit (MNRU). Blue Book Vol. V Rec. P.81.

Chatfield, C. (1983). Statistics for Technology (3rd ed.). Chapman and Hall, London.

Colomes, C., M. Lever and Y.F. Dehery (1994), A Perceptual Objective Measurement System (POM) for the Quality Assessment of Perceptual Codecs. In Proc. 96th AES Convention, Amsterdam. preprint 3801.

Dau, T. (1992). Der optimale Detektor in einem Computermodell zur Simulation von psychoakustischen Experimenten. Master’s thesis, Universität Göttingen.

Dau, T. (1996). Modeling auditory processing of amplitude modulation. Ph.D. thesis, Universität Oldenburg.

Dau, T. and D. Püschel (1993), A quantitative model of the effective signal processing in the auditory system. In Contributions to Psychological Acoustics, 107–120, BIS Universität Oldenburg. Sixth Oldenburg Symposium On Psychological Acoustics, edited by A. Schick.

Dau, T., D. Püschel and A. Kohlrausch (1996a). A quantitative model of the ‘effective’ signal processing in the auditory system: I. Model structure. J. Acoust. Soc. Am., 99:3615–3622.

Dau, T., D. Püschel and A. Kohlrausch (1996b). A quantitative model of the ‘effective’ signal processing in the auditory system: II. Simulations and measurements. J. Acoust. Soc. Am., 99:3623–3631.

de Ridder, H. and R. Hamberg (1997). Continuous Assessment of Image Quality. SMPTE Journal, 106(2):123–128.

Dimolitsas, S. (1993). Subjective assessment methods for the measurement of digital speech quality. In B. Atal, V. Cuperman and A. Gersho (Eds.), Speech and Audio Coding for Wireless and Network Applications, 43–54. Kluwer Academic Publishers, Boston.


Eilemann, A. (1994). Mithörschwellen und Trennschwellen harmonischer Tonkomplexe. Ph.D. thesis, Universität Göttingen.

ETSI, TM/TM5/TCH-HS (1991). Global analysis of selection tests: Basic data. Technical Report 91/74, ETSI.

ETSI, TM/TM5/TCH-HS (1992). Selection Test Phase II: Listening test results with German speech samples. Technical Report 92/35, FI/DBP-Telekom. Experiment 1, IM 4.

Fassel, R. (1994). Experimente und Simulationsrechnungen zur Wahrnehmung von Amplitudenmodulationen im menschlichen Gehör. Ph.D. thesis, Universität Göttingen.

Fastl, H. (1991), Evaluation and Measurement of perceived average loudness. In A. Schick, J. Hellbrück and R. Weber (Eds.), Contributions to Psychological Acoustics, Vol. V, 205–216, Oldenburg. BIS.

French, N.R. and J.C. Steinberg (1947). Factors Governing the Intelligibility of Speech Sounds. J. Acoust. Soc. Am., 19:90–119.

Gabriel, Birgitta (1996). Equal Loudness Level Contours: Procedures, Factors and Models. Ph.D. thesis, Universität Oldenburg. ISBN 3-8265-2049-1, Shaker Verlag, Aachen.

Green, D.M. and J.A. Swets (1966). Signal Detection Theory and Psychophysics. Wiley, New York.

Hamberg, R. and H. de Ridder (1995). Continuous assessment of perceptual image quality. J. Opt. Soc. Am. A, 12(12):2573–2577.

Hamberg, R. and H. de Ridder (1997). Time-varying Image Quality: Modeling the Relation between Instantaneous and Overall Quality. IEEE transactions on Systems, Man, and Cybernetics, part A. (To appear). Current version: IPO manuscript no. 1234.

Hansen, M. and B. Kollmeier (1996a), Implementation of a psychoacoustical preprocessing model for sound quality measurement. In Tutorial and Workshop on the Auditory Basis of Speech Perception, 79–82, Keele. ESCA.

Hansen, M. and B. Kollmeier (1996b), Prediction of Speech Quality based on Psychoacoustical Preprocessing Measures. In Workshop on Quality Assessment in Speech, Audio and Image Communication, Proceedings, 7–12, Darmstadt. ITG/EURASIP.


Hansen, M. and B. Kollmeier (1997a), On the relative importance of individual critical bands for the perception of speech quality. In A. Schick and M. Klatte (Eds.), Contributions to Psychological Acoustics, Vol. VII, 611–618, Oldenburg. BIS.

Hansen, M. and B. Kollmeier (1997b), Using a quantitative Psychoacoustical Signal Representation for Objective Speech Quality Measurement. In Proc. ICASSP ’97, 1387–1390, Munich. IEEE.

Hansen, M. and B. Kollmeier (1998a), Continuous Assessment and Modeling of Speech Transmission Quality. In Proc. ICA/ASA ’98, 229–230, Seattle.

Hansen, M. and B. Kollmeier (1998b), Kontinuierliche Beurteilung von Sprachqualität: Messung und Modellierung. In Fortschritte der Akustik DAGA ’98, (in print), Zürich. Dega.

Herre, J., E. Eberlein, H. Schott and K. Brandenburg (1992), Advanced audio measurement system using psychoacoustic properties. In Proc. 92nd AES Convention, Vienna. preprint 3321.

Heute, U. (1996), Tutorial: Instrumental Speech-Quality Measures: State, Developments, Questions. In Workshop on Quality Assessment in Speech, Audio and Image Communication, Proceedings, 1–3, Darmstadt. ITG/EURASIP.

Hohmann, V. and B. Kollmeier (1995). The effect of dynamic compression on speech intelligibility. J. Acoust. Soc. Am., 97(2):1191–1195.

Hollier, M.P. and M.O. Hawksford (1995), A perception based speech quality assessment for telecommunications. In Proc. IEE Colloquium on Audio Engineering, London. Digest number 1995/089.

Hollier, M.P., M.O. Hawksford and D.R. Guard (1994), Error-activity and error entropy as a measure of psychoacoustic significance in the perceptual domain. In IEE Proc.-Vis. Image Signal Process., Vol. 141:3.

Hollier, M.P., M.O. Hawksford and D.R. Guard (1995). Algorithms for Assessing the Subjectivity of Perceptually Weighted Audible Errors. J. Audio Eng. Soc., 43(12):1041–1045.

Holube, I. (1993). Experimente und Modellvorstellungen zur Psychoakustik und zum Sprachverstehen bei Normal- und Schwerhörigen. Ph.D. thesis, Universität Göttingen.


Houtgast, T. (1997). personal communication.

Houtgast, T. and H. J. M. Steeneken (1985). A review of the MTF con-cept in room acoustics and its use for estimating speech intelligibility inauditoria. J. Acoust. Soc. Am., 77(3):1069–1077.

ITU-T (1996a). Methods for subjective determination of transmission quality. Series P: Telephone Transmission Quality, Recommendation P.800, ITU, Geneva.

ITU-T (1996b). Objective quality measurement of telephone-band (300–3400 Hz) speech codecs. Series P: Telephone Transmission Quality, Recommendation P.861, ITU, Geneva.

ITU-T (1996c). Software tools for speech and audio coding standardization. Series G: Transmission systems and media, Recommendation G.191, ITU, Geneva.

ITU-T (1996d). Subjective performance assessment of telephone-band and wideband digital codecs. Series P: Telephone Transmission Quality, Recommendation P.830, ITU, Geneva.

ITU-T, Study Group 12 (1994). Correlation between the PSQM and the subjective Results of ITU-T 8 kbit/s 1993 Speech Codec Test. Technical Report, ITU, Geneva. Question 13/12 SQEG.

Jekosch, U. (1993), Speech quality assessment and evaluation. In Proc. Eurospeech, 1387–1394.

Karjalainen, M. (1983), Objective measurement of distortion in speech signals by computational models of speech perception. In Proc. 11th ICA, 141–144, Paris.

Kitawaki, N. (1990). Quality assessment of coded speech. In S. Furui and M. Sondhi (Eds.), Advances in speech signal processing, 357–386. Marcel Dekker Inc., New York.

Koch, R. (1992). Gehörgerechte Schallanalyse zur Vorhersage und Verbesserung der Sprachverständlichkeit. Ph.D. thesis, Universität Göttingen.

Kollmeier, B. (1990). Meßmethodik, Modellierung und Verbesserung der Verständlichkeit von Sprache. Habilitation, Universität Göttingen.

Kortekaas, R. (1997). Physiological and psychoacoustical correlates of perceiving natural and modified speech. Ph.D. thesis, Technical University of Eindhoven.

Kroon, P. (1995). Evaluation of Speech Coders. In W.B. Kleijn and K.K. Paliwal (Eds.), Speech Coding and Synthesis, 467–494. Elsevier Science B.V., Amsterdam.

Kryter, K. D. (1962). Methods for the calculation and use of the Articulation Index. J. Acoust. Soc. Am., 34:1689–1697.

Kuwano, S. and H. Fastl (1989), Loudness evaluation of various kinds of nonsteady state sounds using the method of continuous judgement by category. In Proc. 13th ICA, Vol. 3, 365–368.

Levitt, H. (1971). Transformed up-down procedures in psychoacoustics. J. Acoust. Soc. Am., 49:467–477.

Mermelstein, P. (1979). Evaluation of a Segmental SNR Measure as an Indicator of the Quality of ADPCM Coded Speech. J. Acoust. Soc. Am., 66(6):1664–67.

Moore, B. C. J. and B. R. Glasberg (1987). Formulae describing frequency selectivity as a function of frequency and level and their use in calculating excitation patterns. Hear. Res., 28:209–225.

Münkner, S. (1993). Modellentwicklung und Messungen zur Wahrnehmung nichtstationärer akustischer Signale. Ph.D. thesis, Universität Göttingen.

Namba, S., S. Kuwano and H. Fastl (1988), Loudness of road traffic noise using the method of continuous judgement by category. In B. Berglund, U. Berglund, I. Karlson and T. Lindvall (Eds.), Noise as a Public Health Problem, Vol. 3, 241–246, Stockholm. Swedish Council for Building Research.

Patterson, R. (1976). Auditory filter shapes derived with noise stimuli. J. Acoust. Soc. Am., 59:640–654.

Patterson, R. and I. Nimmo-Smith (1980). Off-frequency listening and auditory filter asymmetry. J. Acoust. Soc. Am., 67:229–245.

Patterson, R., I. Nimmo-Smith, J. Holdsworth and P. Rice (1987), An efficient auditory filterbank based on the gammatone function. In Appendix B of SVOS Final Report: The auditory Filterbank. APU report 2341.

Petersen, K.T., S.D. Hansen and J.A. Sørensen (1997), Speech Quality Assessment of Compounded Digital Telecommunication Systems; Perceptual Dimensions. In Proc. ICASSP ’97, 1375–1378, Munich. IEEE.

Pickles, J. O. (1988). An Introduction to the Physiology of Hearing (2nd ed.). Academic Press, London.

Plomp, R. (1988). The negative effect of amplitude compression in multichannel hearing aids in the light of the modulation transfer function. J. Acoust. Soc. Am., 83:2322–2327.

Preminger, J.E. and D.J. Van Tasell (1995). Quantifying the Relation Between Speech Quality and Speech Intelligibility. J. Speech Hear. Res., 38:714–725.

Press, W., S. Teukolsky, W. Vetterling and B. Flannery (1992). Numerical Recipes in C - The Art of Scientific Computing (2nd ed.). Cambridge University Press.

Püschel, D. (1988). Prinzipien der zeitlichen Analyse beim Hören. Ph.D. thesis, Universität Göttingen.

Quackenbush, S.R., T.P. Barnwell and M.A. Clements (1988). Objective Measures for Speech Quality. Prentice Hall Inc., Englewood Cliffs, USA.

Sander, A. (1994). Psychoakustische Aspekte der subjektiven Trennbarkeit von Klängen. Ph.D. thesis, Universität Oldenburg.

Schroeder, M.R. (1968). Reference Signals for Signal Quality Studies. J. Acoust. Soc. Am., 44(6):1735–36.

Schroeder, M.R., B.S. Atal and J.L. Hall (1979), Objective Measure of Certain Speech Signal Degradations Based on Masking Properties of Human Auditory Perception. In B. Lindblom and S. Öhman (Eds.), Frontiers of Speech Communication Research, 217–229, London. Academic Press.

Sotscheck, J. (1984), Sätze für Sprachgütemessungen und ihre phonologische Anpassung an die deutsche Sprache. In Fortschritte der Akustik - DAGA ’84, 873–876.

Sotscheck, J. (1992). Sprachqualitätstests aus der Nachrichtentechnik. In B. Kollmeier (Ed.), Moderne Verfahren der Sprachaudiometrie, Buchreihe Audiologische Akustik. median-verlag, Heidelberg.

Steeneken, H.J.M. and T. Houtgast (1980). A Physical Method for Measuring Speech-Transmission Quality. J. Acoust. Soc. Am., 67:318–326.

Tchorz, J., K. Kasper, H. Reininger and B. Kollmeier (1997), On the Interplay between Auditory-Based Features and Locally Recurrent Neural Networks. In Proc. Eurospeech ’97, Vol. 4, 2075–2078.

Tchorz, J., M. Wesselkamp and B. Kollmeier (1996), Gehörgerechte Merkmalsextraktion zur robusten Spracherkennung in Störgeräuschen. In Fortschritte der Akustik - DAGA ’96, 532–533. DEGA.

Verhey, J. (1998). Personal communication.

Verhey, J. and T. Dau (1997), Modeling comodulation masking release. In A. Schick and M. Klatte (Eds.), Contributions to Psychological Acoustics, 389–396. BIS Universität Oldenburg. 7th Oldenburg Symposium on Psychological Acoustics.

Villchur, E. (1989). Comments on ‘The negative effect of amplitude compression in multichannel hearing aids in the light of the modulation transfer function’ [J. Acoust. Soc. Am. 83, 2322–2327]. J. Acoust. Soc. Am., 86:425–427.

Voran, S. (1994), Techniques for Comparing Objective and Subjective Speech Quality Tests. In Proc. Workshop “Speech Quality Assessment”, 59–64, Ruhr-Uni Bochum.

Wang, S., A. Sekey and A. Gersho (1991), Auditory Distortion Measure for Speech Coding. In IEEE Proc. Int. Conf. Acoust., Speech Signal Processing, 493–496.

Weber, R. (1991), The Continuous Loudness Judgement of Temporally Variable Sounds with an “Analog” Category Procedure. In A. Schick, J. Hellbrück and R. Weber (Eds.), Contributions to Psychological Acoustics, Vol. V, 267–294, Oldenburg. BIS.

Wesselkamp, M. (1994). Messung und Modellierung der Verständlichkeit von Sprache. Ph.D. thesis, Universität Göttingen.

Declaration

I hereby declare that I wrote the present dissertation independently and used only the aids indicated.

Oldenburg, 15 May 1998, Martin Hansen

Acknowledgements

At the end of this work I would like to sincerely thank all those who contributed to it in various ways.

I thank Prof. Dr. Dr. Birger Kollmeier for making this work possible in a research group with excellent working conditions. I benefited greatly from his consistently attentive supervision and from his commitment in the final phase.

I thank Prof. Dr. Volker Mellert for his interest in this work and for acting as second referee.

A heartfelt thank-you goes to Dr. Torsten Dau for his friendly and professional motivation at all times and for his way of seeing even non-physical problems from the right side.

I would like to thank all members of the research group “Medizinische Physik” and of the Graduiertenkolleg “Psychoakustik” for the pleasant working atmosphere. Thanks also go to all test subjects for taking part in the experiments.

I thank Dr. Stefan Uppenkamp, Jesko Verhey, Ralph-Peter Derleth and Dr. Birgitta Gabriel for critical questions, discussions, comments and corrections.

I thank Jens Berger and Andrea Merkel for suggestions and an uncomplicated collaboration, especially at the beginning of this work.

I thank Prof. Dr. Armin Kohlrausch, Dr. Reinier Kortekaas and Dr. Roelof Hamberg for important suggestions concerning some of the experiments and for a pleasant and productive time during a stay at the IPO in Eindhoven. I would also like to thank Dr. Steven van de Par, Dr. Andy Oxenham and Rene van der Horst for their interest in this work and their suggestions.

For stimulating discussions during visits to Soesterberg and The Hague I would like to thank Prof. Dr. Tammo Houtgast and Dr. John Beerends.

This work was financially supported by the research centre of Deutsche Telekom AG and by the project “Silicon Cochlea”, funded by the Deutsche Forschungsgemeinschaft. Additional funds for travel and stays abroad were granted by the Graduiertenkolleg Psychoakustik and by the Schulenberg Stiftung. My thanks go to these organizations as well.

Curriculum vitae

I, Martin Hansen, was born on 20 September 1967 in Flensburg as the first child of Willi Hansen and Maike Hansen, née Hoffmann.

In Husum on the North Sea I attended the primary school Klaus-Groth-Schule from 1973 to 1977 and the Gymnasium Hermann-Tast-Schule from 1977 to 1986. After the Abitur in May 1986, I began studying physics at the Georg-August-Universität in Göttingen in the winter semester 1987/88.

In July 1989 I passed the Vordiplom examination in physics there. In April 1991 I began my Diplom thesis at the Drittes Physikalisches Institut in Göttingen, in the research group of Prof. Dr. Manfred R. Schroeder, on a topic in room acoustics (investigation of a method for determining the reverberation time from music signals). From the winter semester 1991/92 I studied Danish at the Scandinavian studies department in Göttingen. In April 1993 I successfully completed my physics studies with the Diplom examination.

After my studies I received a scholarship from the DAAD and the Danish Ministry of Education, which allowed me to spend September 1993 to May 1994 as a guest at the “Acoustics Laboratory” of the Technical University of Denmark in Lyngby. There I continued to work on room acoustics and on speech acoustics with Prof. Finn Jacobsen and Prof. Torben Poulsen.

Since September 1994 I have been working as a research associate in the research group Medizinische Physik at the Carl-von-Ossietzky-Universität Oldenburg. Here I prepared the present dissertation under the supervision of Prof. Dr. Dr. Birger Kollmeier. Further duties during this time included supervising an interdisciplinary student project group (“Design of an integrated circuit for auditory-based preprocessing of acoustic signals”) and work on the project “Silicon Cochlea”.
