Subjective Sound Quality Assessment of Mobile Phones for Production Support
Thorsten Drascher, Martin Schultes
Workshop on Wideband Speech Quality in Terminals and Networks: Assessment and Prediction, 8-9 June 2004, Mainz, Germany
Subjective Audio Quality Assessment, June 2004 — © Siemens, 2004
Introduction
The goal of the tests presented in this talk is to ensure customer acceptance of audio quality based on statistically validated data.
Customers rate the sum of:
Echo cancellation, noise reduction, automatic gain control, …
This conflicts with the ancillary conditions of:
Short test time (no waste of production capacity)
Low cost
Objective measurements correlate only to a limited extent with subjective sound perception.
Subjective audio quality tests are executed before the release for unrestricted serial production.
Former results were often not reliable, due to friendly users and too few tests to guarantee statistical validity.
Presentation Outline
Test Design
Laboratory or in-situ tests?
Laboratory test design
Conversational task
Statistical reliability
First Test Presentation
Overall Quality
Most Annoying Properties
Discussion & Outlook
Test Design
Typical conversation situations for a mobile phone
Single Talk
Double talk
Two different test subject groups
Naive users
Expert users
Different recommended test methods
Absolute category rating
Comparative category rating
Degradation category rating
Threshold Method
Quantal-response detectability tests
Test Design (ctd.)
Iterative procedure:
Naive user tests, carried out as single talk and double talk: absolute category rating of overall quality, plus collection of the most annoying properties
Evaluation of the results
If the results are not satisfying: trained user tests with comparative category rating of different parameter sets on the most annoying properties (with further parameter alteration in parallel), followed by re-evaluation
If the results are satisfying: release for unrestricted serial production
Laboratory or in-situ tests?
In-situ
+ Nothing is more real than reality
+ More interesting for test subjects
- Large effort
- Difficult to control
- Time-intensive
Laboratory
+ Good control
+ Small effort
+ Reproducible conditions
+ Easy control of environmental conditions
- Some effects have to be neglected
- Psychological influence of the laboratory environment on test results
Laboratory tests are much more cost-effective than in-situ tests.
But: how closely can reality be rebuilt in the laboratory?
There should be at least one comparison between laboratory and in-situ tests.
Laboratory test design
Terminal A: fixed network, handheld, specified, silent office environment (e.g. according to ITU-T P.800)
Terminal B: mobile or car kit under test
Reproducible playback of previously recorded environmental noises as a diffuse sound field: silence, car noise, babble noise
Single talk and double talk tests are carried out using different noise levels
Roles within the tests are interchanged
Rating interview with both test subjects
Conversational Tasks
Properties of short conversation test scenarios (SCTs):
Typical conversation tasks, e.g. ordering pizza or booking a flight
Conversation lasts about 2½ min, extended to about 4 min by the following interview
SCTs are judged as natural by test subjects
Formal structure (turns between caller and called person):
Greeting
Enquiry
Question
Precision
Offer
Order
Information / treating of the order
Discussion of open questions
Farewell
[S. Möller, 2000]
Statistical Reliability
Moments of interest are the mean and the error of the mean
The error of the mean is a function of the standard deviation
Worst-case approximation:
The error of the mean is maximised if the highest and lowest possible ratings are each given with a relative frequency of 50%
In this worst case, an error of the mean below 10% of the rating interval width is guaranteed after 30 tests
30 tests of 4 min each result in an overall test duration of about 2 hours
Tests with 3 different background noises at 3 different levels plus a silent environment, over 2 different networks, can be carried out in 40 h (1 week): (3 × 3 + 1) conditions × 2 networks × 2 h = 40 h
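The worst-case bound can be sketched in a few lines. Assuming ratings normalised to an interval of width 1: if half the ratings sit at each scale endpoint, the standard deviation is half the interval width, so the standard error of the mean is 1/(2·√n).

```python
import math

def worst_case_sem_fraction(n: int) -> float:
    """Worst-case standard error of the mean, as a fraction of the
    rating interval width, for n ratings split 50/50 between the two
    scale endpoints (standard deviation = half the interval width)."""
    return 1.0 / (2.0 * math.sqrt(n))

# 30 tests push the worst-case error below 10% of the interval width.
print(worst_case_sem_fraction(30))  # ~0.091
```

With fewer than 25 tests the worst-case bound exceeds 10%, which is why 30 tests are scheduled per condition.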
First Test Presentation
Internal fair at the beginning of May
Not representative, just "testing the test"
Background: babble noise at ~70 dB(A)
Terminal under test:
Known to be too quiet (not known to the test subjects or the experimenter)
Development concluded
Interview only with the mobile terminal user (19 subjects)
Naive user tests with two questions:
What is your opinion of the overall quality of the connection you have just been using?
What were the most annoying properties of the connection you have just been using?
Results given as:
Numbers on a scale from 0 to 120
Predefined answers without technical terms (adding new ones was possible)
Overall Quality
Numbers invisible to test subjects
Average overall rating: 74 ± 4, i.e. (62 ± 3)% of the rating interval width
Start value 60 had the highest relative frequency
To compare the internal scale with standard MOS ratings, a normalisation is required
Rating scale: 0 to 120, labelled Bad / Poor / Fair / Good / Excellent

TS  Rating
 1    38
 2   103
 3    95
 4    60
 5    60
 6    82
 7    81
 8    60
 9    67
10    72
11    90
12    74
13   103
14    73
15    93
16    38
17    60
18    82
19    78
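The summary statistics on this slide can be reproduced from the table; a minimal sketch (ratings transcribed from the table above, error of the mean taken as sample standard deviation over √n):

```python
import math
import statistics

# Overall-quality ratings of the 19 test subjects, transcribed from the table above.
ratings = [38, 103, 95, 60, 60, 82, 81, 60, 67, 72, 90, 74,
           103, 73, 93, 38, 60, 82, 78]

mean = statistics.mean(ratings)
sem = statistics.stdev(ratings) / math.sqrt(len(ratings))  # error of the mean

print(round(mean), round(sem))  # 74 4 — i.e. 74 ± 4, as on the slide
```

Divided by the interval width of 120, this is roughly 62% ± 3% of the rating interval.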
Overall Quality
MOSc: MOS rating intervals with the scale labels in the centre of each interval
Extreme value 5 assigned 5 times (>25%)
Extreme value 1 never assigned
Average overall rating: 3.8 ± 0.2, i.e. (70 ± 5)% of the rating interval width
Rating scale: 0 to 120, labelled Bad / Poor / Fair / Good / Excellent; MOSc on a scale of 1 to 5

TS  Rating  MOSc
 1    38     2
 2   103     5
 3    95     5
 4    60     3
 5    60     3
 6    82     4
 7    81     4
 8    60     3
 9    67     3
10    72     4
11    90     5
12    74     4
13   103     5
14    73     4
15    93     5
16    38     2
17    60     3
18    82     4
19    78     4
Overall Quality
MOSl: MOS rating intervals with the scale labels at the lower end of each interval
The complete range is used
Extreme value 5 assigned twice
Average overall rating: 3.3 ± 0.2, i.e. (58 ± 5)% of the rating interval width
Rating scale: 0 to 120, labelled Bad / Poor / Fair / Good / Excellent; MOSl on a scale of 1 to 5

TS  Rating  MOSl
 1    38     1
 2   103     5
 3    95     4
 4    60     3
 5    60     3
 6    82     4
 7    81     4
 8    60     3
 9    67     3
10    72     3
11    90     4
12    74     3
13   103     5
14    73     3
15    93     4
16    38     1
17    60     3
18    82     4
19    78     3
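The MOSc/MOSl difference can be illustrated by mapping the same 0–120 ratings through two different partitions of the scale. The bin edges below are illustrative assumptions, not a documented part of the study (though the second set happens to reproduce the MOSl column of the table above); the point is that the choice of edges alone shifts the resulting MOS mean.

```python
# Ratings and MOSl values transcribed from the table above.
ratings = [38, 103, 95, 60, 60, 82, 81, 60, 67, 72, 90, 74,
           103, 73, 93, 38, 60, 82, 78]
mosl = [1, 5, 4, 3, 3, 4, 4, 3, 3, 3, 4, 3, 5, 3, 4, 1, 3, 4, 3]

def to_mos(x, upper_edges):
    """Map a 0..120 rating to MOS 1..5 via the given upper bin edges."""
    for mos, upper in enumerate(upper_edges, start=1):
        if x < upper:
            return mos
    return 5

equal_edges = [24, 48, 72, 96]     # five equal intervals (assumed)
shifted_edges = [40, 60, 80, 100]  # alternative partition (assumed);
                                   # reproduces the MOSl column above

assert [to_mos(x, shifted_edges) for x in ratings] == mosl

mean_equal = sum(to_mos(x, equal_edges) for x in ratings) / len(ratings)
mean_shifted = sum(to_mos(x, shifted_edges) for x in ratings) / len(ratings)
print(round(mean_equal, 2), round(mean_shifted, 2))  # 3.63 3.32
```

The gap of about 0.3 MOS between the two conventions is comparable in size to the difference between the MOSc mean of 3.8 and the MOSl mean of 3.3 reported on the slides, which is why the normalisation is called non-trivial.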
Most Annoying Properties
My partner's voice was too quiet
Loud noise during the call
I heard my own voice as an echo
My partner's voice was reverberant
My partner's voice sounded robotic
I heard artificial sounds
*My partner's voice sounded modulated
*My partner's voice was too deep
I heard my partner's voice as an echo
My partner's voice was too loud
*) Properties added during the test
About 50% of the test subjects regarded the partner's voice as too quiet (known beforehand, but not to the subjects or the experimenter)
7 of 8 test subjects regarded the environmental noise as an annoying property
(Bar chart of mentions per property: the two dominant properties were named 9 and 8 times; the remaining properties were named once each.)
Discussion & Outlook
A time-efficient yet intensive subjective test method and a first test were presented.
After ratings from 19 test subjects:
the error of the mean overall quality was about 3% of the rating interval width
the terminal being too quiet was statistically confirmed
Questions and predefined answers have to be chosen very carefully
Normalising scale ratings to MOS is a non-trivial problem
Next steps:
Comparison of laboratory and in-situ tests
Tests of terminals and car kits currently under development