Towards Consistent Assessment of Audio Quality of Systems with Different Available Bandwidth

Yu Jiao, Sławek Zieliński, Francis Rumsey
Institute of Sound Recording, University of Surrey

ITU-T Workshop on "From Speech to Audio: bandwidth extension, binaural perception"
Lannion, France, 10-12 September 2008

Outline

• New challenges in quality assessment
• Attributes to assess
• Standard scale
• Summary
• Demo I: an example of attribute identification and assessment
• Demo II: assessing envelopment, an example of direct anchoring

Where should I go?

[Figure: signposts labelled "Speech" and "Audio"]


Universal Method?

Is there a need for the development of a new, more universal standard for audio quality assessment, regardless of application or bandwidth?

New Challenges in Speech and Audio Quality Assessment


Towards a Universal Method

Two challenges:

• Which perceptual attributes to use?
• How to calibrate the scale so that results obtained from assessment of audio quality in different applications can be sensibly compared?



Which Attributes to Use?

Most commonly used attributes:

• Speech Quality (ACR, DCR, CCR) – ITU-T Recommendations
• Basic Audio Quality (continuous scale), Stereophonic Image Quality, Front Image Quality, Impression of Surround Quality – ITU-R Recommendations

What about other attributes? (see next slide)

Which Attributes to Use?

[Diagram: a map of candidate attributes, spanning judgements and preferences (Preference, Basic Audio Quality, Timbral Quality, Spatial Quality) and lower-level attributes (Loudness, Dynamics, Envelopment, Localisation error, Timbral Accuracy, Front Image Accuracy, Spatial Accuracy, Distortions, Noise presence, Width change, Distance change). This set is not exhaustive.]


Which Attributes to Use?

• It is relatively easy to agree upon and standardise high-level attributes.
• It is more difficult to standardise low-level attributes.
• The usefulness of low-level attributes is application-specific.
• However, a pool of standardised attributes, together with associated anchors and a scale system, may be of help.


Example of a Spatial Attributes Pool – Rumsey (2002)

One of the most systematic attribute pools for spatial audio assessment.



Do we need a standard audio quality scale?

• It may help to reduce bias in listening tests.
• It is essential for calibration of objective models.
• It may help to compare results across different applications with different available bandwidth.

Known biases include the range-equalising bias, contraction bias, centring bias, stimulus spacing bias, etc. (Poulton, 1989; Zielinski, Rumsey and Bech, 2008).


Range Equalising Bias

Range Equalising Bias = "Rubber Ruler Effect"

[Figure: whether the stimulus set contains, e.g., narrow-band, wide-band, or full-band stimuli, listeners stretch their responses to span the whole scale.]
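To make the effect concrete, here is a minimal, hypothetical sketch (not from the presentation) of the rubber ruler effect: listeners stretch whatever stimulus range they are given onto the full response scale, so the same stimulus earns very different scores in different test contexts. All quality values are invented for illustration.

```python
# Hypothetical sketch of the "rubber ruler" (range-equalising) effect:
# listeners map the range of stimuli they actually hear onto the full
# 0-100 response scale, so identical stimuli score differently in
# different test contexts. All numbers are invented for illustration.

def rubber_ruler_score(true_quality, context_qualities, scale_max=100.0):
    """Score a stimulus relative to the range presented in this test."""
    lo, hi = min(context_qualities), max(context_qualities)
    if hi == lo:
        return scale_max / 2.0
    return scale_max * (true_quality - lo) / (hi - lo)

# The same mid-quality stimulus (true quality 40 on a hidden scale) ...
narrow_band_test = [10, 25, 40]   # ... tested among narrow-band stimuli
full_band_test = [40, 70, 95]     # ... tested among full-band stimuli

print(rubber_ruler_score(40, narrow_band_test))  # 100.0 (best in its context)
print(rubber_ruler_score(40, full_band_test))    # 0.0 (worst in its context)
```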


Range Equalising Bias – Data taken from Zielinski et al. (2003, 2005)

• Means and 95% confidence intervals
• Systematic upward shift
• Maximum difference ≈ 20
• Absolute scores?

Conclusion: do not put confidence in the scale labels.


Range Equalising Bias – Zielinski et al. (2007)

• Means and 95% confidence intervals
• Systematic upward shift
• Maximum difference = 13
• Absolute scores?


Range Equalising Bias – Cheer (2008)

Test based on ITU-T P.800, using the five-category Absolute Category Rating scale (Excellent, Good, Fair, Poor, Bad).

Questions:

• Is "Absolute Category Rating" (ACR) really absolute?
• Do we need a better calibrated scale?
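For context, here is a minimal sketch of how ACR votes are conventionally reduced to a Mean Opinion Score under ITU-T P.800: each category maps to an integer (Bad = 1 up to Excellent = 5) and the votes are averaged. The votes below are invented; the point is that the resulting "absolute" number still inherits any range-equalising bias in the underlying ratings.

```python
# Minimal sketch: reducing ACR category votes to a Mean Opinion Score
# (MOS) as in ITU-T P.800. The votes below are invented for illustration.

ACR_SCALE = {"Bad": 1, "Poor": 2, "Fair": 3, "Good": 4, "Excellent": 5}

def mean_opinion_score(votes):
    """Average the integer values of the ACR categories given as strings."""
    values = [ACR_SCALE[v] for v in votes]
    return sum(values) / len(values)

votes = ["Good", "Fair", "Good", "Excellent", "Fair"]
print(mean_opinion_score(votes))  # 3.8
```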


Towards Consistent Assessment across Narrow-, Wide- and Full-Band Applications

Use a standard scale. If possible, use physical units that are familiar to the listeners (Poulton, 1989); a small sketch of this approach follows the list below:

• For example, the width of the frontal image can be assessed as an angle expressed in degrees.
• The distance between the listener and the apparent source could be assessed in metres.
• Using an open-ended ratio scale is biased (Narens, 1996).
• Using verbal anchors along the scale is ineffective (see the previous slides).
• Using auditory anchors is effective but difficult to implement.
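As a minimal, hypothetical sketch of the physical-unit approach (the function and numbers are mine, not from the presentation): if listeners report the perceived left and right edges of the frontal image as azimuth angles, the image width follows directly in degrees and can be compared across tests without any scale normalisation.

```python
# Hypothetical sketch: rating frontal image width in a physical unit
# (degrees of azimuth) instead of on an arbitrary 0-100 scale, so that
# scores from different tests are directly comparable. Azimuths are
# measured from straight ahead (0 deg), negative to the left; all
# numbers are invented for illustration.

def frontal_image_width(left_edge_deg, right_edge_deg):
    """Width of the perceived frontal image as an angle in degrees."""
    return right_edge_deg - left_edge_deg

# A listener reports the image extending from -30 deg to +30 deg:
print(frontal_image_width(-30.0, 30.0))  # 60.0 degrees, in any test context
```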

Three Types of Auditory Anchors

1. Direct Anchors
Listeners are instructed how to use the scale relative to two or more auditory anchors.
• Help to define a "frame of reference"
• Examples of partial use: ITU-R BS.1116 and ITU-R BS.1534 (MUSHRA)

2. Indirect Anchors
Anchors are included in the set of stimuli under assessment. Listeners are not instructed how to assess them and are unaware of their purpose.
• Effective bias diagnostic tool
• May help to define a "frame of reference" if used properly
• Examples of use: ITU-T P.800 (MNRU reference quality impairments); also the 3.5 kHz anchor in ITU-R BS.1534 (MUSHRA)

3. Background Anchors
Anchors are presented only during the familiarisation phase prior to a listening test and are not included in the test proper. Listeners are not instructed as to how these anchors relate to the scale.
• They do not calibrate the scale but "calibrate" listeners
• Used very rarely


Example of a Scale with Direct Anchors (A & B) – Conetta (2008)

Only two anchors are used in this example, but more than two can also be used.
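The following is a minimal, hypothetical sketch of how direct anchoring can work numerically; it is not Conetta's actual interface. Each listener's raw ratings are re-expressed relative to that listener's ratings of the two anchors, so listeners who use different portions of the raw scale still yield comparable scores. The anchor positions and ratings are invented.

```python
# Hypothetical sketch of direct anchoring: re-express each listener's raw
# ratings relative to their own ratings of two known anchors (B = low
# anchor, A = high anchor), pinning B to 20 and A to 80 on a shared
# 0-100 scale. Anchor positions and ratings are invented for illustration.

def anchor_score(raw, raw_b, raw_a, point_b=20.0, point_a=80.0):
    """Map a raw rating onto the shared scale defined by the two anchors."""
    return point_b + (raw - raw_b) * (point_a - point_b) / (raw_a - raw_b)

# Listener 1 uses the top of the raw scale, listener 2 the middle, yet a
# stimulus rated halfway between the anchors lands at 50 for both:
print(anchor_score(raw=80, raw_b=60, raw_a=100))  # 50.0
print(anchor_score(raw=45, raw_b=30, raw_a=60))   # 50.0
```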


Challenges of Direct Anchoring

• How to choose the anchors in terms of levels of quality?
• How to make them similar to the stimuli under assessment in terms of perceptual properties?

A hypothetical standard quality scale could span, from top to bottom:

1. Anchor 1: full-bandwidth, full surround sound quality?
2. Anchor 2: CD quality?
3. Anchor 3: FM radio quality?
4. Anchor 4: narrow-bandwidth telephone quality?
5. Anchor 5: cascaded codecs? Drop-outs? Modulation noise? Hard clipping? etc.


Example of Diagnostics with Indirect Anchors

• Indirect anchors are a useful diagnostic tool to check for bias.
• Scores "float" along the scale.
• Do not put confidence in the labels.
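A minimal, hypothetical sketch of the diagnostic idea (not from the presentation): because indirect anchors are physically identical across tests, any difference in their mean scores between two tests estimates how far the scale has "floated". All scores are invented for illustration.

```python
# Hypothetical sketch: using hidden (indirect) anchors to diagnose scale
# drift between two listening tests. The anchors are identical signals in
# both tests, so any difference in their mean scores estimates how far
# the scale has "floated". All scores are invented for illustration.

def mean(xs):
    return sum(xs) / len(xs)

def scale_drift(anchor_scores_test1, anchor_scores_test2):
    """Estimated offset of test 2's scale relative to test 1's."""
    return mean(anchor_scores_test2) - mean(anchor_scores_test1)

# The same 3.5 kHz low-pass anchor scored in a full-band test (test 1)
# and in a narrow-band test (test 2):
drift = scale_drift([22, 25, 20, 23], [41, 38, 44, 37])
print(drift)  # 17.5: test 2's scores have floated upward by ~18 points
```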



Summary

• There is a need for the development of a new, more universal standard for audio quality assessment, regardless of application or bandwidth.
• More attributes are needed to reveal the nature of quality degradation.
• Comparing audio quality across different applications, e.g. with different audio bandwidth, is problematic due to potentially inconsistent use of a scale (an ill-defined frame of reference).
• A standard scale is needed.
• A direct anchoring technique could be used, but it is difficult to identify suitable auditory anchors.


Demo I: An Example of Attribute Identification & Assessment – Jiao et al. (2007)

Attribute identification and selection for a newly developed codec:

[Diagram: Basic Audio Quality is decomposed into Timbral Distortion and Spatial Distortion; within Spatial Distortion, Dynamic Spatial Distortion (DSD) is characterised by its Level and its Dynamicity.]


The resulting prediction model for Basic Audio Quality:

BAQ = -0.668 × TD - 0.350 × L_DSD - 0.179 × D_DSD + 86.45

where TD is the Timbral Distortion, L_DSD the Level of DSD, and D_DSD the Dynamicity of DSD.
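A minimal sketch of evaluating this model in code. The assumption that the three attribute scores lie on the same 0-100 grading scale as BAQ, and the example inputs, are mine:

```python
# Sketch: evaluating the BAQ prediction model from Jiao et al. (2007).
# td = Timbral Distortion, l_dsd = Level of DSD, d_dsd = Dynamicity of
# DSD. The example attribute scores below are invented for illustration.

def predict_baq(td, l_dsd, d_dsd):
    """Predict Basic Audio Quality from the three distortion attributes."""
    return -0.668 * td - 0.350 * l_dsd - 0.179 * d_dsd + 86.45

# A stimulus with moderate timbral distortion and mild spatial distortion:
print(predict_baq(td=20.0, l_dsd=10.0, d_dsd=5.0))  # 68.695
```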

Demo II: Assessing Envelopment, an Example of Direct Anchoring – George et al. (2008)

Interface allowing assessment of envelopment arising from surround recordings.

References

Cheer, J. (2008) The Investigation of Potential Biases in Speech Quality Assessment. Final Year Technical Project, Institute of Sound Recording, University of Surrey, UK.

Conetta, R. (2007) Scaling and Predicting Spatial Attributes of Reproduced Sound Using an Artificial Listener. MPhil/PhD Upgrade Report, Institute of Sound Recording, University of Surrey.

George, S., Zielinski, S., Rumsey, F., and Bech, S. (2008) Evaluating the Sensation of Envelopment Arising from 5-Channel Surround Sound Recordings. Presented at the 124th AES Convention, Paper 7298, Amsterdam, The Netherlands.

Jiao, Y., Zielinski, S., and Rumsey, F. (2007) Adaptive Karhunen-Loeve Transform for Multichannel Audio. Presented at the 123rd AES Convention, Paper 7298, New York.

Narens, L. (1996) A Theory of Ratio Magnitude Estimation. Journal of Mathematical Psychology, vol. 40, pp. 109-129.

Poulton, E.C. (1989) Bias in Quantifying Judgments. Lawrence Erlbaum, London.

Rumsey, F. (2002) Spatial Quality Evaluation for Reproduced Sound: Terminology, Meaning, and a Scene-Based Paradigm. J. Audio Eng. Soc., vol. 50, no. 9, pp. 651-666.

Zielinski, S., Rumsey, F., and Bech, S. (2003) Effects of Down-Mix Algorithms on Quality of Surround Audio. J. Audio Eng. Soc., vol. 51, no. 9, pp. 780-798.

Zielinski, S., Rumsey, F., Kassier, R., and Bech, S. (2005) Comparison of Basic Audio Quality and Timbral and Spatial Fidelity Changes Caused by Limitation of Bandwidth and by Down-Mix Algorithms in 5.1 Surround Audio Systems. J. Audio Eng. Soc., vol. 53, no. 3, pp. 174-192.

Zielinski, S., Hardisty, P., Hummersone, Ch., and Rumsey, F. (2007) Potential Biases in MUSHRA Listening Tests. Presented at the 123rd AES Convention, Paper 7179, New York, NY, USA, 5-8 October.

Zielinski, S., Rumsey, F., and Bech, S. (2008) On Some Biases Encountered in Modern Audio Quality Listening Tests – A Review. J. Audio Eng. Soc., vol. 56, no. 6, pp. 427-451.

Acknowledgment

Some of this work was carried out with the financial support of the Engineering and Physical Sciences Research Council, UK (Grant EP/C527100/1).