
J Am Acad Audiol 18:78–89 (2007)


*University of Minnesota, Department of Otolaryngology; †Audiology Incorporated, Arden Hills, Minnesota; ‡Arris Systems, Edina, Minnesota; §University of Minnesota, Division of Biostatistics

Robert H. Margolis, University of Minnesota, Department of Otolaryngology, Minneapolis, MN 55455; E-mail: [email protected]; Phone: 612-626-5794

This work was supported by grant #R42 DC005110 from the National Institute on Deafness and Other Communication Disorders.

Disclosure: AMTAS™ and Qualind™ are the intellectual property of Robert H. Margolis and Audiology Incorporated and may become commercial products.

Qualind™: A Method for Assessing the Accuracy of Automated Tests

Robert H. Margolis*†

George L. Saly*‡

Chap Le§

Jessica Laurence*

Abstract

As audiology strives for cost containment, standardization, accuracy of tests, and accountability, greater use of automated tests is likely. Highly skilled audiologists employ quality control factors that contribute to test accuracy, but they are not formally included in test protocols, resulting in a wide range of accuracy, owing to the various skill and experience levels of clinicians. A method that incorporates validated quality indicators may increase accuracy and enhance access to accurate hearing tests. This report describes a quality assessment method that can be applied to any test that (1) requires behavioral or physiologic responses, (2) is associated with factors that correlate with accuracy, and (3) has an available independent measure of the dimension being assessed, including tests of sensory sensitivity, cognitive function, aptitude, academic achievement, and personality. In this report the method is applied to AMTAS™, an automated method for diagnostic pure-tone audiometry.

Key Words: AMTAS™, audiometry, automated audiometry, cross-validation, Qualind™, quality assessment

Abbreviations: AMTAS™ = Automated Method for Testing Auditory Sensitivity; Cn = set of coefficients for QIn; K = a constant; Q = psychometric dimension that is tested; Qi = independent measure of Q; Qm = measure of Q produced by a test; QA = absolute difference between Qm and Qi, |Qm - Qi|; QAavg = average of QA for all Sn; QAavg (in italics) = predicted value of QAavg; QAi-avg = average absolute difference between two independent measures of Q; QIn = set of quality indicators that are used to predict the accuracy of Qm; Qualind™ = method for predicting the accuracy of an automated test result; Sn = set of n stimuli used to test Q; SDi = standard deviation associated with QAi-avg; STTR = Small Business Technology Transfer Program; ZQA = Z score for QAavg based on SDi

Sumario

As audiology strives to contain costs, increasing standardization, test accuracy, accountability, and use of automated tests can be expected. Highly skilled audiologists employ quality control factors that contribute to test accuracy, but these are not formally included in test protocols, resulting in a wide range of accuracy related to the varying skill and experience levels of clinicians. A method that incorporates validated quality indicators may increase accuracy and improve access to accurate hearing tests. This report describes a quality assessment method that can be applied to any test that (1) requires behavioral or physiologic responses, (2) is associated with factors that correlate with accuracy, and (3) has an available independent measure of the dimension being assessed, including tests of sensory sensitivity, cognitive function, aptitude, academic achievement, and personality. In this report the method is applied to AMTAS™, an automated method for diagnostic pure-tone audiometry.

Palabras Clave: AMTAS™, audiometry, automated audiometry, cross-validation, Qualind™, quality assessment

Abreviaturas: AMTAS™ = Automated Method for Testing Auditory Sensitivity; Cn = set of coefficients for QIn; K = a constant; Q = psychometric dimension that is tested; Qi = independent measure of Q; Qm = measure of Q produced by a test; QA = absolute difference between Qm and Qi, |Qm - Qi|; QAavg = average of QA for all Sn; QAavg (in italics) = predicted value of QAavg; QAi-avg = average absolute difference between two independent measures of Q; QIn = set of quality indicators used to predict the accuracy of Qm; Qualind™ = method for predicting the accuracy of an automated test result; Sn = set of n stimuli used to test Q; SDi = standard deviation associated with QAi-avg; STTR = Small Business Technology Transfer Program; ZQA = Z score for QAavg based on SDi

The availability of powerful, low-cost computers and their incorporation into diagnostic test equipment provides an increasing capability to automate diagnostic testing in audiology and other health fields. At the same time, current health-care economics requires greater attention to efficiency and cost containment. In addition to efficiency and lower cost, automation offers other advantages such as standardization, increased accuracy, quality assessment, increased accessibility, and integration into patient databases and electronic medical records.

When hearing tests are conducted by highly skilled audiologists, the clinician employs a variety of quality control factors that contribute to test accuracy. Because these factors are not formally included in hearing test protocols, there is a wide range of quality of results, owing to the various skill and experience levels of clinicians. Automation provides the capability to quantitatively track these factors and formally incorporate them into the test results. Quality assessment of test results is particularly important for automated procedures or other tests in which an expert observer is not present during the test to detect problems that may impact the accuracy of results.

In considering methods for assessing the accuracy of test results, it is important to distinguish between tests that produce binary results (pass-fail, disease-no disease, normal-abnormal) and those that produce continuous measurements, such as pure-tone audiometry, blood pressure, visual acuity, and intelligence tests. The latter are perhaps better referred to as "measurements," "the assignment of numerals to objects or events according to rules" (Stevens, 1951). Measurements, then, can become the basis for tests, rules for making decisions based on observations. From this point of view, pure-tone audiometry is a measurement. It becomes a test when rules are employed to make clinical decisions based on the results, such as the categorization of hearing loss severity or type. However, it is common in audiology, medicine, and psychology to refer to measurements that are made for the purpose of making a diagnosis or evaluating a treatment as "tests." Methods for evaluating the accuracy of binary tests are described in the biostatistics literature (Zhou et al, 2002; Pepe, 2003).

This report describes a quality assessment method (Qualind™) that can be applied to any measurement that meets the following requirements: (1) the procedure provides a continuous measurement of the dimension being tested; (2) the measurements result from behavioral or physiologic responses from the individual being tested; (3) quantifiable subject characteristics and behaviors exist that correlate with test accuracy; and (4) an independent measure of the dimension being assessed is available. Tests of this type include (1) tests of sensory sensitivity to physical stimuli including tests of hearing, vision, tactile sensation, and olfaction; (2) tests of cognitive function; (3) aptitude tests; (4) academic achievement tests; and (5) personality tests.

Qualind is based on a class of statistical techniques known as "response surface methodology." These techniques "seek to relate a response, or output variable to the levels of a number of predictors, or input variables, that affect it" (Box and Draper, 1987, p. 1). Although not specifically addressed in the audiology literature, it is widely recognized that certain observable variables are predictive of the accuracy of pure-tone test results. The ubiquitous "reliability" judgment found on audiogram forms assumes that a skilled audiologist can observe certain patient behaviors and characteristics that predict accuracy. If this is the case, then perhaps these variables can be quantitatively tracked and exploited for accuracy assessment, which is amenable to statistical validation. Once derived, the predictive equation can be validated by independent data sets, by the method of cross-validation (Schneider and Moore, 2000).

The development of Qualind for assessing the accuracy of an automated method for pure-tone audiometry, based on response surface methodology, is described. The automated audiometric method, AMTAS™, is described below. In Experiment 1, a predictive equation was derived that calculates the estimated average difference between AMTAS and manual thresholds. In Experiments 2 and 3, the accuracy of the predictive equation was cross-validated against two independent data sets.

AMTAS™: AUTOMATED METHOD FOR TESTING AUDITORY SENSITIVITY

AMTAS (U.S. Patent #6,496,585) is an automated method for obtaining a diagnostic pure-tone audiogram, including air- and bone-conduction thresholds with masking in the nontest ear and with quantitative quality indicators. The development and testing of the method was supported by the National Institutes of Health Small Business Technology Transfer (STTR) Program. The results of a comparison between AMTAS and manual testing will be reported separately.

AMTAS is a single-interval, Yes-No, psychophysical procedure with feedback. Catch trials are presented randomly throughout the test to allow a quantitative measurement of false-alarm rate. A "quality check" is performed after each threshold determination by presenting a stimulus with a higher intensity than the threshold level. A "No" response to that stimulus is a "Quality Check Fail." Masking is always presented to the nontest ear by a proprietary method that estimates the appropriate masker level from the signal level and interaural attenuation of the transducers. At the conclusion of the test, AMTAS identifies "masking alerts," thresholds for which overmasking or undermasking may have occurred.
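The false-alarm and quality-check-fail computations described above can be sketched in a few lines. This is a minimal illustration only, not the AMTAS implementation; the trial-record layout and function names are invented for the example.

```python
# Sketch: computing two quality indicators from a trial-by-trial log.
# Record layout and names are hypothetical, not the AMTAS data format.

def false_alarm_rate(trials):
    """Fraction of catch trials (no stimulus presented) answered 'Yes'."""
    catch = [t for t in trials if not t["stimulus_present"]]
    if not catch:
        return 0.0
    return sum(t["response_yes"] for t in catch) / len(catch)

def quality_check_fail_rate(checks, n_thresholds):
    """'No' responses to suprathreshold quality-check stimuli, per measured threshold."""
    fails = sum(1 for c in checks if not c["response_yes"])
    return fails / n_thresholds

trials = [
    {"stimulus_present": True,  "response_yes": True},
    {"stimulus_present": False, "response_yes": False},  # catch trial, correct rejection
    {"stimulus_present": False, "response_yes": True},   # catch trial, false alarm
    {"stimulus_present": True,  "response_yes": False},
]
print(false_alarm_rate(trials))  # 1 false alarm / 2 catch trials = 0.5
```

Both indicators are simple proportions over the logged trials, which is what makes them cheap to track automatically during a test.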

The method requires that the air- and bone-conduction transducers be placed at the beginning of the test, without the requirement that transducers be moved during the evaluation. The bone vibrator is placed on the forehead. Prototype nonoccluding earphones are used so that bone-conduction thresholds can be determined without contamination by the occlusion effect. The development of a nonoccluding earphone for audiometry is the subject of a related STTR project.

The development of the quality assessment method was based on a comparison of audiograms obtained by AMTAS and by manual testing performed by expert audiologists.

QUALIND™: A METHOD OF ASSESSING THE ACCURACY OF AN AUTOMATED TEST

Qualind (patent pending) is a method for determining the accuracy of a test result from quantifiable factors (subject characteristics and behaviors) that are correlated with test accuracy. The accuracy of the prediction is dependent upon the strength of the association between the factors and the test results. The process of development and validation of Qualind for a specific test comprises the following steps.

1. Identify a psychometric quantity, "Q," that can be measured by a properly constructed test.

2. Develop a test of Q that requires behavioral or physiologic responses to a set of stimuli Sn such that the aggregate of the responses to Sn provides a quantitative measure Qm of Q. Sn could be physical signals, images, or questions that require behavioral or physiologic responses in accordance with well-defined instructions and procedures.

3. Identify n measurable behaviors, QIn, that may be related to the quality of the subjects' responses to Sn.

4. Identify an independent measure of Q, Qi, against which Qm can be compared.

5. Obtain a data set by testing a sample of a defined population, providing values of Qm, Qi, and QIn. For each Sn calculate the absolute difference QA = |Qi - Qm|, which is regarded as a measure of the accuracy of Qm.

6. Calculate QAavg, the average QA for all of Sn. QAavg represents a global measure of test accuracy for the subject. QAavg could be calculated on subsets of Sn as well to obtain accuracy estimations of various components of a test. In audiometry, for example, separate accuracy predictions could be derived for air conduction versus bone conduction, right ear versus left ear, high frequencies versus low frequencies, and so on.

7. Derive a predictive equation to estimate QAavg from QIn. The predictive equation may take the form

QAavg = f(QIn) (1)

The italics indicate a calculated estimate, to distinguish it from QAavg, which is a measured average difference between Qm and Qi. One way to obtain f(QIn) is to perform a multiple regression of QIn on QAavg. The strength of the regression determines the accuracy of QAavg.

8. If desired, QAavg can be converted to categorical data such as a scale consisting of descriptive terms like "good," "fair," and "poor." These categories can be based on the variance associated with QAavg for a certain population.
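Steps 5-7 above can be sketched numerically. The quality-indicator values and measured accuracies below are invented toy data; only the structure of the computation follows the steps, with a least-squares fit standing in for the multiple regression of step 7.

```python
import numpy as np

# Step 5: per-stimulus accuracy QA = |Qi - Qm|; step 6: its average, QAavg.
def qa_avg(qm, qi):
    return float(np.mean(np.abs(np.asarray(qi) - np.asarray(qm))))

# Toy data set: 6 subjects x 3 quality indicators QI_n (values invented).
QI = np.array([
    [0.02, 1.0, 0.1],
    [0.10, 3.0, 0.4],
    [0.00, 0.5, 0.0],
    [0.25, 4.0, 0.6],
    [0.05, 2.0, 0.2],
    [0.15, 3.5, 0.5],
])
qa_measured = np.array([3.1, 6.0, 2.5, 9.2, 4.0, 7.1])  # measured QAavg (dB)

# Step 7: fit QAavg ~ sum(C_n * QI_n) + K by ordinary least squares.
X = np.column_stack([QI, np.ones(len(QI))])  # intercept column appended
coef, *_ = np.linalg.lstsq(X, qa_measured, rcond=None)
C, K = coef[:-1], coef[-1]

qa_predicted = X @ coef  # the calculated estimate of QAavg for each subject
r = np.corrcoef(qa_measured, qa_predicted)[0, 1]  # strength of the regression
```

With real data, the strength of this fit (r) is what determines how much trust to place in the predicted accuracies, which is the point of step 7.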

In the case of AMTAS, the psychometric quantity (Q) that is measured is hearing sensitivity as expressed by the pure-tone audiogram, including air- and bone-conduction thresholds for standard audiometric frequencies. The independent measure Qi is the audiogram obtained manually by an expert audiologist. For each audiometric threshold, QA = |Qi - Qm| is obtained by determining the absolute difference in auditory thresholds obtained by AMTAS and by the manual method. The average absolute difference for all test stimuli employed in the test is QAavg.

QI candidates can be any quantifiable subject characteristic or behavior that may be related to test accuracy. Potential AMTAS quality indicators are shown in Table 1.

EXPERIMENT 1: DERIVATION AND VALIDATION OF PREDICTIVE EQUATION

Subjects and Methods

Data were collected at three sites chosen to sample a wide range of settings, patient demographics, and hearing loss characteristics. Subjects were recruited from the audiology clinics at each site. A patient was eligible for the study if the clinician judged that the patient was capable of understanding instructions that are typically provided for pure-tone audiometry. No constraints were placed on the degree or type of hearing loss. Immittance testing was not considered. The test sites, subject samples, and hearing loss characteristics are shown in Table 2.

Each site was equipped identically with commercial audiometers (Grason-Stadler GSI-61) and personal computers. Manual testing was performed with TDH-50 earphones calibrated in accordance with ANSI S3.6-1996 (American National Standards Institute, 1996). For automated testing, a prototype, nonoccluding, circumaural earphone was used. Reference Equivalent Sound Pressure Levels for these earphones were derived by the method described in Annex D, Par. D.4 of the standard.

For AMTAS testing, the computer controlled the audiometer, acquired a trial-by-trial history of the entire test, recorded thresholds, tracked the quality indicators, and transmitted the data files to a central server via a secure internet communication protocol. For manual testing, the computer logged the characteristics of each stimulus that was presented, including frequency, intensity, ear, mode (air or bone), and time of presentation, and stored threshold and masking values.


Table 1. QI Definitions

Patient Age§: Self-explanatory
Patient Gender§: Self-explanatory
Masker Alert Rate: The number of thresholds for which the masking noise presented to the nontest ear may have been either too low or too high, divided by the number of measured thresholds
Time per Trial: The elapsed time averaged across all observation intervals
Average Number of Trials for Threshold§: The total number of observation intervals divided by the number of measured thresholds
Elapsed Time§: The total elapsed time for the test
False Alarm Rate: The number of false alarms (trials in which the subject reported the presence of a stimulus when no stimulus was presented) divided by the total number of catch trials (trials in which there was no stimulus)
Average Test-Retest Difference: The average difference in threshold measures obtained for stimuli that were tested twice
Quality Check Fail Rate: The total number of occurrences of quality check fails (failure to respond to stimuli presented above threshold) divided by the number of measured thresholds
Air-Bone Gap >50 dB: Number of air-bone gaps (difference between thresholds obtained for air- and bone-conducted stimuli for each frequency/ear combination) that exceed 50 dB
Air-Bone Gap <-10 dB: Number of air-bone gaps (difference between thresholds obtained for air- and bone-conducted stimuli for each frequency/ear combination) that are less than -10 dB
Average Air-Bone Gap: The average difference between air-conduction threshold and bone-conduction threshold

§Omitted from final analysis.

Table 2. Experiment 1 Subject Characteristics

University of Minnesota (n = 54; 32 male, 22 female)
  Age (yrs.): mean 52 (SD 18), range 16 to 87
  Pure-tone average§, better ear (dB): mean 29 (SD 19), range -3 to 70
  Pure-tone average§, poorer ear (dB): mean 43 (SD 23), range 1 to 104
  Interaural difference§ (dB): mean 11 (SD 13), range 0 to 69

University of Utah (n = 21; 11 male, 10 female)
  Age (yrs.): mean 47 (SD 22), range 12 to 76
  Pure-tone average§, better ear (dB): mean 29 (SD 20), range -5 to 55
  Pure-tone average§, poorer ear (dB): mean 35 (SD 21), range 0 to 64
  Interaural difference§ (dB): mean 6 (SD 7), range 0 to 33

VA Mountain Home (n = 45; 44 male, 1 female)
  Age (yrs.): mean 72 (SD 9), range 48 to 93
  Pure-tone average§, better ear (dB): mean 50 (SD 15), range 19 to 94
  Pure-tone average§, poorer ear (dB): mean 60 (SD 18), range 30 to 104
  Interaural difference§ (dB): mean 10 (SD 12), range 0 to 48

All sites (n = 120; 87 male, 33 female)
  Age (yrs.): mean 59 (SD 18), range 12 to 93
  Pure-tone average§, better ear (dB): mean 37 (SD 20), range -5 to 94
  Pure-tone average§, poorer ear (dB): mean 48 (SD 23), range 0 to 104
  Interaural difference§ (dB): mean 11 (SD 13), range 0 to 69

§0.5, 1.0, 2.0, and 4.0 kHz


Data were collected by highly experienced audiologists at each test site. Testers were instructed to perform manual audiometry using the clinical methods that they normally use when testing patients. Each tester was validated against another audiologist (RHM) on six subjects at each site. The six subjects were chosen randomly and tested on site by the two testers in immediate succession. Inter-tester reliability for the three sites ranged from 0.95 to 0.97 (Table 3). The inter-tester validation study produced measurements of the differences in thresholds measured by the two audiologists. These data were used to establish accuracy categories for Qualind predictions.

Each subject was tested by AMTAS and by manual audiometry in immediate succession. The order of the tests was randomized. When manual testing was conducted second, the tester was not aware of the AMTAS results.

An equation that calculates the predicted average absolute difference between AMTAS and manual thresholds, QAavg, was derived by performing a multiple regression between QAavg and the quality indicators that were selected from Table 1. QI measures that did not contribute to the strength of the regression were discarded. For the remaining factors, the multiple regression returned a set of coefficients (Cn) and an intercept (K). The resulting formula is

QAavg = |Qi - Qm|avg = f(QIn) = Σ(Cn·QIn) + K (2)

Note that QAavg relies only on QIn and not Qi or Qm. It is an estimate of the accuracy of the test result (relative to results obtained by an expert professional). Its accuracy is determined by the strength of the multiple regression.
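In code, equation (2) is just a weighted sum of the quality indicators plus an intercept. The coefficient and indicator values below are invented placeholders (the fitted Cn and K are not given here), so this only illustrates the shape of the computation:

```python
# Equation (2) sketch: predicted QAavg = sum(C_n * QI_n) + K.
# Coefficient values are hypothetical, not the fitted AMTAS regression.
C = {"false_alarm_rate": 8.0, "quality_check_fail_rate": 12.0, "masker_alert_rate": 5.0}
K = 2.0  # intercept (hypothetical)

def predict_qa_avg(qi):
    """Prediction uses only the quality indicators QI_n, never Qi or Qm."""
    return sum(C[name] * value for name, value in qi.items()) + K

subject = {"false_alarm_rate": 0.10, "quality_check_fail_rate": 0.05, "masker_alert_rate": 0.20}
print(round(predict_qa_avg(subject), 2))  # 0.8 + 0.6 + 1.0 + 2.0 = 4.4
```

The key property noted in the text is visible here: the prediction is computed from behavior during the test alone, with no reference to the manual audiogram.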

Results

A multiple regression of the quality indicators selected from Table 1 and QAavg revealed that the following four did not contribute to the strength of the regression: age, gender, average number of trials, and elapsed time. These were discarded and the multiple regression repeated with the remaining eight. The resulting regression coefficient was 0.84, indicating that the eight quality indicators account for 71% of the variance. This result suggests that QAavg is a good predictor of QAavg. However, the value of QAavg is not a very useful measure for judging the accuracy of an individual audiogram. An additional step is necessary to provide the clinician with a useful quality assessment indicator.

One way to interpret QAavg is to compare it to differences obtained between audiograms obtained by two experienced audiologists on the same patient. That is, two independent measures of Q are obtained. The inter-tester validation study provided a basis for this type of comparison. That study provided a measurement of the average absolute threshold difference for two audiologists:

QAi-avg = |Qi1 - Qi2|avg (3)

The value of QAi-avg from that study was 4.62 dB with a standard deviation SDi of 4.13. Using these results, each QAavg was converted to a Z score that is the distance in standard deviation units from the mean average absolute difference for two audiologists. The Z score is calculated by

ZQA = (QAavg - QAi-avg)/SDi (4)

These Z scores were then categorized into three groups, Good, Fair, and Poor, using the rules given in Table 4. This process resulted in the occurrences of Good, Fair, and Poor results in the data set of 120 individuals with hearing loss shown in Table 5.
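Equations (3)-(4) and the category rules can be sketched with the values reported above (QAi-avg = 4.62 dB, SDi = 4.13 dB). The Good and Poor cutoffs (ZQA ≤ 1, ZQA ≥ 2) are inferred from the "Fair" band 1 < ZQA < 2 in Table 4; treat this as a reading of the table rather than a definitive implementation.

```python
QA_I_AVG = 4.62  # mean inter-audiologist absolute difference (dB), from the text
SD_I = 4.13      # its standard deviation (dB), from the text

def z_qa(qa_avg_predicted):
    """Equation (4): Z score of a predicted QAavg relative to inter-tester agreement."""
    return (qa_avg_predicted - QA_I_AVG) / SD_I

def qualind_category(qa_avg_predicted):
    """Table 4 rules: Good (Z <= 1), Fair (1 < Z < 2), Poor (Z >= 2)."""
    z = z_qa(qa_avg_predicted)
    if z <= 1:
        return "Good"
    if z < 2:
        return "Fair"
    return "Poor"

for qa in (4.0, 10.0, 14.0):
    print(qa, qualind_category(qa))  # 4.0 Good / 10.0 Fair / 14.0 Poor
```

For example, a predicted QAavg of 10 dB sits (10 - 4.62)/4.13 ≈ 1.3 standard deviations above typical inter-audiologist disagreement, hence "Fair."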

An examination of the audiograms revealed a high degree of face validity to the categorical data. That is, those in the Good category were generally free of obvious errors that would be evident to skilled audiologists, and those in the Poor category generally were associated with obvious errors such as unlikely audiometric configurations, high false alarm rates either in the aggregate or for specific stimuli, and theoretically impossible results such as air-bone gaps >50 dB or <-10 dB.


Table 3. Inter-tester Correlation Coefficients for Each Test Site

Site                      Inter-tester Correlation
University of Minnesota   0.95
University of Utah        0.97
VA Mountain Home          0.97


Agreement between categories based on predicted and actual accuracy (i.e., QAavg vs. QAavg) is shown in Table 6. Overall, for 78% of the cases, the predicted category was identical to the actual category (sum of the diagonal values, predicted = actual, divided by the total). Sixty-nine percent of the cases were predicted to have "Good" accuracy, and of those, 92% had measured accuracy in the "Good" range. Agreement for the other two categories was not as good. Of 25 cases predicted to be in the "Fair" category, ten (40%) had actual accuracy in the "Fair" range. Of twelve cases predicted to have "Poor" accuracy, seven (58%) had actual accuracy in the "Poor" range.
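The overall agreement figure can be reproduced from the cell counts reported in Table 6; the sketch below just re-runs that arithmetic (the dictionary layout is invented for illustration):

```python
# Reproducing the Table 6 summary arithmetic from its reported counts.
confusion = {  # (actual, predicted): n
    ("Good", "Good"): 76, ("Good", "Fair"): 11, ("Good", "Poor"): 2,
    ("Fair", "Good"): 6,  ("Fair", "Fair"): 10, ("Fair", "Poor"): 3,
    ("Poor", "Good"): 1,  ("Poor", "Fair"): 4,  ("Poor", "Poor"): 7,
}
total = sum(confusion.values())                                  # 120 cases
agreement = sum(n for (a, p), n in confusion.items() if a == p)  # 93 on the diagonal
print(round(100 * agreement / total))  # 78 (% with predicted category = actual)
```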

Perhaps the most serious error is one in which accuracy is predicted to be "Good" and is actually "Poor." Only one error of this type occurred. The reverse case is one in which accuracy is predicted to be "Poor" but is actually "Good." Two errors of this type occurred. These may require unnecessary retesting but are not likely to result in a mischaracterization of the hearing loss if the retesting is done by an experienced audiologist.

QAavg is plotted against QAavg in Figure 1. The categories are shown by the shaded regions. The correlation between QAavg and QAavg is by definition identical to the multiple regression coefficient of 0.84. The figure illustrates that the seven most inaccurate audiograms (the boxed data points in Figure 1) were correctly categorized in the Poor category. Most of these cases were subjects who did not understand the instructions regarding masking and voted "Yes" when the masking was audible, whether or not they heard the tone. Although these are clearly invalid audiograms, they were left in the analysis because it is important to test the power of Qualind to detect such cases.

Table 4. Category Definitions

Category   ZQA Score
Good       ZQA ≤ 1
Fair       1 < ZQA < 2
Poor       ZQA ≥ 2

Table 5. Number of Occurrences of Each QAavg Category

Category   Number of Occurrences   %
Good       83                      69
Fair       25                      21
Poor       12                      10

Table 6. Predicted versus Actual Category Agreement between AMTAS and Manual Audiograms for the Subjects in Experiment 1 (each cell: n; % of column; % of total)

Actual Category   Predicted Good   Predicted Fair   Predicted Poor   TOTAL
Good              76; 92%; 63%     11; 44%; 9%      2; 17%; 2%       89 (74%)
Fair              6; 7%; 5%        10; 40%; 8%      3; 25%; 3%       19 (16%)
Poor              1; 1%; 1%        4; 16%; 3%       7; 58%; 6%       12 (10%)
TOTAL n (%)       83 (69)          25 (21)          12 (10)          120

EXPERIMENT 2: CROSS-VALIDATION AGAINST AN INDEPENDENT DATA SET

The method of cross-validation provides an evaluation of a predictive equation by determining the accuracy of predictions for data sets other than the one from which the predictive equation was derived (Schneider and Moore, 2000). In Experiment 2 a small group of older subjects was tested by methods identical to those of Experiment 1 for this purpose.

Subjects and Methods

Eight adult subjects (six males, two females) ranging in age from 64 to 85 years were tested at one test site (University of Minnesota). Patient characteristics are summarized in Table 7. Hearing losses varied over a wide range, indicated by the pure-tone averages shown in the table. The test procedure was identical to that of Experiment 1. It was not possible to derive a new predictive equation from this data set because of the small number of subjects. Instead, predicted absolute differences were calculated using the regression formula derived from Experiment 1.

Results

A comparison of predicted and actual categories of average absolute differences between AMTAS and manual audiograms for the subjects in Experiment 2 is shown in Table 8. For seven of eight cases (87%), the predicted and actual categories were the same.

A comparison of the predicted and measured average absolute differences between AMTAS and manual audiometry is shown in Figure 2. The figure demonstrates that the one case for which the predicted and measured differences fell into different categories was a near miss (see circled data point in Figure 2).

Accuracy of Automated Tests/Margolis et al

85

Figure 1. Predicted average absolute differences between AMTAS and manual thresholds plotted against the measured average absolute difference. Average absolute differences are the absolute values of the differences between AMTAS and manual thresholds averaged across all thresholds (air and bone, both ears) for each subject. The predicted average absolute differences are calculated from the regression formula produced by the process described in the text. The Good, Fair, and Poor regions are based on the average differences between manual thresholds obtained by two audiologists for the same subjects. (See text for full explanation.) The seven cases with the poorest accuracy (enclosed in the square) were correctly predicted to have the poorest accuracy.

Table 7. Experiment 2 Subject Characteristics (n = 8) and Results

        Age       Pure-Tone Average (dB),        Pure-Tone Average (dB),       Interaural Difference in   Difference between Predicted and
        (years)   Better Ear (0.5, 1, 2, 4 kHz)  Worse Ear (0.5, 1, 2, 4 kHz)  Pure-Tone Average (dB)     Measured Absolute Difference (dB)
Mean    70.2      25                             35                            10                         1.2
SD      7.2       12                             21                            11                         0.9
Range   64–85     9–44                           13–76                         1–35                       0.9–3.0

The correlation coefficient between the measured and predicted average absolute differences was 0.59, indicating that the quality indicators account for 35% of the variance in the measured average absolute differences (QAavg). The lower correlation coefficient (0.59), compared to the multiple regression coefficient obtained in Experiment 1 (0.84), may have resulted from the narrow range of the differences in Experiment 2. Unlike the heterogeneous distribution of accuracy in Experiment 1, these subjects were quite homogeneous, as indicated by the substantial difference in the standard deviations for average absolute difference (4.7 in Experiment 1 vs. 2.4 in Experiment 2). All of the results fell into the "Good" and "Fair" categories, with none in the "Poor" category. Because the correlation statistic is highly dependent on the range of the data (Games and Klare, 1967, p. 369), it is expected that the correlation in Experiment 2 would be lower than the multiple regression in Experiment 1.
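The variance-accounted-for figure follows directly from squaring the correlation coefficient (0.59² ≈ 0.35). A minimal sketch of the computation, with a pure-Python Pearson r included for completeness (the paired values passed to it are illustrative only):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# r = 0.59 implies roughly 35% of variance accounted for:
r = 0.59
variance_explained = r ** 2   # 0.3481, i.e., about 35%

# Sanity check on perfectly linear (illustrative) data: r should be 1.0.
r_linear = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
```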

To examine the possibility that the lower correlation in Experiment 2 was influenced by the narrower range of results relative to Experiment 1, a subset of subjects from Experiment 1 was selected such that the range of values of QAavg was identical to the range in Experiment 2. The correlation coefficient for this data set is 0.45, lower than the 0.59 obtained in Experiment 2. This result suggests that, over that range, the relationship between measured and predicted QAavg is stronger in Experiment 2 than in Experiment 1.

The average absolute difference between the measured and predicted average absolute differences was 1.9 dB (range = 0.4–2.4 dB), suggesting a high degree of predictability of QA from the quality indicators.

The categorical agreement (87%), the correlation between measured and predicted QAavg, and the small mean difference between predicted and measured accuracy indicate that the regression equation produced in Experiment 1 provides a high degree of predictability for the results of Experiment 2. In addition, the slopes and intercepts of the best-fit regression lines (see equations in Figures 1 and 2) are similar, suggesting very similar relationships between measured and predicted QAavg in the two subject groups. The results of this cross-validation analysis suggest that a regression equation obtained from one data set of AMTAS and manual testing is useful for predicting the accuracy of audiograms in another data set, provided the subjects are reasonably similar.

Journal of the American Academy of Audiology/Volume 18, Number 1, 2007

86

Table 8. Predicted versus Actual Category Agreement between AMTAS and Manual Audiograms for the Subjects in Experiment 2

                                  Predicted Category
Actual Category       Good              Fair              TOTAL
Good        n         5                 1                 6
            % column  100               33                75 (of total)
            % total   63                13
Fair        n         0                 2                 2
            % column  0                 67                25 (of total)
            % total   0                 25
TOTAL n (%)           5 (63)            3 (37)            8 (100)

Figure 2. Predicted and measured average absolute differences between AMTAS and manual thresholds for the eight listeners tested in Experiment 2. Predictions were based on the regression equation derived in Experiment 1. The "GOOD" and "FAIR" labels indicate the ranges determined from average differences between manual testing performed by two audiologists for the same subjects. The horizontally aligned "GOOD" and "FAIR" labels indicate the predicted categories. The vertically aligned labels indicate the categories based on measured differences.

EXPERIMENT 3: CROSS-VALIDATION FOR INSERT EARPHONES

Subjects and Methods

To determine the extent to which the predictive equation derived in Experiment 1 might extend to variations in methodology, a cross-validation study was conducted at the Minnesota site for a group of subjects tested with a different earphone. In this study, AMTAS thresholds were measured with insert earphones (Etymotic Research ER-3A), and manual thresholds were measured with supra-aural earphones (Telephonics TDH-50). Each was calibrated by standard calibration procedures (American National Standards Institute, 1996). The ER-3A earphone was coupled to the ear with either the small or adult standard foam tips. The appropriate size was selected for each subject, the foam tip was compressed, and it was inserted into the ear canal such that the lateral surface of the tip was at the level of the entrance to the ear canal. With this insertion depth, an occlusion effect is expected that affects low-frequency bone-conduction thresholds (Dean and Martin, 2000). During AMTAS testing, both ears were occluded. During manual audiometry, only the nontest ear was occluded during bone-conduction testing. Therefore, a difference in low-frequency bone-conduction thresholds is expected that will affect the overall agreement between AMTAS and manual thresholds. Subject characteristics are summarized in Table 9.

Two predictions were derived for each subject. First, predicted QAavg was calculated using the coefficients produced by the multiple regression performed in Experiment 1; this is the cross-validation approach. In addition, a new set of coefficients was obtained from a multiple regression on the data from Experiment 3.
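The two prediction strategies (carrying over coefficients from a prior data set versus refitting on the new data) can be sketched with a simple least-squares fit. For brevity this sketch uses a single quality indicator rather than the full multiple regression, and every numeric value is hypothetical.

```python
# Sketch of cross-validation vs. refit. (1) Apply a slope/intercept fixed
# from an earlier fit; (2) refit the regression on the new data set.
# All data and coefficient values are hypothetical.

def fit_least_squares(xs, ys):
    """Closed-form simple linear regression: returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

# Hypothetical quality-indicator values and measured QAavg for a new group:
qi = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
qa_measured = [1.2, 1.9, 2.1, 3.2, 3.6, 4.1]

# (1) Cross-validation: slope/intercept carried over from a prior fit (hypothetical).
prior_slope, prior_intercept = 1.5, 0.1
qa_pred_cv = [prior_slope * x + prior_intercept for x in qi]

# (2) New regression: refit on this data set.
slope, intercept = fit_least_squares(qi, qa_measured)
qa_pred_new = [slope * x + intercept for x in qi]

sse_cv = sum((p - m) ** 2 for p, m in zip(qa_pred_cv, qa_measured))
sse_new = sum((p - m) ** 2 for p, m in zip(qa_pred_new, qa_measured))
```

Because least squares minimizes the sum of squared residuals over all candidate lines, the refit's error can never exceed that of the carried-over equation on the same data, which mirrors the finding that the new regression performed slightly better than the cross-validation predictions.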

Results

The differences between measured and predicted QAavg for the two predictive equations are shown in the last two columns of Table 9. In both cases the average differences were small, with the new regression equation providing smaller differences and smaller ranges.

The performance of the two equations for assigning audiograms to categories is shown in Table 10. Again, the new predictive equation performed slightly better, with 33 cases (92%) correctly classified as "Good" compared to 31 (86%) for the equation from Experiment 1. It is not surprising that the regression based on the data from Experiment 3 would be more predictive than a predictive equation derived from a different data set.
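The categorical tabulations in Tables 8 and 10 amount to a small confusion-matrix computation: count predicted-versus-actual category pairs and report the fraction correctly classified. A sketch with illustrative labels (not the study data):

```python
from collections import Counter

# Sketch of the categorical-agreement tabulation: tally (actual, predicted)
# category pairs and compute the fraction in agreement.

def category_agreement(predicted, actual):
    """Return (Counter of (actual, predicted) pairs, fraction in agreement)."""
    counts = Counter(zip(actual, predicted))
    agree = sum(n for (a, p), n in counts.items() if a == p)
    return counts, agree / len(actual)

# Illustrative eight-subject example in which one "Good" case is called "Fair":
predicted = ["Good", "Good", "Fair", "Good", "Fair", "Good", "Good", "Good"]
actual    = ["Good", "Good", "Fair", "Good", "Good", "Good", "Good", "Good"]

counts, fraction = category_agreement(predicted, actual)   # 7 of 8 agree
```

The `counts` mapping corresponds directly to the cells of the agreement tables, and `fraction` to the percent-correctly-classified figure.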

The superior performance of the new predictive equation is clearer in Figures 3 and 4, which show the relationship between measured and predicted QAavg for each predictive equation. The multiple correlation coefficient for the new regression is 0.68 (which is equivalent to the correlation between measured and predicted QAavg). The correlation based on the Experiment 1 predictions is 0.32, and the plot in Figure 4 shows substantially more scatter for the cross-validation than for the new predictive equation (Figure 3). In spite of the larger sample size, the cross-validation analysis indicated a weaker predictive relationship compared to the results of Experiment 2.

Accuracy of Automated Tests/Margolis et al

87

Table 9. Experiment 3 Subject Characteristics (n = 36) and Results

        Age       Pure-Tone Average (dB),        Pure-Tone Average (dB),       Absolute Interaural Difference   Cross-Validation: Absolute Difference   New Regression: Absolute Difference
        (years)   Better Ear (0.5, 1, 2, 4 kHz)  Worse Ear (0.5, 1, 2, 4 kHz)  in Pure-Tone Average (dB)        between Predicted and Measured          between Predicted and Measured
                                                                                                                Average Absolute Difference (dB)^a      Average Absolute Difference (dB)^b
Mean    61.3      39                             49                            10                               1.5                                     1.0
SD      18.2      22                             22                            14                               1.2                                     0.8
Range   13 to 86  -4 to 75                       3 to 89                       0 to 62                          0.1 to 4.0                              0 to 2.8

^a From the coefficients derived in Experiment 1. ^b Predictions from the new multiple regression.

Contributing to the poorer performance of the cross-validation predictions may be the occlusion effect produced by the insert earphones. When subjects with conductive hearing losses were excluded, bone-conduction thresholds with insert earphones averaged 9 dB and 6 dB lower (better) at 250 and 500 Hz, respectively, compared to the supra-aural earphones. These differences did not occur in Experiments 1 and 2, in which a nonoccluding earphone was used for AMTAS testing.

Table 10. Predicted versus Actual Category Agreement between AMTAS and Manual Audiograms for the Subjects in Experiment 3

                                          Predicted Category
                          Cross-Validation                     New Predictive Equation
Actual Category           Good        Fair        TOTAL        Good        Fair        TOTAL
Good        n             31          3           34           33          1           34
            % column      94          100         94 (of total) 94         100         94 (of total)
            % total       86          8                         92          3
Fair        n             2           0           2            2           0           2
            % column      6           0           6 (of total)  6           0           6 (of total)
            % total       6           0                         6           0
TOTAL n (%)               33 (92)     3 (8)       36 (100)     35 (97)     1 (3)       36 (100)

Figure 3. Predicted and measured average absolute differences between AMTAS and manual thresholds for the 36 listeners tested in Experiment 3. Predicted average absolute differences (QAavg) were calculated from a multiple regression formula based on this data set. The "GOOD" and "FAIR" labels indicate the ranges determined from average differences between manual testing performed by two audiologists for the same subjects. The horizontally aligned "GOOD" and "FAIR" labels indicate the predicted categories. The vertically aligned labels indicate the categories based on measured differences.

Figure 4. Predicted and measured average absolute differences between AMTAS and manual thresholds for the 36 listeners tested in Experiment 3. Predicted average absolute differences (QAavg) were calculated from the multiple regression derived in Experiment 1. The "GOOD" and "FAIR" labels indicate the ranges determined from average differences between manual testing performed by two audiologists for the same subjects. The horizontally aligned "GOOD" and "FAIR" labels indicate the predicted categories. The vertically aligned labels indicate the categories based on measured differences.

The results of this analysis suggest that the strength of the predictive equation derived in Experiment 1 is compromised when an earphone with different characteristics is employed. However, the regression derived from the Experiment 3 data set appeared to have similar predictive power. Thus, it may be necessary to derive predictive equations that are specific to differences in instrumentation that may differentially affect results, such as predicting unoccluded bone-conduction results from measurements obtained with an occluding earphone.

SUMMARY AND CONCLUSION

A method was developed, based on response surface methodology, for assessing the accuracy of a diagnostic audiogram obtained by a computer-controlled, automated procedure. A predictive equation was derived from a multiple regression of a set of quantitative quality indicators on a measure of test accuracy, defined as the average absolute difference between automated and manually tested thresholds. For a large subject sample (n = 120), a strong relationship was found between predicted and measured accuracy. The predictive equation was cross-validated against two independent data sets. The results suggest that the predictions retain their accuracy for independent data sets if similar subjects and methods are employed, and that new predictive equations may be required for significant variations in test methodology. The method may be useful for automated test procedures when skilled professionals are not available to provide quality assurance.

Acknowledgments. Lisa Hunter and Richard Wilson have been invaluable colleagues in this project. M. Victor Berrett, Allison Kohtz, and Deborah Weakley collected the data, and their expertise and professionalism were critical to the study. We are grateful to Aaron R. Thornton and two anonymous reviewers for their critical reading of the manuscript.

REFERENCES

American National Standards Institute. (1996) American National Standard Specification for Audiometers (ANSI S3.6-1996). New York: Acoustical Society of America.

Box GEP, Draper NR. (1987) Empirical Model-Building and Response Surfaces. New York: John Wiley.

Dean MS, Martin FN. (2000) Insert earphone depth and the occlusion effect. J Am Acad Audiol 9:131–134.

Games PA, Klare GR. (1967) Elementary Statistics: Data Analysis for the Behavioral Sciences. New York: McGraw-Hill.

Pepe MS. (2003) The Statistical Evaluation of Medical Tests for Classification and Prediction. New York: Oxford University Press.

Schneider J, Moore A. (2000) A Locally Weighted Learning Tutorial Using Vizier 1.0. Tech. Report CMU-RI-TR-00-18, Robotics Institute, Carnegie Mellon University. http://www.cs.cmu.edu/~schneide/tut5/node42.html.

Stevens SS. (1951) Mathematics, measurements, and psychophysics. In: Handbook of Experimental Psychology. New York: John Wiley.

Zhou XH, Obuchowski NA, McClish DK. (2002) Statistical Methods in Diagnostic Medicine. New York: John Wiley.
