
Fusing Continuous-Valued Medical Labels Using a Bayesian Model

TINGTING ZHU,1 NIC DUNKLEY,1 JOACHIM BEHAR,1 DAVID A. CLIFTON,1 and GARI D. CLIFFORD1,2,3

1Department of Engineering Science, Institute of Biomedical Engineering, University of Oxford, Oxford, UK; 2Departments of Biomedical Informatics, Emory University, Atlanta, GA, USA; and 3Department of Biomedical Engineering, Georgia Institute of Technology, Atlanta, GA, USA

(Received 16 March 2015; accepted 21 May 2015; published online 3 June 2015)

Associate Editor Nathalie Virag oversaw the review of this article.

Abstract—With the rapid increase in volume of time series medical data available through wearable devices, there is a need to employ automated algorithms to label data. Examples of labels include interventions, changes in activity (e.g. sleep), and changes in physiology (e.g. arrhythmias). However, automated algorithms tend to be unreliable, resulting in lower quality care. Expert annotations are scarce, expensive, and prone to significant inter- and intra-observer variance. To address these problems, a Bayesian Continuous-valued Label Aggregator (BCLA) is proposed to provide a reliable estimation of label aggregation while accurately inferring the precision and bias of each algorithm. The BCLA was applied to QT interval (pro-arrhythmic indicator) estimation from the electrocardiogram using labels from the 2006 PhysioNet/Computing in Cardiology Challenge database. It was compared to the mean, median, and a previously proposed Expectation Maximization (EM) label aggregation approaches. While accurately predicting each labelling algorithm's bias and precision, the root-mean-square error of the BCLA was 11.78 ± 0.63 ms, significantly outperforming the best Challenge entry (15.37 ± 2.13 ms) as well as the EM, mean, and median voting strategies (14.76 ± 0.52, 17.61 ± 0.55, and 14.43 ± 0.57 ms respectively, with p < 0.0001). The BCLA could therefore provide accurate estimation for medical continuous-valued label tasks in an unsupervised manner even when the ground truth is not available.

Keywords—Crowdsourcing, Bayes methods, Time series analysis, Electrocardiography.

INTRODUCTION

With human annotation of data, significant intra- and inter-observer disagreements exist.7,21 Expert labelling (or 'reading' or 'annotating') of medical data by physicians or clinicians often involves multiple over-reads, particularly when an individual is under-confident of the diagnosis. However, experts are scarce and expensive and can create significant delays in labelling or diagnoses. Although medical training includes periodic assessment of general competency, specific assessments for reading medical data are difficult to perform regularly. This data processing pipeline is further complicated by the ambiguous definition of an 'expert'. There is no empirical method for measuring level of expertise, even though label accuracy can vary greatly depending on the expert's experience. As a result, there exists a great deal of inter- and intra-expert variability among physicians depending on their experiences and level of training.7,13,14,17,18,21

An effective probabilistic approach to aggregating expert labels, which used an Expectation Maximization (EM) algorithm, was first proposed by Dawid and Skene.6 They applied the EM algorithm to classify the unknown true states of health (i.e. fit to undergo a general anaesthetic) of 45 patients given the decisions made by five anaesthetists. Raykar et al.16 extended this approach to measure the diameter of a suspicious lesion on a medical image using a regression model. Their assumption was that the lesion diameter estimates from different expert annotators were Gaussian-distributed, noisy versions of the actual true diameter. The precision of each expert annotator and the underlying ground truth were jointly modelled in an iterative process using EM. Welinder and Perona22 proposed a Bayesian EM framework for continuous-valued labels, which explicitly modelled only the precision of each annotator to account for their varying skill levels, without modelling the bias of annotators. A more specialised Bayesian model of bias was proposed by Welinder et al.,23 but for binary classification tasks. However, their model cannot account for more complex tasks such as continuous-valued labelling.

Address correspondence to Tingting Zhu, Department of Engineering Science, Institute of Biomedical Engineering, University of Oxford, Oxford, UK. Electronic mail: [email protected]

Annals of Biomedical Engineering, Vol. 43, No. 12, December 2015 (© 2015) pp. 2892–2902
DOI: 10.1007/s10439-015-1344-1
0090-6964/15/1200-2892/0 © 2015 Biomedical Engineering Society


The methodology proposed in the work presented in this article improves on these prior algorithms16,22,23 by introducing the novelty of combining continuous-valued annotations to infer the underlying ground truth, while jointly modelling the annotator's bias and precision in a unified model using a Bayesian treatment.

Aggregating annotations (i.e. fusing multiple annotations for each piece of data from annotators with varying levels of expertise) from humans and/or automated algorithms may provide a more accurate ground truth and reduce annotator inter- and intra-variability. However, most annotators are likely to have some bias regardless of their expertise.22,25 Bias is defined as the inverse of accuracy: it measures the average difference between the estimation and the true value, and it is annotator dependent. An example of bias is demonstrated in Fig. 1 in the context of electrocardiogram labelling. Recently, Warby et al.20 studied how to combine non-expert annotators' labels of sleep spindle locations, a special pattern in human electroencephalography, through fusing annotations provided by non-experts. In that work, although a naïve majority vote was used to aggregate the labels of the locations, they demonstrated that non-expert annotations were comparable to those provided by the experts (i.e. the by-subject spindle density correlation was 0.815). Our proposed framework, in contrast, is a statistical approach that models the precision and bias of each annotator, which we hypothesise would provide a superior estimation of the ground truth as determined by a collection of experts.

In contrast to previous works, this article proposes a Bayesian framework for aggregating multiple continuous-valued annotations in medical data labelling, which takes into account the precision and bias of the individual annotators. Moreover, we propose a generalised form which can be extended to incorporate contextual features of the physiological signal, so that we can adjust the weighting of each label based on the estimated bias and variance of the individual for different types of signal.

FIGURE 1. An example of bias in the context of electrocardiogram (ECG) QT interval labelling. (a) The probability density function of the QT intervals for the reference annotation (supplied by the human experts) and annotator A (such as an automated algorithm). (b) A plot of QT intervals across different recordings: the diagonal (grey) line indicates a perfect match of QT intervals between the reference and annotator A; the 'o' indicates the original QT intervals provided by annotator A; the 'x' indicates the bias-corrected QT intervals of annotator A, which fit closely to the diagonal line. (c) An example of bias that occurs in an ECG record for labelling the QT interval. The reference QT interval on a single beat starts at the beginning of the Q wave and ends at the end of the T wave (denoted as Q and T), and the biased trend from annotator A is demonstrated as Tb.


To our knowledge, the proposed model for estimating continuous-valued labels in an unsupervised manner is novel in the medical domain.

MATERIALS AND METHODS

Bayesian Continuous-Valued Label Aggregator (BCLA)

Suppose that there are N records of physiological time series data labelled by R annotators. Let $\mathcal{D} = \left[\mathbf{x}_i^T, y_i^{j=1}, \cdots, y_i^{j=R}\right]_{i=1}^{N}$, where $\mathbf{x}_i$ is a column feature vector for the ith record containing d features (i.e. the design matrix is $X = [\mathbf{x}_1^T, \ldots, \mathbf{x}_N^T]$), $y_i^j$ corresponds to the annotation provided by the jth annotator for the ith record, and $z_i$ represents the unknown underlying ground truth (the true time or duration of an event, for example). The graphical representation of the proposed approach, the Bayesian Continuous-valued Label Aggregator (BCLA), is shown in Fig. 2.

In this model, it is assumed that $y_i^j$ is a noisy version of $z_i$, with a Gaussian distribution $\mathcal{N}(y_i^j \mid z_i, (\sigma^j)^2)$ (see footnote 1). Here $\sigma^j$ is the standard deviation of the jth annotator and represents his or her variability in annotation around $z_i$. Furthermore, the bias of each annotator can be modelled as an additional term, $\phi^j$. The probability of estimating $y_i^j$ can be written as:

$$P[y_i^j \mid z_i, (\sigma^j)^2] = \mathcal{N}(y_i^j \mid z_i + \phi^j, 1/\lambda^j), \qquad (1)$$

where $(\sigma^j)^2$ is replaced with $1/\lambda^j$. $\lambda^j$ is the precision of the jth annotator, defined as the estimated inverse variance of annotator j. Note that $\lambda^j$ and $\phi^j$ are considered to be constants for the jth annotator, i.e. all annotators are assumed to have consistent but usually different performances throughout the records. Furthermore, the bias of annotator j, $\phi^j$, is assumed to be drawn from a Gaussian distribution with mean $\mu_\phi$ and variance $1/\alpha_\phi$:

$$P[\phi^j \mid \mu_\phi, \alpha_\phi] = \mathcal{N}(\phi^j \mid \mu_\phi, 1/\alpha_\phi). \qquad (2)$$

Although the biases of the annotators might be derived from other distributions, they are likely to be data set dependent. In the absence of any knowledge of the underlying distribution of biases, they are assumed to be drawn from a Gaussian distribution. Furthermore, the ground truth, $z_i$, can be assumed to be drawn from a Gaussian distribution with mean $\alpha$ and variance $1/\beta$. The probability of $z_i$ is defined as follows:

$$P[z_i \mid \alpha, \beta] = \mathcal{N}(z_i \mid \alpha, 1/\beta), \qquad (3)$$

where $\alpha$ can be expressed as a linear regression function $f(\mathbf{w}, \mathbf{x})$ with an intercept, with $\mathbf{w}$ being the coefficients of the regression.16,26 The intercept models the overall offset predicted in the regression, which is different from the annotator-specific bias in the proposed model. Under the assumption that records are independent, the likelihood of the parameters $\theta = \{\mathbf{w}, \lambda, \phi, \alpha_\phi, \beta, z_i\}$ for a given data set $\mathcal{D}$ can be formulated as:

$$P[\mathcal{D} \mid \theta] = \prod_{i=1}^{N} P[y_i^1, \cdots, y_i^R \mid \mathbf{x}_i, \theta]. \qquad (4)$$

It is assumed that $y_i^1, \cdots, y_i^R$ are conditionally independent given the feature $\mathbf{x}_i$ (i.e. each annotator works independently to provide annotations). This may or may not be necessarily true, especially in cases where the annotations are generated by algorithms, some of which may be variations of the same approach. Nevertheless, this assumption was made to simplify the model and the subsequent derivation of the likelihood. The posterior of the parameters $\theta$ for a given data set $\mathcal{D}$ can be written using Bayes' theorem as (see detailed description in Fig. 2):

$$\begin{aligned} P[\theta \mid \mathcal{D}] &\propto P[\mathcal{D} \mid \theta] \times P[\theta] \\ &= \Gamma(\alpha_\phi \mid \lambda_\alpha, \vartheta_\alpha)\left[\prod_{j=1}^{R} \mathcal{N}(\phi^j \mid \mu_\phi, 1/\alpha_\phi)\,\Gamma(\lambda^j \mid \lambda_\lambda, \vartheta_\lambda)\right] \Gamma(\beta \mid \lambda_\beta, \vartheta_\beta)\left[\prod_{i=1}^{N} \mathcal{N}(z_i \mid \alpha, 1/\beta)\prod_{j=1}^{R} \mathcal{N}(y_i^j \mid z_i + \phi^j, 1/\lambda^j)\right], \end{aligned} \qquad (5)$$

FIGURE 2. Graphical representation of the BCLA model: $y_i^j$ corresponds to the annotation provided by the jth annotator for the ith record, and it is modelled by $z_i$ (the unknown underlying ground truth), $\phi^j$ (bias), and $\lambda^j$ (precision). Furthermore, $z_i$ is drawn from a Gaussian distribution with mean $\alpha$ and variance $1/\beta$, where $\alpha$ can be a function of the feature vector $\mathbf{x}_i$. $\phi^j$ is modelled by a Gaussian distribution with mean $\mu_\phi$ and variance $1/\alpha_\phi$. The $\beta$, $\lambda^j$, and $\alpha_\phi$ are drawn from Gamma distributions (denoted as $\Gamma$) with parameters $(\lambda_\beta, \vartheta_\beta)$, $(\lambda_\lambda, \vartheta_\lambda)$, and $(\lambda_\alpha, \vartheta_\alpha)$ respectively.

1 The motivation for this model comes from the Central Limit Theorem. Given the assumption that the annotators are independent and identically distributed, their labels will converge to a Gaussian distribution. In the absence of prior knowledge, this assumption allows for a robust and generalizable model for the given data.


where $\Gamma$ denotes a Gamma distribution, defined as $\Gamma(z \mid \lambda, \vartheta) = \frac{1}{\Gamma(\lambda)\vartheta^{\lambda}} z^{\lambda-1}\exp(-z/\vartheta)$, where $\lambda$ is the shape of the distribution and $\vartheta$ is the scale of the distribution. The Gamma distribution is commonly used to model positive continuous values. It is therefore assumed that precision values, such as $\beta$, $\lambda^j$, and $\alpha_\phi$, were drawn from Gamma distributions, with parameters $(\lambda_\beta, \vartheta_\beta)$, $(\lambda_\lambda, \vartheta_\lambda)$, and $(\lambda_\alpha, \vartheta_\alpha)$ respectively.

The Maximum a Posteriori Approach

The estimation of $\theta$ can be solved using the maximum a posteriori (MAP) approach, which maximises the log-posterior of the parameters, i.e. $\arg\max_{\theta}\{\log P[\theta \mid \mathcal{D}]\}$. The log-posterior can be rewritten as:

$$\begin{aligned} \log P[\theta \mid \mathcal{D}] = &-\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{R}\left[\log(2\pi/\lambda^j) + (y_i^j - \phi^j - z_i)^2\lambda^j\right] \\ &-\frac{1}{2}\sum_{j=1}^{R}\left[\log(2\pi/\alpha_\phi) + (\phi^j - \mu_\phi)^2\alpha_\phi\right] -\frac{1}{2}\sum_{i=1}^{N}\left[\log(2\pi/\beta) + (z_i - \mathbf{x}_i^T\mathbf{w})^2\beta\right] \\ &+\sum_{j=1}^{R}\left[(\lambda_\lambda - 1)\log\lambda^j - \log\!\left(\Gamma(\lambda_\lambda)\vartheta_\lambda^{\lambda_\lambda}\right) - \frac{\lambda^j}{\vartheta_\lambda}\right] +\left[(\lambda_\alpha - 1)\log\alpha_\phi - \log\!\left(\Gamma(\lambda_\alpha)\vartheta_\alpha^{\lambda_\alpha}\right) - \frac{\alpha_\phi}{\vartheta_\alpha}\right] \\ &+\left[(\lambda_\beta - 1)\log\beta - \log\!\left(\Gamma(\lambda_\beta)\vartheta_\beta^{\lambda_\beta}\right) - \frac{\beta}{\vartheta_\beta}\right]. \end{aligned} \qquad (6)$$
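For readers who prefer a computational view of Eq. (6), the following minimal sketch (ours, not part of the original work) evaluates the unnormalised log-posterior with NumPy/SciPy; the function names, the array shapes, and the assumption that every annotator labels every record are illustrative choices only.

```python
import numpy as np
from scipy.special import gammaln

def log_gamma_pdf(x, shape, scale):
    """Log density of a Gamma distribution in the shape/scale parameterisation."""
    return (shape - 1) * np.log(x) - gammaln(shape) - shape * np.log(scale) - x / scale

def log_posterior(y, X, z, w, phi, lam, alpha_phi, beta,
                  mu_phi, lam_lambda, th_lambda, lam_alpha, th_alpha,
                  lam_beta, th_beta):
    """Unnormalised log-posterior of Eq. (6); y is (N, R), X is (N, d)."""
    resid = y - phi[None, :] - z[:, None]                    # y_i^j - phi^j - z_i
    ll = -0.5 * np.sum(np.log(2 * np.pi / lam)[None, :] + resid ** 2 * lam[None, :])
    ll -= 0.5 * np.sum(np.log(2 * np.pi / alpha_phi) + (phi - mu_phi) ** 2 * alpha_phi)
    ll -= 0.5 * np.sum(np.log(2 * np.pi / beta) + (z - X @ w) ** 2 * beta)
    ll += np.sum(log_gamma_pdf(lam, lam_lambda, th_lambda))  # Gamma priors on the precisions
    ll += log_gamma_pdf(alpha_phi, lam_alpha, th_alpha)
    ll += log_gamma_pdf(beta, lam_beta, th_beta)
    return ll
```

Monitoring this quantity across iterations gives a simple convergence check alongside the parameter updates derived next.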

The parameters in $\theta$ can be derived by equating the gradient of the log-posterior to zero with respect to each parameter, as follows:

$$\frac{1}{\lambda^j} = \frac{1}{N + 2(\lambda_\lambda - 1)}\left[\sum_{i=1}^{N}(y_i^j - \phi^j - z_i)^2 + \frac{2}{\vartheta_\lambda}\right], \qquad (7)$$

$$\mathbf{w} = \left(\sum_{i=1}^{N}\mathbf{x}_i\mathbf{x}_i^T\right)^{-1}\sum_{i=1}^{N}\mathbf{x}_i z_i, \qquad (8)$$

$$\phi^j = \frac{1}{N + \frac{\alpha_\phi}{\lambda^j}}\left[\sum_{i=1}^{N}(y_i^j - z_i) + \frac{\mu_\phi\alpha_\phi}{\lambda^j}\right], \qquad (9)$$

$$\frac{1}{\alpha_\phi} = \frac{1}{R + 2(\lambda_\alpha - 1)}\left[\sum_{j=1}^{R}(\phi^j - \mu_\phi)^2 + \frac{2}{\vartheta_\alpha}\right], \qquad (10)$$

$$z_i = \frac{\sum_{j=1}^{R}\left[(y_i^j - \phi^j)\lambda^j\right] + (\mathbf{x}_i^T\mathbf{w})\beta}{\sum_{j=1}^{R}\lambda^j + \beta}, \qquad (11)$$

$$\frac{1}{\beta} = \frac{1}{N + 2(\lambda_\beta - 1)}\left[\sum_{i=1}^{N}(z_i - \mathbf{x}_i^T\mathbf{w})^2 + \frac{2}{\vartheta_\beta}\right]. \qquad (12)$$

This MAP problem can be solved using the EM algorithm in a two-step iterative process:

(i) The E-step estimates the expected true annotations for all records, $\mathbf{z}$, as a weighted sum of the provided annotations, using Eq. (11).

(ii) The M-step, based on the current estimate of $\mathbf{z}$ and the given data set $\mathcal{D}$, updates the model parameters $\mathbf{w}$, $\phi$, $\alpha_\phi$, $\beta$, and $\lambda$ using Eqs. (8), (9), (10), (12), and (7) accordingly, in sequential order until convergence, which is described next (a sketch of the resulting update loop is given below).
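To make the two-step procedure concrete, the sketch below implements the updates of Eqs. (7)–(12) in NumPy under two simplifying assumptions: every annotator labels every record (annotations missing from some records, as in the Challenge data, would need masking), and the precision is capped by a user-supplied upper bound lam_max standing in for the GEVD-derived threshold of the next section. The function name bcla_em, the initialisation choices, and the fixed iteration count are ours.

```python
import numpy as np

def bcla_em(y, X, mu_phi, lam_lambda, th_lambda, lam_alpha, th_alpha,
            lam_beta, th_beta, lam_max=np.inf, n_iter=2500):
    """MAP-EM updates for the BCLA (Eqs. 7-12). y: (N, R) annotations, X: (N, d) features."""
    N, R = y.shape
    z = np.median(y, axis=1)                        # crude initial ground-truth estimate
    phi = np.zeros(R)                               # annotator biases
    lam = 1.0 / np.var(y - z[:, None], axis=0)      # annotator precisions
    alpha_phi, beta = 1.0 / 25.0 ** 2, 1.0 / 40.0 ** 2
    for _ in range(n_iter):
        # M-step: update w, phi, alpha_phi, beta and lam (Eqs. 8, 9, 10, 12, 7)
        w = np.linalg.solve(X.T @ X, X.T @ z)                                       # Eq. (8)
        phi = (np.sum(y - z[:, None], axis=0) + mu_phi * alpha_phi / lam) \
              / (N + alpha_phi / lam)                                               # Eq. (9)
        alpha_phi = (R + 2 * (lam_alpha - 1)) \
                    / (np.sum((phi - mu_phi) ** 2) + 2 / th_alpha)                  # Eq. (10)
        beta = (N + 2 * (lam_beta - 1)) / (np.sum((z - X @ w) ** 2) + 2 / th_beta)  # Eq. (12)
        lam = (N + 2 * (lam_lambda - 1)) \
              / (np.sum((y - phi[None, :] - z[:, None]) ** 2, axis=0) + 2 / th_lambda)  # Eq. (7)
        lam = np.minimum(lam, lam_max)              # cap the precisions (see next section)
        # E-step: fuse the bias-corrected annotations into z (Eq. 11)
        z = (np.sum((y - phi[None, :]) * lam[None, :], axis=1) + (X @ w) * beta) \
            / (np.sum(lam) + beta)
    return z, phi, 1.0 / np.sqrt(lam)               # fused labels, biases, annotator sigmas
```

With an intercept-only design matrix and the hyperparameter values later listed in Table 2, a call might look like bcla_em(y, np.ones((len(y), 1)), mu_phi=10, lam_lambda=4, th_lambda=0.003, lam_alpha=3, th_alpha=0.0005, lam_beta=3, th_beta=0.0002).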

Convergence Criteria for the MAP-EM Approach

When solving a MAP-EM algorithm one may encounter a convergence issue, particularly when estimating a large number of parameters. The estimate of the precision may approach infinity because the inferred annotations favour the annotator with the highest precision in each EM update step while maximising the likelihood. Instead of incorporating an additional parameter for a regularisation penalty, which would increase the complexity of the model, the generalized extreme value distribution (GEVD) can be used to model the maxima of the precision distribution, denoted as $\lambda_m$, in order to restrict the upper bound of the precision values and guarantee convergence of the MAP algorithm. The probability density function of the GEVD for $\lambda_m$ can be expressed as:

$$P(\lambda_m \mid k, \mu, \vartheta) = \exp\left\{-\left[1 + \frac{k(\lambda_m - \mu)}{\vartheta}\right]^{-\frac{1}{k}}\right\}\frac{1}{\vartheta}\left[1 + \frac{k(\lambda_m - \mu)}{\vartheta}\right]^{\left(-1-\frac{1}{k}\right)}, \qquad (13)$$

where $k$ is the shape parameter, $\vartheta$ is the scale parameter, and $\mu$ is the location parameter. These parameters can be derived by fitting a GEVD to the maximum values drawn randomly from the prior distribution of the precision, $\Gamma(\lambda \mid \lambda_\lambda, \vartheta_\lambda)$. An upper bound of the maximum precision value can then be


obtained by estimating the 99th quantile of the inverse cumulative distribution function of the GEVD.
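The short sketch below (our illustration, not the authors' code) shows one way to obtain such a bound with SciPy: maxima of repeated draws from the precision prior $\Gamma(\lambda \mid \lambda_\lambda, \vartheta_\lambda)$ are fitted with a GEV distribution, and its 99th percentile is returned as the cap (the lam_max argument in the earlier EM sketch); the sample sizes and the function name are arbitrary.

```python
import numpy as np
from scipy.stats import gamma, genextreme

def precision_upper_bound(lam_lambda, th_lambda, n_draws=1000, n_max=1000, seed=0):
    """Fit a GEVD to maxima of Gamma-prior samples and return its 99th percentile."""
    rng = np.random.default_rng(seed)
    # Each maximum is taken over n_draws samples from the Gamma prior on the precision
    samples = gamma.rvs(a=lam_lambda, scale=th_lambda,
                        size=(n_max, n_draws), random_state=rng)
    maxima = samples.max(axis=1)
    shape, loc, scale = genextreme.fit(maxima)   # maximum-likelihood GEV fit
    return genextreme.ppf(0.99, shape, loc=loc, scale=scale)

# e.g. with the shape 4 and scale 0.003 used later for the Challenge data:
# lam_max = precision_upper_bound(4, 0.003)
```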

Data Description

The electrocardiogram (ECG) is a standard and powerful tool for assessing cardiovascular health, as many detrimental heart conditions manifest as abnormalities in the ECG. The QT interval is one particular measure of ECG morphology, and refers to the elapsed time between the onset of ventricular depolarisation (the QRS complex) and the T wave offset (ventricular repolarisation).4 Accurate measurement of the QT interval is essential since abnormal intervals indicate a potentially serious but treatable condition, and can be a contraindication for the use of drugs or other interventions.11 Viskin et al.19 presented the ECGs recorded from two patients with long QT syndrome (LQTS) and from two healthy females to 902 physicians (25 QT experts who had published on the subject, 106 arrhythmia specialists, 329 cardiologists, and 442 non-cardiologists) from 12 countries. No other details were given on the actual training or intrinsic accuracy of these annotators. For patients with LQTS, 80% of arrhythmia specialists calculated the QTc (the heart rate corrected QT interval) correctly but only 50% of cardiologists and 40% of noncardiologists did so. In the context of QT annotation, where baseline wander is frequent, it was observed that a few annotators consistently over- or under-estimated the QT interval.25 Other studies have reported significant intra- and inter-observer variability in QT annotations, ranging from 10 to 30 ms.3,8 It is important to note that experts or non-experts with different levels of training or expertise can have significantly different biases. Naïve approaches to aggregating labels from a group of annotators of unknown expertise could therefore lead to poor results. However, annotators' biases are rarely taken into account when aggregating different labels or opinions in medical labelling tasks.

We hypothesise that incorporating an accurate estimation of each annotator's bias into a model for fusing annotations (as described in the ''Bayesian Continuous-valued Label Aggregator (BCLA)'' to ''Convergence Criteria for the MAP-EM Approach'' sections) will result in an improved estimate of the ground truth. In order to test this hypothesis we have used two data sets: one simulated data set, to ensure an absolute ground truth is available, and one real data set of QT intervals. Although we have chosen to use QT interval data because of the availability of numerous annotations, the method we present is more general and can be applied to other continuous-valued annotations.

Simulated Data Set

To test the reliability of the BCLA as a generative model, a simulated data set was created: a total of 548 simulated records were generated, each with 20 independent annotators, thus providing a total of 10,960 annotations (see Fig. 3). The simulated data set considered that annotators have precision values, $\lambda$ (i.e. $1/\sigma^2$), which were drawn from $\Gamma(4, 0.0003)$, with the assumption that the annotations provided by the best performing annotator are within ±15 ms of the ground truth. Annotators' biases were drawn from $\mathcal{N}(10, 25)$, a Gaussian distribution with a mean of 10 ms and a standard deviation ($1/\sqrt{\alpha_\phi}$) of 25 ms. The true annotation for each record was drawn from $\mathcal{N}(400, 40)$, a Gaussian distribution with a mean, $\alpha$, of 400 ms and a standard deviation ($1/\sqrt{\beta}$) of 40 ms. In addition, it was assumed that $\alpha_\phi$ was drawn from $\Gamma(3, 0.0005)$, ensuring that the mean standard deviation of the distribution from which the biases were drawn is 25 ms. The $\beta$ was drawn from $\Gamma(3, 0.0002)$, ensuring that the mean standard deviation of the distribution from which the true annotations were drawn is 40 ms. The generated 10,960 annotations were then fed into the BCLA model to evaluate its accuracy in estimating the true annotation in an unsupervised manner as well as predicting the bias and precision of each annotator.
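The sketch below (ours) reproduces this generative setup with NumPy, reading the distributions above as N(mean, standard deviation) and Gamma(shape, scale); the function name simulate_annotations and the random seed are illustrative.

```python
import numpy as np

def simulate_annotations(n_records=548, n_annotators=20, seed=1):
    """Generate a simulated annotation matrix following the description in the text."""
    rng = np.random.default_rng(seed)
    z = rng.normal(400.0, 40.0, size=n_records)                   # true annotations (ms)
    phi = rng.normal(10.0, 25.0, size=n_annotators)               # annotator biases (ms)
    lam = rng.gamma(shape=4.0, scale=0.0003, size=n_annotators)   # annotator precisions
    sigma = 1.0 / np.sqrt(lam)                                    # per-annotator standard deviations
    # y[i, j] = z_i + phi_j + Gaussian noise with standard deviation sigma_j (Eq. 1)
    y = z[:, None] + phi[None, :] + rng.normal(size=(n_records, n_annotators)) * sigma[None, :]
    return y, z, phi, sigma

y, z_true, phi_true, sigma_true = simulate_annotations()
```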

Real Data Set

The data were drawn from the QT interval annotations generated by participants in the 2006 PhysioNet/Computing in Cardiology (PCinC) Challenge15 for labelling QT intervals with reference to Lead II in each of the 548 recordings in the Physikalisch-Technische Bundesanstalt Diagnostic ECG Database (PTBDB).2 The records were from 290 subjects (209 men with mean age of 55.5 years and 81 women with mean age of 61.6 years), in which 20% of the subjects

FIGURE 3. The box plot of the error between the generated and true annotations for each of the 20 simulated annotators. The black 'x' indicates the bias of each annotator. The span of each box represents the precision of the annotations (rather than the interquartile range) over all annotations for each annotator.


were healthy controls. An example of a QT interval is demonstrated in Fig. 1c. The PTBDB contained records of patients with a variety of ECG morphologies having different QT intervals, ranging from 256 to 529 ms. The diagnostic classifications of ECG morphologies mainly included myocardial infarction, heart failure, bundle branch block, and dysrhythmia, as stated in Bousseljot and Kreiseler.2

There were two main categories of annotations: manual and automated (see Table 1). A total of 38,621 annotations were collected and were divided into three divisions: 20 human annotators in Division 1, 48 closed source automated algorithms in Division 2, and 21 open source automated algorithms in Division 3. Division 4 was further created here so as to combine all automated algorithms from Divisions 2 and 3, in order to provide a larger data set and allow a better estimation of automated QT intervals. The number of annotators per division and the average number of annotations per record are listed in Table 1. The overall percentage of the annotators in each division with complete annotations (i.e. annotations on all 548 recordings) was: 55% in Division 1, 40% in Division 2, 43% in Division 3, and 45% in Division 4. The competition score for each entry was calculated from the root mean square error (RMSE) between the submitted and the reference QT intervals. The reference annotations were generated from Division 1's entries using a maximum of 15 participants by taking the ''median self-centering approach'' as reported by the competition organisers and detailed in Ref. 24. The best-performing score for each division is also listed in Table 1. Furthermore, the majority of the QT annotations of each 2-min record occurred within the first 5 s of the ECG recordings. The best scores in the first 5-s segment were similar to those of the 2-min segment (denoted by * in Table 1). To reduce any possible inter-beat variations, only the annotations within the first 5-s segment of each record were chosen, to ensure that all annotators had approximately labelled the same region of a record with similar QT morphologies. Therefore, the motivation for choosing the first 5-s segment of each record was to consider a short segment where the QT interval is not changing dramatically (with respect to a particular beat an annotator chose), while retaining the highest number of annotations. Annotations that fell outside this segment were considered to be missing information and were discarded in the process of the QT estimation.

As the manual entry (i.e. Division 1) was used to generate the reference annotations, we therefore focused on the analysis of the automated entries (i.e. Divisions 2, 3, and 4). In terms of parameter settings (see Table 2), the annotator-specific precision was drawn from $\Gamma(\lambda_\lambda, \vartheta_\lambda)$, with the assumption that the annotations provided by the best performing algorithm are within ±5 ms of the reference. Annotators' biases were considered to be drawn from $\mathcal{N}(\mu_\phi, 1/\sqrt{\alpha_\phi})$, and $\alpha_\phi$ was modelled

TABLE 1. Performance by competition entrants on the first 5-s ECG segment for each division of the 2006 PCinC Challenge.

                                        Manual annotators    Automated algorithms
                                        Division 1           Division 2        Division 3        Division 4
  Number of annotators                  20                   48                21                69
  Average annotations per record        18 (18*)             39 (41*)          15 (21*)          54 (62*)
  RMSE score (ms)                       6.65 (6.67*)         16.36 (16.34*)    17.46 (17.33*)    16.36 (16.34*)
  Interquartile range of score (ms)     30.40                35.77             128.00            57.00

Note: The annotator/algorithm having the lowest RMSE over the 5-s segment was selected to represent the best score. The results marked with * were published in the Challenge for a 2-min segment.

TABLE 2. The parameters of the BCLA and their values for modelling the 2006 PCinC data set.

  Symbol                 Definition                                        Value
  $\lambda_\beta$        Shape of Gamma distribution for $\beta$           3‡
  $\vartheta_\beta$      Scale of Gamma distribution for $\beta$           0.0002‡
  $\mu_\phi$             Mean of the bias distribution                     10†
  $\lambda_\alpha$       Shape of Gamma distribution for $\alpha_\phi$     3†
  $\vartheta_\alpha$     Scale of Gamma distribution for $\alpha_\phi$     0.0005†
  $\lambda_\lambda$      Shape of Gamma distribution for $\lambda$         4*
  $\vartheta_\lambda$    Scale of Gamma distribution for $\lambda$         0.003*

Note: $\beta$ is the precision parameter for the model of the ground truth. $\alpha_\phi$ is the precision parameter for the model of the bias. $\lambda$ refers to the annotators' precision values. The values marked with * are determined under the assumption that the annotations provided by the best performing algorithm are within ±5 ms of the reference. The values marked with † are derived from Refs. 1,5,10. The values marked with ‡ are derived from Refs. 4,9,12.


by $\Gamma(\lambda_\alpha, \vartheta_\alpha)$, assuming that the automated annotations tend to over-estimate manual annotations, as described in previous studies.1,5,10 The true QT interval for each record was assumed to be drawn from $\mathcal{N}(\alpha, 1/\sqrt{\beta})$, where $\beta$ was modelled by $\Gamma(\lambda_\beta, \vartheta_\beta)$.4,9,12 Instead of assuming the mean (i.e. $\alpha$) of the underlying ground truth to be a fixed scalar, we updated it using a linear regression function, $f(\mathbf{w}, \mathbf{x})$, where the coefficients, $\mathbf{w}$, were estimated using Eq. (8). An intercept was included in $f(\mathbf{w}, \mathbf{x})$ to model the overall offset predicted in $f$, and no particular features were considered in this case (i.e. $\mathbf{x}_i = 1$) as we were solely interested in the performance of the model.

Methodology of Validation and Comparison

The BCLA-inferred precision of individual algorithms was compared with those estimated using the EM algorithm proposed by Raykar et al.16 (denoted as EM-R), as it served as one of the benchmarking algorithms. Furthermore, the mean and standard deviation ($\mu \pm \sigma_\mu$ ms) of 100 bootstrapped (i.e. random sampling with replacement) samples across records from the BCLA model were compared with the best algorithm (i.e. the algorithm with the highest precision after correction of the bias offset), EM-R, and the traditional naïve mean and median voting approaches in both the simulated and real data sets. The mean absolute error (MAE) of the annotations was also calculated, as it provides an interpretation of the difference between the estimated and the reference annotations (with a resolution of 1 ms). A two-sided Wilcoxon rank sum test (p < 0.0001) was applied to the 100 bootstrapped RMSEs and MAEs, to provide a comparison for the BCLA and EM-R vs. other methodologies. In assessing the performance of the BCLA as a function of the number of annotators, a random number of annotators was selected 100 times. This was repeated with the annotator numbers varied from three to the maximum number of annotators in the division. The minimum number of annotators was chosen to be three to allow results to be obtained from the median voting approach. The $\mu \pm \sigma_\mu$ (ms) of the RMSE of the BCLA, the EM-R, the mean, and the median were calculated and compared.
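A minimal sketch of this evaluation protocol, written with NumPy/SciPy rather than the authors' MATLAB code, is shown below; it assumes that fused labels (z_hat) and reference labels (z_ref) are already available as arrays, and it uses scipy.stats.ranksums for the two-sided Wilcoxon rank-sum test.

```python
import numpy as np
from scipy.stats import ranksums

def bootstrap_errors(z_hat, z_ref, n_boot=100, seed=2):
    """Return RMSEs and MAEs over n_boot bootstrap resamples of the records."""
    rng = np.random.default_rng(seed)
    n = len(z_ref)
    rmse, mae = np.empty(n_boot), np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)         # sample records with replacement
        err = z_hat[idx] - z_ref[idx]
        rmse[b] = np.sqrt(np.mean(err ** 2))
        mae[b] = np.mean(np.abs(err))
    return rmse, mae

# Comparing two fusion strategies, e.g. BCLA vs. a mean vote:
# rmse_bcla, _ = bootstrap_errors(z_bcla, z_ref)
# rmse_mean, _ = bootstrap_errors(z_mean, z_ref)
# stat, p = ranksums(rmse_bcla, rmse_mean)       # two-sided Wilcoxon rank-sum test
```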

RESULTS

The convergence of the BCLA model is guaranteed by providing a threshold using the GEVD as a stopping criterion (see Eq. (13)). In the real data set, the upper bound of the precision derived from the GEVD was 0.04, which was based on the assumption that the best performing annotator is within ±5 ms of the reference. The number of iterations depends on the number of records and the number of annotations. To illustrate the practical utility of our model, it took 7.55 s for the BCLA to perform 5,000 iterations when considering a total of 20,712 annotations (Division 2) using MATLAB R2011a on a 2.2 GHz Intel(R) i7-2670QM processor. Approximately 2,500 iterations were required to stabilise all the parameters.

FIGURE 4. A comparison of the simulated and inferred $\sigma$ in (a) and bias in (b) of each annotator in the simulated data set. The precision can be estimated by taking $1/\sigma^2$. The diagonal (grey) line indicates a perfect match between simulated and estimated results. Note that EM-R significantly over-estimates the $\sigma$ in all simulations.


Simulated Data Set

Figure 4a shows an example of the inferred results estimated using the EM-R and the BCLA. As the EM-R algorithm jointly modelled the precision (i.e. $1/\sigma^2$) of each annotator and the noise of the underlying ground truth, its estimated $\sigma$ cannot represent the real precision of each annotator. Furthermore, the EM-R algorithm does not consider the bias of each annotator, and we observe that its estimated values of $\sigma$ were well above the line of identity, indicating a consistent over-estimation. In contrast, the BCLA-inferred results of $\sigma$ lie close to the line of identity in the plot, indicating that the BCLA model can provide a reliable estimation of the true precision in the simulated results. In addition to precision, the BCLA modelled the bias of each annotator and the results are provided in Fig. 4b: the estimated biases are very close to the true biases. Although not all the estimated precisions and biases of each annotator were identical to the simulated values, the BCLA model inferred annotations without any prior knowledge of who the best annotator was, in an unsupervised manner.

In order to compare the accuracy of the inferred labels using the BCLA model, the 548 simulated annotations were bootstrapped 100 times. Each time an RMSE and MAE were generated and compared to the

TABLE 3. The RMSEs and the MAEs of the inferred labels using different strategies in the simulated data set.

             Best annotator     Median            Mean              EM-R             BCLA
  RMSE (ms)  34.91 ± 0.74*      18.84 ± 0.38*     13.11 ± 0.31*     14.21 ± 0.36     6.44 ± 0.34*†
  MAE (ms)   30.15 ± 0.72*      12.60 ± 0.36      11.26 ± 0.30*     12.64 ± 0.36     5.14 ± 0.30*†

Note: Results significantly different from the others (p < 0.0001) are indicated by † for the BCLA model and by * (columns 2 to 4, and 6 only) for the EM-R.

FIGURE 5. A comparison of the 2006 PCinC Challenge reference and inferred $\sigma$ and bias of each annotator using the reference provided, for Division 2 in (a) and (b), Division 3 in (c) and (d), and Division 4 in (e) and (f) respectively. The precision can be estimated by taking $1/\sigma^2$. The leading diagonal line of each plot indicates a perfect match between the Challenge reference and the estimated results. The mean (i.e. bias), $\phi$, and $\sigma$ of the difference in annotations for Division 3 are shown in (g). The annotators were ranked based on their bias values. The solid line indicates the mean of the biases, whereas the dotted lines indicate $1.96\sigma$ of the mean assumed in the BCLA. Note that annotators 3, 7, and 15 are labelled in the corresponding plots.


best annotator, mean, EM-R, and median voting strategies. The results are shown in Table 3. The RMSE and MAE results show that the BCLA-inferred labels significantly outperformed the mean, median, EM-R, and best annotator when compared with the simulated true annotations.

Real Data Set

Figures 5a–f show the inferred precision and bias results estimated using EM-R and BCLA for the different automated divisions. As mentioned previously, the EM-R algorithm does not directly model the precision (i.e. $1/\sigma^2$) of each annotator; its estimated $\sigma$ for each annotator shows an offset from the values provided by the reference annotations. In contrast, the BCLA-inferred $\sigma$ results lie much closer to the line of identity in Figs. 5a, 5c, and 5e, indicating that the BCLA model can provide a reliable estimation of the true precision of each annotator. In addition, the BCLA modelled the bias of each annotator accurately (see Figs. 5b, 5d, and 5f). Although automated annotators 3 and 15 were predicted by the BCLA to have lower bias values than those provided by the reference, they are considered to be outliers due to the assumption made in our model: annotators' biases were drawn from a Gaussian distribution with 10 ms mean and 25 ms standard deviation. As Fig. 5g shows, the biases of annotators 3 and 15 lie outside 95% of the area (i.e. ±1.96$\sigma$ of the mean under the normal distribution) predicted by the BCLA. In the case of annotator 7, its precision was underestimated (see Figs. 5c and 5e), which also affected the BCLA's estimation of its bias value. It was observed that only 3.47% of records were annotated by annotator 7, making it harder for the BCLA to provide a reliable estimation of its precision and bias values.

In the evaluation of the inferred labels, the 548 records were bootstrapped 100 times, and the RMSEs and MAEs of the BCLA model were generated and compared to the best annotator, mean, EM-R, and median voting approaches for the given reference. The results are displayed in Table 4: for Division 2, using 48 algorithms, the BCLA achieved an RMSE of 12.57 ± 0.67 ms, which significantly outperformed the other approaches and provides an improvement of 16.48% over the next best approach (EM-R, with an RMSE of 15.05 ± 0.49 ms); in the open source entry Division 3, using 21 algorithms, the BCLA again exhibited superior performance over the other methods, with an RMSE of 13.90 ± 0.84 ms and a 19.48% improved error rate over the next best method (RMSE of 17.25 ± 2.33 ms). When considering all automated entries (Division 4), the BCLA provided an even more accurate performance than on the other two data sets (Divisions 2 and 3), as well as over the other methods tested, with an RMSE of 11.78 ± 0.63 ms.

A further evaluation of the accuracy, in terms of RMSE, was made as a function of the number of annotators (see Fig. 6). The results were generated by sub-sampling annotators without replacement 100 times. The benchmarking algorithm, EM-R, outperformed the mean and median approaches initially but then

FIGURE 6. The mean and standard deviation of the RMSE results as a function of the number of annotators for Division 4 when using the BCLA, EM-R, median, and mean voting approaches. Inset: a close-up of the RMSE results when using 11 annotators or fewer.

TABLE 4. The RMSEs and the MAEs of the inferred labels using different voting approaches in the 2006 PCinC data set.

  Div   Best annotator     Median            Mean              EM-R             BCLA
  RMSE (ms)
  2     15.43 ± 0.73*      15.29 ± 0.58      16.17 ± 0.54*     15.05 ± 0.49     12.57 ± 0.67*†
  3     17.25 ± 2.33*      19.16 ± 0.88      30.46 ± 1.57*     18.92 ± 0.82     13.90 ± 0.84*†
  4     15.37 ± 2.13*      14.43 ± 0.57*     17.61 ± 0.55*     14.76 ± 0.52     11.78 ± 0.63*†
  MAE (ms)
  2     10.85 ± 0.58*      11.76 ± 0.42      12.61 ± 0.43*     11.81 ± 0.40     9.29 ± 0.45*†
  3     11.61 ± 3.03*      14.04 ± 0.55      22.89 ± 0.96*     14.12 ± 0.60     10.28 ± 0.67*†
  4     11.17 ± 2.32*      11.21 ± 0.40*     14.16 ± 0.43*     11.49 ± 0.41     8.56 ± 0.42*†

Note: Results significantly different from the others (p < 0.0001) are indicated by † for the BCLA model and by * (columns 2 to 4, and 6 only) for the EM-R.


underperformed when compared to the median approach after 43 algorithms were used. The BCLA model outperformed the other methods tested for any number of annotators considered. In practice, it is rare to have more than three to five independent algorithms for estimating a label or predicting an event. In the case where only three automated algorithms were randomly selected, the BCLA had on average 9.02, 19.82, and 24.56% improvement over the EM-R, median, and mean voting approaches respectively.

Although the lowest BCLA RMSE (11.78 ± 0.63 ms) in the automated entry is larger than that of the best-performing human annotator in the Challenge (RMSE = 6.65 ms), there were only two other human annotators who achieved a score below 10 ms. Furthermore, as the annotations of the automated algorithms were determined independently of the reference, whereas the reference includes the best human annotators, it is unsurprising that a combination of the automated algorithms would have worse performance.

DISCUSSION

In this article, a novel model, the Bayesian Continuous-valued Label Aggregator, was proposed to infer the ground truth of continuous-valued labels where accurate and consistent expert annotations are not available. As a proof-of-concept, the BCLA was applied to QT interval estimation from the ECG using labels from the 2006 PhysioNet/Computing in Cardiology Challenge database, and it was compared to the mean, median, and a previously proposed Expectation Maximization label aggregation method (i.e. EM-R). While accurately predicting each labelling participant's bias and precision, the BCLA significantly outperformed the best Challenge entry, as well as the EM-R, mean, and median voting strategies, in terms of root-mean-square error. There are two key contributions in our approach: (i) the BCLA provides an estimation of continuous-valued annotations, which is valuable for time-series related data as well as for the duration of events in physiological data; (ii) it introduces a unified framework for combining continuous-valued annotations to infer the underlying ground truth, while jointly modelling annotators' biases and precisions. The BCLA operates in an unsupervised Bayesian learning framework; no reference data were used to train the model parameters, and separate training and validation test sets were not required. Combining more experienced annotators would therefore provide a better estimation of the inferred ground truth. Importantly though, the BCLA does guarantee a performance better than the best annotator without any prior knowledge of who or what is the best annotator.

Novel contextual features were introduced in our previous study,26 which allowed an algorithm to learn how varying physiological and noise conditions affect each annotator's ability to accurately label medical data. The inferred result was shown to provide an improved 'gold standard' for medical annotation tasks even when the ground truth is not available. As a next step, if we incorporate such context into the weighting of annotators, the BCLA is expected to have an even larger impact for noisy data sets or annotators with a variety of specialisations or skill levels. The current model assumed consistent performance of each annotator throughout the records, i.e. that his/her performance is time-invariant. Although this might not be true over an extended period of time, where an annotator's performance might improve through learning or drop due to inattention or fatigue, the nature of the data sets considered in this work is such that we can assume that performance across records is approximately consistent for each annotator. Future work will include modelling the performance of each annotator varying across records and through time, to provide a more reliable estimation of the aggregated ground truth for data sets in which intra-annotator performance is highly variant.

Our model of the annotators currently does not factor in possible dependency/correlation between individual annotators, although independence might not hold, particularly for automated algorithms. Incorporating a correlation measure into the annotators' model could possibly allow for a better aggregation of the inferred ground truth. Annotators who are considered to be anomalous (i.e. highly correlated but with large variances and biases) should be penalised with lower weights; expert annotators (i.e. highly correlated but with small variances and biases) should be favourably weighted in the model. Finally, combining annotations derived from reliable experts using the BCLA model could potentially lead to improved training for supervised labelling approaches.

ACKNOWLEDGMENTS

TZ acknowledges the support of the RCUK Digital Economy Programme Grant Number EP/G036861/1 and an ARM Scholarship in Sustainable Healthcare Technology through Kellogg College. ND was supported by Cerner Corporation and the UK EPSRC. JB was supported by the UK EPSRC, the Balliol French


Anderson Scholarship Fund, and MindChild Medical Inc. DAC is supported by the Royal Academy of Engineering and Balliol College.

REFERENCES

1. Andrew, W., V. Michael, D. Jeff, G. M. Nair, C. Plater-Zyberk, L. Griffith, J. Ma, C. Zachos, and M. L. Sivilotti. Variability of QT interval measurements in opioid-dependent patients on methadone. CJAM 2:10–16, 2014.
2. Bousseljot, R., D. Kreiseler, and A. Schnabel. Nutzung der EKG-Signaldatenbank CARDIODAT der PTB über das Internet. Biomed. Tech. 40:317–318, 1995.
3. Christov, I., I. Dotsinsky, I. Simova, R. Prokopova, E. Trendafilova, and S. Naydenov. Dataset of manually measured QT intervals in the electrocardiogram. Biomed. Eng. Online 5:31, 2006.
4. Clifford, G. D., F. Azuaje, and P. E. McSharry. Advanced Methods and Tools for ECG Analysis. Engineering in Medicine and Biology. Norwood, MA: Artech House, 2006.
5. Couderc, J. P., C. Garnett, M. Li, R. Handzel, S. McNitt, X. Xia, S. Polonsky, and W. Zareba. Highly automated QT measurement techniques in 7 thorough QT studies implemented under ICH E14 guidelines. Ann. Noninvasive Electrocardiol. 16:13–24, 2011.
6. Dawid, A. P., and A. M. Skene. Maximum likelihood estimation of observer error-rates using the EM algorithm. Appl. Stat. J. R. Stat. Soc. C 28:20–28, 1979.
7. Dekel, O., and O. Shamir. Good learners for evil teachers. In: Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, ACM, pp. 233–240, 2009.
8. Ehlert, F. A., J. J. Goldberger, J. E. Rosenthal, and A. H. Kadish. Relation between QT and RR intervals during exercise testing in atrial fibrillation. Am. J. Cardiol. 70:332–338, 1992.
9. Goldenberg, I., A. J. Moss, W. Zareba, et al. QT interval: how to measure it and what is ''normal''. J. Cardiovasc. Electrophysiol. 17:333–336, 2006.
10. Hughes, N. P. Probabilistic Models for Automated ECG Interval Analysis. Ph.D. Thesis, University of Oxford, 2006.
11. International Conference on Harmonization of Technical Requirements for Registration of Pharmaceuticals for Human Use: Guidance for Industry E14: Clinical Evaluation of QT/QTc Interval Prolongation and Proarrhythmic Potential for Non-Antiarrhythmic Drugs, 2005. Available at: http://www.fda.gov/downloads/drugs/guidancecomplianceregulatoryinformation/guidances/ucm073153.pdf.
12. Malik, M., P. Färbom, V. Batchvarov, K. Hnatkova, and A. J. Camm. Relation between QT and RR intervals is highly individual among healthy subjects: implications for heart rate correction of the QT interval. Heart 87:220–228, 2002.
13. Metlay, J. P., W. N. Kapoor, and M. J. Fine. Does this patient have community-acquired pneumonia? Diagnosing pneumonia by history and physical examination. JAMA 278:1440–1445, 1997.
14. Molinari, F., L. Gentile, P. Manicone, R. Ursini, L. Raffaelli, M. Stefanetti, A. D'Addona, T. Pirronti, and L. Bonomo. Interobserver variability of dynamic MR imaging of the temporomandibular joint. La Radiologia Medica 116:1303–1312, 2011.
15. Moody, G. B., H. Koch, and U. Steinhoff. The PhysioNet/Computers in Cardiology Challenge 2006: QT interval measurement. Comput. Cardiol. 33:313–316, 2006.
16. Raykar, V. C., S. Yu, L. H. Zhao, G. H. Valadez, C. Florin, L. Bogoni, and L. Moy. Learning from crowds. JMLR 11:1297–1322, 2010.
17. Salerno, S. M., P. C. Alguire, and H. S. Waxman. Competency in interpretation of 12-lead electrocardiograms: a summary and appraisal of published evidence. Ann. Intern. Med. 138:751–760, 2003.
18. Valizadegan, H., Q. Nguyen, and M. Hauskrecht. Learning medical diagnosis models from multiple experts. In: AMIA Annual Symposium Proceedings, pp. 921–930, 2012.
19. Viskin, S., U. Rosovski, A. J. Sands, E. Chen, P. M. Kistler, J. M. Kalman, L. R. Chavez, P. I. Torres, F. E. Cruz F., O. A. Centurion, A. Fujiki, P. Maury, X. Chen, A. D. Krahn, F. Roithinger, L. Zhang, G. M. Vincent, and D. Zeltser. Inaccurate electrocardiographic interpretation of long QT: the majority of physicians cannot recognize a long QT when they see one. Heart Rhythm 2:569–574, 2005.
20. Warby, S. C., S. L. Wendt, P. Welinder, E. G. Munk, O. Carrillo, H. B. Sorensen, P. Jennum, P. E. Peppard, P. Perona, and E. Mignot. Sleep-spindle detection: crowdsourcing and evaluating performance of experts, non-experts and automated methods. Nat. Methods 11:385–392, 2014.
21. Warfield, S. K., K. H. Zou, and W. M. Wells. Validation of image segmentation by estimating rater bias and variance. Philos. Trans. R. Soc. London A 366:2361–2375, 2008.
22. Welinder, P., and P. Perona. Online crowdsourcing: rating annotators and obtaining cost-effective labels. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pp. 25–32, 2010.
23. Welinder, P., S. Branson, P. Perona, and S. J. Belongie. The multidimensional wisdom of crowds. Adv. Neural Inf. Process. Syst. 23:2424–2432, 2010.
24. Willems, J., P. Arnaud, J. van Bemmel, P. Bourdillon, C. Brohet, S. Dalla Volta, J. Andersen, R. Degani, B. Denis, M. Demeester, et al. Assessment of the performance of electrocardiographic computer programs with the use of a reference data base. Circulation 71(3):523–534, 1985.
25. Zhu, T., J. Behar, T. Papastylianou, and G. D. Clifford. Crowdlabel: a crowd-sourcing platform for electrophysiology. Comput. Cardiol. 41:789–792, 2014.
26. Zhu, T., A. E. Johnson, J. Behar, and G. D. Clifford. Crowd-sourced annotation of ECG signals using contextual information. Ann. Biomed. Eng. 42:871–884, 2014.
