Deep Multi-scale Location-aware 3D Convolutional Neural Networks for Automated Detection of Lacunes of Presumed Vascular Origin

Mohsen Ghafoorian 1a,b, Nico Karssemeijer b, Tom Heskes a, Mayra Bergkamp c, Joost Wissink c, Jiri Obels b, Karlijn Keizer c, Frank-Erik de Leeuw c, Bram van Ginneken b, Elena Marchiori a, Bram Platel b

a Institute for Computing and Information Sciences, Radboud University, Nijmegen, The Netherlands
b Diagnostic Image Analysis Group, Department of Radiology and Nuclear Medicine, Radboud University Medical Center, Nijmegen, The Netherlands
c Donders Institute for Brain, Cognition and Behaviour, Department of Neurology, Radboud University Medical Center, Nijmegen, The Netherlands

Abstract

Lacunes of presumed vascular origin (lacunes) are associated with an increased risk of stroke, gait impairment, and dementia, and are a primary imaging feature of small vessel disease. Quantification of lacunes may be of great importance for elucidating the mechanisms behind neurodegenerative disorders and is recommended as part of study standards for small vessel disease research. However, due to the varying appearance of lacunes in different brain regions and the existence of other similar-looking structures, such as perivascular spaces, manual annotation is a difficult, laborious and subjective task, which could be greatly improved by reliable and consistent computer-aided detection (CAD) routines.

In this paper, we propose an automated two-stage method using deep convolutional neural networks (CNNs). We show that this method has good performance and can considerably benefit readers. We first use a fully convolutional neural network to detect initial candidates. In the second step, we employ a 3D CNN as a false positive reduction tool. As location information is important to the analysis of candidate structures, we further equip the network with contextual information using multi-scale analysis and the integration of explicit location features. We trained, validated and tested our networks on a large dataset of 1075 cases obtained from two different studies. Subsequently, we conducted an observer study with four trained observers and compared our method with them using a free-response operating characteristic analysis. Evaluated on a test set of 111 cases, the resulting CAD system exhibits performance similar to the trained human observers and achieves a sensitivity of 0.974 with 0.13 false positives per slice. A feasibility study also showed that a trained human observer would considerably benefit once aided by the CAD system.

Keywords: lacunes, automated detection, convolutional neural networks, deep learning, multi-scale, location-aware

1. Introduction

Lacunes of presumed vascular origin (lacunes), also referred to as lacunar strokes or silent brain infarcts, are frequent imaging features on scans of elderly patients and are associated with an increased risk of stroke, gait impairment, and dementia [1–4]. Lacunes are presumed to be caused by either symptomatic or silent small subcortical infarcts, or by small deep haemorrhages [5], and together with white matter hyperintensities, microbleeds, perivascular spaces and brain atrophy are known to be imaging biomarkers that signify small vessel disease (SVD) [6].

Lacunes are defined as round or ovoid subcortical fluid-filled cavities of between 3 mm and about 15 mm in diameter, with signal intensities similar to cerebrospinal fluid (CSF) [7].

1Email: [email protected], phone: +31243655793, fax: +31 24 3652289

On fluid-attenuated inversion recovery (FLAIR) images, lacunes mostly present as a central CSF-like hypointensity with a surrounding hyperintense rim, although the rim may not always be present [7]. In some cases, the central cavity is not suppressed on the FLAIR image and hence the lesion may appear entirely hyperintense, while a clear CSF-like intensity appears on other sequences such as T1-weighted or T2-weighted MR images [8].

Wardlaw et al. [7] propose measurements of the number and location of lacunes of presumed vascular origin as part of analysis standards for neuroimaging features in SVD studies. However, this is known to be a challenging, highly subjective task, since lacunes can be difficult to differentiate from perivascular spaces, another SVD imaging feature. Perivascular spaces are also areas filled with cerebrospinal fluid; although they are often smaller than 3 mm, they can enlarge to 10 to 20 mm [7]. Although perivascular spaces naturally lack the hyperintense rim, such a rim can also surround perivascular spaces when they pass through an area of white matter hyperintensity [9].

Preprint submitted to Elsevier November 1, 2016

arXiv:1610.07442v2 [cs.CV] 29 Oct 2016


Considering the importance, difficulty and hence potential subjectivity of the lacune detection task, assistance from a computer-aided detection (CAD) system may increase overall reader performance. A number of automated methods have therefore been proposed.

Yokoyama et al. [10] developed two separate methods for the detection of isolated lacunes and lacunes adjacent to the ventricles, using threshold-based multiphase binarization and a top-hat transform, respectively. Later on, Uchiyama et al. employed false positive reducers on top of the aforementioned method, describing each candidate with 12 features and combining a rule-based classifier with either a support vector machine [11] or a three-layered neural network followed by an extra modular classifier [12]. In another study, Uchiyama et al. used six features and a neural network to discriminate lacunes from perivascular spaces [13, 14]. They also showed that the performance of radiologists improved once the CAD system's detections were shown to them [15]. Another false positive reduction method, using template matching in the eigenspace, was recently utilized by the same group [16]. Finally, Wang et al. [17] detect lacunes by dilating the white matter mask and using a rule-based pruning of false positives based on their intensity levels relative to the surrounding white matter tissue.

Deep neural networks [18, 19] are biologically inspired learning structures and have achieved human-level or super-human performance in several different domains [20–24]. Recently, deep architectures, and in particular convolutional neural networks (CNNs) [25], have also attracted enormous attention in the medical image analysis field, given their exceptional ability to learn discriminative representations for a large variety of tasks. A recent wave of deep learning based methods has therefore appeared in various domains of medical image analysis [26], including neuroimaging tasks such as brain extraction [27], tissue segmentation [28, 29], tumor segmentation [30, 31], microbleed detection [32] and brain lesion segmentation [33–37].

In this paper, we propose a two-stage application of deep convolutional networks for the detection of lacunes. We use a fully convolutional network [38] for candidate detection and a 3D convolutional network for false positive reduction. Since the anatomical location of imaging features is important in neuroimage analysis (e.g. for the detection of white matter hyperintensities (WMHs) [39]), we equip the CNN with more contextual information by performing multi-scale analysis as well as adding explicit location information to the network. To evaluate the performance of our proposed method and compare it to trained human observers, we perform an observer study on a large test set of 111 subjects with different underlying disorders.

2. Materials

Data for training and evaluation of our method comes from two different studies: the Radboud University Nijmegen Diffusion tensor and Magnetic resonance imaging Cohort (RUNDMC) and the Follow-Up of transient ischemic attack and stroke patients and Unelucidated Risk factor Evaluation study (FUTURE). The RUNDMC [40] investigates the risk factors and clinical consequences of SVD in individuals 50 to 85 years old without dementia, and the FUTURE [41] is a single-centre cohort study on risk factors and prognosis of young patients with either transient ischemic attack, ischemic stroke or hemorrhagic stroke. We collected 654 and 421 MR images from the RUNDMC and the FUTURE studies respectively, summing up to 1075 scans in total.

2.1. Magnetic Resonance Imaging

For each subject we used a 3D T1 magnetization-prepared rapid gradient-echo (MPRAGE) sequence with a voxel size of 1.0×1.0×1.0 mm and a FLAIR pulse sequence with a voxel size of 1.0×1.2×3.0 mm (including a slice gap of 0.5 mm).

2.2. Training, Validation and Test Sets

We randomly split the total of 1075 cases into three sets of 868, 96 and 111 scans for training, validation and test purposes, respectively.

2.3. Reference Annotations

Lacunes were delineated for all the images in the training and validation sets in a slice-by-slice manner by two trained raters (one for the RUNDMC and another for the FUTURE dataset), following the definitions provided in the SVD neuroimaging study standards [7].

2.4. Preprocessing

We performed the following pre-processing steps before supplying the data to our networks.

2.4.1. Image Registration

Due to possible movement of patients during scanning, the image coordinates of the T1 and FLAIR modalities might not represent the same location. We therefore performed a rigid registration of the T1 to the FLAIR image for each subject, optimizing mutual information with trilinear interpolation resampling. For this purpose, we used FSL-FLIRT [42].



Figure 1: CNN architecture for candidate detection.

Also, to obtain a mapping between patient space and an atlas space, all subjects were non-linearly registered to the ICBM152 atlas [43] using FSL-FNIRT [44].

2.4.2. Brain Extraction

To extract the brain and exclude other structures, such as the skull and eyes, we applied FSL-BET [45] to the T1 images. The resulting masks were then transformed using the registration transformations and applied to the FLAIR images.

2.4.3. Bias Field Correction

We applied FSL-FAST [46], which uses a hidden Markov random field and an associated expectation-maximization algorithm, to correct for spatial intensity variations caused by RF inhomogeneities.

3. Methods

Our proposed CAD scheme consists of two phases, a candidate detector and a false positive reducer, for both of which we employ convolutional neural networks. The details of each subproblem are given in the following subsections.

3.1. Candidate Detection

To be a suitable candidate detector, a method should be fast and highly sensitive to lacunes, while keeping the number of candidates relatively low. To achieve this, we formulated candidate detection as a segmentation problem and used a CNN for this segmentation task. A CNN is likely to satisfy all three criteria: CNNs have proven to be excellent tools for learning discriminative representations of the input pattern, and once formulated in a fully convolutional form [38], they can also provide dense predictions for image segmentation very quickly (on the order of a few seconds for a typical brain image).

3.1.1. Sampling

To describe the local neighborhood of each sampled voxel, we captured 51×51 patches from both the FLAIR and T1 images. As positive samples, we picked all the voxels in the lacune masks and augmented them by flipping the patches horizontally. We randomly sampled negative patches within the brain mask, twice as many as positive patches. This procedure resulted in a dataset of 320k patches for training.
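The sketch below illustrates this sampling scheme in Python with NumPy; it is a minimal reconstruction, not the authors' code. Border handling is omitted for brevity, the slice axis is assumed to be the third array dimension, and whether the 2:1 negative ratio is taken before or after flip augmentation is our assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def extract_patch(flair, t1, x, y, z, size=51):
    """Two-channel (FLAIR, T1) size x size patch around voxel (x, y, z)."""
    h = size // 2
    return np.stack([im[x - h:x + h + 1, y - h:y + h + 1, z]
                     for im in (flair, t1)])

def sample_training_patches(flair, t1, lacune_mask, brain_mask):
    # Positives: every voxel inside the lacune masks, plus a horizontal flip.
    pos = [extract_patch(flair, t1, x, y, z)
           for x, y, z in zip(*np.nonzero(lacune_mask))]
    pos += [p[:, :, ::-1] for p in pos]
    # Negatives: random voxels within the brain mask, twice as many as positives.
    coords = np.transpose(np.nonzero(brain_mask))
    picks = rng.choice(len(coords), size=2 * len(pos), replace=False)
    neg = [extract_patch(flair, t1, *coords[i]) for i in picks]
    return pos, neg
```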

3.1.2. Network Architecture and Training Procedure

As depicted in Figure 1, we used a seven-layer CNN consisting of four convolutional layers with 20, 40, 80 and 110 filters of size 7×7, 5×5, 3×3 and 3×3, respectively. We applied only one pooling layer, of size 2×2 with a stride of 2, after the first convolutional layer, since pooling is known to result in a shift-invariance property [47], which is not desired in segmentation tasks. We then applied three fully connected layers of 300, 200 and 2 neurons. Finally, the resulting responses were turned into likelihood values using a softmax classifier. We also used batch normalization [48] to accelerate convergence by reducing the internal covariate shift.
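A minimal PyTorch sketch of this architecture follows; the paper does not specify an implementation framework, and the exact placement of batch normalization relative to the activations is our assumption. The size of the first fully connected layer follows from the stated 51×51 input patches.

```python
import torch
import torch.nn as nn

class CandidateNet(nn.Module):
    """Seven-layer CNN: four conv layers (20, 40, 80, 110 filters of size
    7x7, 5x5, 3x3, 3x3), one 2x2/stride-2 max pooling after the first conv,
    and three fully connected layers (300, 200, 2)."""
    def __init__(self, in_channels: int = 2):  # FLAIR + T1 input channels
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 20, 7), nn.BatchNorm2d(20), nn.ReLU(),
            nn.MaxPool2d(2, stride=2),
            nn.Conv2d(20, 40, 5), nn.BatchNorm2d(40), nn.ReLU(),
            nn.Conv2d(40, 80, 3), nn.BatchNorm2d(80), nn.ReLU(),
            nn.Conv2d(80, 110, 3), nn.BatchNorm2d(110), nn.ReLU(),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            # 51x51 input -> 45 -> 22 (pool) -> 18 -> 16 -> 14 feature maps
            nn.Linear(110 * 14 * 14, 300), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(300, 200), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(200, 2),  # softmax is applied by the loss function
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

net = CandidateNet()
logits = net(torch.randn(4, 2, 51, 51))  # batch of 51x51 FLAIR/T1 patches
```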

For training the network, we used the stochastic gradient descent algorithm [49] with the Adam update rule [50], a mini-batch size of 128 and a categorical cross-entropy loss function. The non-linearity applied to the neurons was the rectified linear unit (ReLU), which prevents the vanishing gradient problem [51]. We initialized the weights with the He method [21], in which the weights are randomly drawn from a zero-mean Gaussian distribution with standard deviation √(2/fan_in). Since CNNs are complex architectures, they are prone to overfit the data very early. Therefore, in addition to batch normalization, we used drop-out [52] with probability 0.3 on all fully connected layers, as well as L2 regularization with λ2 = 0.0001. We used an early stopping policy by monitoring the validation performance, and picked the model with the highest accuracy on the validation set.
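The sketch below mirrors this training recipe in PyTorch; the learning rate is left at the optimizer default because the paper does not state one for this network, and `train_loader`, `val_loader` and `evaluate` are hypothetical stand-ins for the data pipeline and validation-accuracy routine.

```python
import torch
import torch.nn as nn

def he_init(m: nn.Module) -> None:
    # He initialization: zero-mean Gaussian with std sqrt(2 / fan_in)
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        nn.init.kaiming_normal_(m.weight, nonlinearity='relu')
        if m.bias is not None:
            nn.init.zeros_(m.bias)

def train(net, train_loader, val_loader, evaluate, max_epochs=40):
    net.apply(he_init)
    criterion = nn.CrossEntropyLoss()                # categorical cross-entropy
    optimizer = torch.optim.Adam(net.parameters(),
                                 weight_decay=1e-4)  # L2 with lambda2 = 0.0001
    best_acc = 0.0
    for epoch in range(max_epochs):
        net.train()
        for patches, labels in train_loader:         # mini-batches of 128
            optimizer.zero_grad()
            loss = criterion(net(patches), labels)
            loss.backward()
            optimizer.step()
        val_acc = evaluate(net, val_loader)
        if val_acc > best_acc:                       # early stopping: keep the
            best_acc = val_acc                       # best validation model
            torch.save(net.state_dict(), 'best_candidate_net.pt')
```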



Figure 2: An illustrated example of extracting lacune candidates from the (possibly attached) segmentations: (a) original FLAIR image, (b) candidate segmentation, (c) candidate extraction.

3.1.3. Fully Convolutional Segmentation and Candidate Extraction

A sliding-window patch-based segmentation approach is slow, since independently convolving the overlapping patches of neighboring voxels imposes highly redundant processing. We therefore utilized a fully convolutional approach for our lacune segmentation. Although the CNN described in subsection 3.1.2 was trained with patches, we can reformulate the trained fully connected layers into equivalent convolutional filter counterparts [38]. However, due to the presence of max pooling and convolutional filters, the resulting dense prediction is smaller than the original image size. We therefore used the shift-and-stitch method [38] to up-sample the dense predictions into a full-size image segmentation.
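The reformulation of fully connected layers into convolutions can be sketched as follows, reusing the `CandidateNet` from the earlier sketch (the `classifier` indices below are specific to that sketch): the first FC layer becomes a convolution whose kernel covers the whole 14×14 feature map, and the subsequent FC layers become 1×1 convolutions.

```python
import torch.nn as nn

def fc_to_conv(fc: nn.Linear, in_channels: int, k: int) -> nn.Conv2d:
    """Copy a trained fully connected layer into an equivalent convolution
    whose k x k kernel spans the feature map it used to flatten."""
    conv = nn.Conv2d(in_channels, fc.out_features, kernel_size=k)
    conv.weight.data = fc.weight.data.view(fc.out_features, in_channels, k, k)
    conv.bias.data = fc.bias.data.clone()
    return conv

# FC(110*14*14 -> 300) becomes a 14x14 convolution; the remaining
# FC layers (300 -> 200 -> 2) become 1x1 convolutions, so the network
# can slide over a whole slice and emit a dense likelihood map.
dense_head = nn.Sequential(
    fc_to_conv(net.classifier[1], 110, 14), nn.ReLU(),
    fc_to_conv(net.classifier[4], 300, 1), nn.ReLU(),
    fc_to_conv(net.classifier[7], 200, 1),
)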

A coarse segmentation might merge the segments of two or more nearby candidates. To recover representative points for such possibly attached segments, we performed local maxima extraction with a sliding 2D 10×10 window on the likelihoods provided by the CNN (see Figure 2), followed by filtering out the local maxima with a likelihood lower than 0.1. This threshold value was optimized on the validation set as a compromise between sensitivity and the number of extracted candidates (0.93 sensitivity with 4.8 candidates per slice on average).
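Per-slice local maxima extraction can be sketched with a maximum filter, as below; this is an illustration under the assumption that slices lie along the third array axis, not the authors' implementation.

```python
import numpy as np
from scipy.ndimage import maximum_filter

def extract_candidates(likelihood: np.ndarray, win: int = 10, thr: float = 0.1):
    """Return (x, y, z, likelihood) tuples for per-slice local maxima of the
    CNN likelihood map, discarding maxima below the 0.1 threshold."""
    candidates = []
    for z in range(likelihood.shape[2]):
        sl = likelihood[:, :, z]
        peaks = (sl == maximum_filter(sl, size=win)) & (sl > thr)
        for x, y in zip(*np.nonzero(peaks)):
            candidates.append((x, y, z, sl[x, y]))
    return candidates
```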

3.2. False Positive Reduction

We trained a 3D CNN to classify each detected candidate as either a lacune or a false positive. Contextual information plays an important role in the task at hand, as one of the most challenging problems in the detection of lacunes is the differentiation between lacunes and enlarged perivascular spaces. Since perivascular spaces prominently occur in the basal ganglia, location information can be used as a potentially effective discriminative factor. Therefore, similar to [33], we employ two mechanisms to provide the network with contextual information: multi-scale analysis and the integration of explicit location features into the CNN. Integration of explicit location features has also recently been tried for the scoring of coronary calcium in cardiac CT [53, 54]; however, we add the features to the first fully connected layer, which has been shown to be more effective [33]. These mechanisms are explained in 3.2.2.

3.2.1. Sampling

We captured 3D patches surrounding each candidate at three different scales, 32×32×5, 64×64×5 and 128×128×5, from the FLAIR and T1 modalities, which form the different channels of the input. We down-sample the two larger-scale patches to match the size of the smallest scale (32×32×5). This is motivated by the fact that the larger-scale patches mainly aim to provide general contextual information rather than detail, which is supplied by the smallest-scale patch.

We used all the lacunes as positive samples and augmented them by cropping all possible 32×32 patches from a larger 42×42 neighborhood and also by horizontally flipping the patches. This yielded an augmentation factor of 11×11×2 = 242. We randomly picked an equal number of negative samples from the non-lacune candidates. To prevent information leakage from the augmentation operation, we applied random cropping to the negative samples as well; otherwise, the network could have learned that patches in which the lacune-like candidate is not located at the center are more likely to be positive. The created input patches were normalized and zero-centered. This sampling process resulted in datasets of 385k and 35k samples for training and validation purposes, respectively.
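Multi-scale patch extraction can be sketched as below; border handling is omitted, and applying the normalization per patch (rather than per dataset) is our assumption.

```python
import numpy as np
from scipy.ndimage import zoom

def multiscale_patches(volume: np.ndarray, x: int, y: int, z: int):
    """Extract 32x32x5, 64x64x5 and 128x128x5 patches centred on a candidate
    and down-sample the two larger ones in-plane to 32x32x5."""
    patches = []
    for s in (32, 64, 128):
        h = s // 2
        p = volume[x - h:x + h, y - h:y + h, z - 2:z + 3]  # s x s x 5
        if s > 32:
            p = zoom(p, (32 / s, 32 / s, 1), order=1)      # linear down-sampling
        patches.append((p - p.mean()) / (p.std() + 1e-8))  # zero-centre, normalize
    return patches  # three 32x32x5 arrays, one per scale
```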

3.2.2. Network Architecture and Training Procedure

Referring to Figure 3, we utilized a late fusion architecture to process the multi-scale patches. Each of the three different scales streamed into a stack of 6 convolutional layers, with weight sharing among the streams. Each stack of 6 convolutional layers consisted of 64, 64, 128, 128, 256 and 256 filters of size 3×3×2, 3×3×2, 3×3×1, 3×3×1, 3×3×1 and 3×3×1, respectively. We applied a single 2×2×1 pooling layer after the second convolutional layer. The resulting feature maps were compressed with three separate fully connected layers of 300 neurons each and were concatenated. At this stage, we embedded seven explicit location features to form a feature vector of size 907, which represents the local appearance of the candidate at different scales together with information about where the candidate is located. The seven integrated features describe, for each candidate, the x, y and z coordinates of the corresponding location in the atlas space, and its distances to several brain landmarks: the right and the left ventricles, the cortex and the midsagittal brain surface. The resulting 907 neurons were then fully connected to two more fully connected layers with 200 and 2 neurons.



Figure 3: 3D multi-scale location-aware CNN architecture for false positive reduction.

The resulting activations were finally fed into a softmax classifier. The activations of all the layers were batch-normalized.

The details of the training procedure were as follows: stochastic gradient descent with the Adam update rule and a mini-batch size of 128; ReLU activation units with He weight initialization; a dropout rate of 0.5 on the fully connected layers and L2 regularization with λ2 = 2e-5; a decaying learning rate with an initial value of 5e-4 and a decay factor of 2 applied whenever the training accuracy dropped; training for 40 epochs; and selection of the model that achieved the best accuracy on the validation set.
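A minimal PyTorch sketch of this late-fusion network is given below. The feature-map sizes follow from the stated 32×32×5 inputs and kernel sizes; applying a ReLU to the per-scale 300-unit compression layers and the placement of batch normalization are our assumptions.

```python
import torch
import torch.nn as nn

class FPReductionNet(nn.Module):
    """Late-fusion 3D CNN: one weight-shared convolutional stack applied to
    three scales, per-scale FC compression to 300 features, concatenation
    with 7 explicit location features (907 total), then FC layers of 200
    and 2 neurons."""
    def __init__(self, in_channels: int = 2):          # FLAIR + T1
        super().__init__()
        self.stack = nn.Sequential(                    # shared across scales
            nn.Conv3d(in_channels, 64, (2, 3, 3)), nn.BatchNorm3d(64), nn.ReLU(),
            nn.Conv3d(64, 64, (2, 3, 3)), nn.BatchNorm3d(64), nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),
            nn.Conv3d(64, 128, (1, 3, 3)), nn.BatchNorm3d(128), nn.ReLU(),
            nn.Conv3d(128, 128, (1, 3, 3)), nn.BatchNorm3d(128), nn.ReLU(),
            nn.Conv3d(128, 256, (1, 3, 3)), nn.BatchNorm3d(256), nn.ReLU(),
            nn.Conv3d(256, 256, (1, 3, 3)), nn.BatchNorm3d(256), nn.ReLU(),
            nn.Flatten(),
        )
        self.compress = nn.ModuleList(                 # one FC(300) per scale
            [nn.Linear(256 * 3 * 6 * 6, 300) for _ in range(3)])
        self.head = nn.Sequential(                     # 3*300 + 7 = 907
            nn.Linear(907, 200), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(200, 2))

    def forward(self, scales, loc):                    # scales: three tensors
        feats = [torch.relu(fc(self.stack(s)))         # of shape (N, 2, 5, 32, 32)
                 for fc, s in zip(self.compress, scales)]
        return self.head(torch.cat(feats + [loc], dim=1))

net = FPReductionNet()
logits = net([torch.randn(4, 2, 5, 32, 32) for _ in range(3)],
             torch.randn(4, 7))                       # 7 location features
```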

3.2.3. Test-time Augmentation

It has been reported that applying a set of augmentations at test time and aggregating the predictions over the different variants can be beneficial [55]. Motivated by this, we also performed test-time augmentation by cropping and flipping the patches (as explained in subsection 3.2.1) and then averaged the predictions over the resulting 242 variants per sample.
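A sketch of this aggregation step, assuming the `FPReductionNet` interface from the previous sketch:

```python
import torch

def predict_tta(net, patch_variants, loc):
    """Average softmax predictions over the 242 crop/flip variants of one
    candidate (11 x 11 crop offsets x horizontal flip)."""
    net.eval()
    probs = []
    with torch.no_grad():
        for scales in patch_variants:      # each: three (1, 2, 5, 32, 32) tensors
            probs.append(torch.softmax(net(scales, loc), dim=1))
    return torch.stack(probs).mean(dim=0)  # aggregated lacune likelihood
```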

3.3. Observer Study

Since an important ultimate goal of the computer-aided diagnosis field is to establish automated methods that perform similarly to, or exceed, experienced human observers, we conducted an observer study in which four trained observers also rated the test set, and we compared the performance of the CAD system with theirs. The training procedure was as follows: the observers first had a session on the definition of lacunes, their appearance on the different modalities (FLAIR and T1), and similar-looking structures such as perivascular spaces and their discriminating features, following the conventions defined in the established standards for SVD research [7]. Each observer then separately rated 20 randomly selected subjects from the training set. In a subsequent consensus meeting, the observers discussed the lacunes they had detected or missed on this set of images. After the training procedure, each observer independently marked the lacunes by selecting a single representative point for each lacune appearance on each slice.

3.4. Experimental Setup

3.4.1. FROC

We performed a free-response operating characteristic (FROC) analysis to evaluate the performance of the proposed CAD system and compare it to the trained human observers. More specifically, to compare the CAD system to the i-th observer, we left observer i out and formed an evaluation reference standard from the remaining three observers. We used majority voting to form the reference standard, meaning that we considered an annotation a lacune if at least 2 out of the 3 remaining observers agreed on it.



Table 1: Number of detected lacunes under different definitions of observer agreement and the corresponding sensitivity of the candidate detector on each set. The last four columns represent the reference standards formed by excluding each observer and performing a majority vote over the remaining observers. The candidate detector produces 4.6 candidates per slice (213 per scan) on average.

Measure \ Reference standard      At least     At least     At least 2 out of 3 excluding
                                  2 out of 4   3 out of 4   Obs.1   Obs.2   Obs.3   Obs.4
Number of detected lacunes            92           38         76      81      51      52
Candidate detector sensitivity       0.97           1        0.97    0.98    0.98    0.98

For both the CAD system and the i-th observer being compared, we considered a detection a true positive if it was closer than 3 mm to a representative lacune marker in the reference standard; otherwise, we counted it as a false positive. Wherever appropriate, we provided the FROC curves with 95% confidence intervals obtained through bootstrapping with 100 bootstraps. For each bootstrap, a new set of scans was constructed by sampling with replacement.
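The matching rule can be sketched as below for a single scan and operating point; coordinates are assumed to be in millimetres, and greedy nearest-marker matching is our simplification of the (unspecified) matching details.

```python
import numpy as np

def froc_point(detections, reference, dist_mm=3.0):
    """Sensitivity and false positive count for one scan: a detection is a
    true positive if it lies within 3 mm of a reference lacune marker."""
    matched = np.zeros(len(reference), dtype=bool)
    fps = 0
    for d in detections:                              # (x, y, z) in mm
        dists = np.array([np.linalg.norm(np.subtract(d, r)) for r in reference])
        if dists.size and dists.min() <= dist_mm:
            matched[dists.argmin()] = True            # hit on the nearest marker
        else:
            fps += 1
    sensitivity = matched.sum() / max(len(reference), 1)
    return sensitivity, fps
```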

3.4.2. Experiments

In our experiments, we first measured results regarding the observer study, including the number of lacunes detected by each observer, the number of lacunes in several agreement sets based on different definitions of agreement, and the performance of our candidate detector (average number of produced candidates and sensitivity on each observer agreement set). We then evaluated and compared the proposed CAD system with the four trained human observers using FROC analysis, followed by another FROC analysis for a feasibility study, in which we showed to what extent a trained human observer would benefit from our proposed CAD approach once the CAD detections are exposed to the observer. More specifically, the CAD markers at an operating point with high specificity (0.88 sensitivity and 0.07 false positives per slice) were shown to the observer, who was asked to check the CAD suggestions and then to add any lacunes that were still missing.

Finally, we show the contribution of two components of our method, namely the mechanisms for integrating contextual information (multi-scale analysis and location feature integration) and the test-time augmentation. To quantify the contribution of these components, we summarize the FROC curves with a single score, defined as the average sensitivity over operating points with false positives below 0.4 per slice. We perform this analysis for the reference standards formed by agreement of at least two or at least three out of the four observers. For these comparisons, we also provide empirical p-values computed based on 100 bootstraps.
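Given the FROC operating points, this summary score reduces to the short sketch below:

```python
import numpy as np

def froc_score(fp_per_slice, sensitivities, fp_max=0.4):
    """Mean sensitivity over all FROC operating points with fewer than
    fp_max (here 0.4) false positives per slice."""
    fp = np.asarray(fp_per_slice)
    se = np.asarray(sensitivities)
    return float(se[fp < fp_max].mean())
```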

4. Results

During the observer study, observers one to four detected 64, 38, 142 and 106 lacune locations, respectively. Table 1 shows the number of lacunes in agreement between observers, based on different definitions of observer agreement, together with the sensitivity of our fully convolutional neural network candidate detector on each agreement set.

Our candidate detector achieves the mentioned sensitivities while producing 4.6 candidates per slice (213 per scan) on average. Figure 4 shows the FROC analyses of the trained observers compared to the corresponding FROC curves of the CAD system, accompanied by 95% confidence intervals. Figure 5 depicts the difference in performance of observer 2 with and without seeing the CAD marks while detecting lacunes.

Figure 6 provides a more general evaluation of the proposed CAD system, using all four observers to form the reference standard based on majority voting (lacunes marked by at least 3 out of 4 observers), as well as an indication of the contribution of each method component. Table 2 summarizes this information by reporting p-values and scores that represent the average sensitivity over operating points with false positives below 0.4 per slice.

To provide information about typical true positives, false positives and false negatives, Figure 7 illustrates the appearance of candidates for three sample cases per category on the FLAIR and T1 slices.

5. Discussion

5.1. Two-stage Approach

In this study, we used a two-stage scheme with two different neural networks for the candidate detection and false positive reduction tasks. The two primary motivations for not using a single network for lacune segmentation are the following. First, our approach is more computationally efficient: the much simpler candidate detector network first cheaply removes the vast majority of voxels that are unlikely to be a lacune.



Figure 4: FROC curves comparing the performance of the trained observers with the proposed CAD system: (a) observer 1, (b) observer 2, (c) observer 3, (d) observer 4. The reference standard for the comparison with observer i is formed by the lacunes that at least 2 out of the 3 remaining observers agree on. Shaded areas indicate 95% confidence intervals.

Subsequently, we apply the more expensive 3D, multi-scale, location-aware network only to the considerably reduced candidate space (4.6 candidates per slice on average). Second, capturing enough samples from the more informative, harder negative voxels that resemble lacunes (e.g. perivascular spaces) would not be possible in a single stage, because the resulting training-dataset imbalance would require us to sample at a low rate from the large pool of negative samples.

5.2. Contribution of Method Ingredients

Referring to Table 2, providing more contextual information through multi-scale analysis and the integration of explicit location features significantly improves the performance of the resulting CAD approach. This is likely because the appearance of lacunes varies across anatomical locations in the brain (e.g. lacunes in the cerebellum usually do not appear with a surrounding hyperintense rim), and because other similar-looking structures occur more prominently in specific locations (e.g. perivascular spaces appear more often in the basal ganglia). Such strategies can be effective not only for this particular task, but also in other biomedical image analysis domains where the anatomical location of imaging features matters.

Referring to Table 2 and Figure 6, we observed that test-time augmentation is another effective component. This is likely because aggregating predictions over an augmented set of pattern representations of a single candidate reduces the chance that the decision rests on a single pattern in the input space that is not well discriminated by the trained neural network.



Table 2: Benefit of context aggregation (multi-scale analysis and location feature integration) and test-time augmentation for the proposed method, analyzed for the reference standards formed by agreement of at least two or three observers out of four. Scores represent the average sensitivity over operating points with false positives below 0.4 per slice.

Measure \ Reference standard agreement          At least 2 out of 4   At least 3 out of 4
Score: proposed CAD                                     0.82                  0.92
Score: no context integration                           0.68                  0.83
p-value: with vs. without context integration          <0.01                  0.02
Score: no test augmentation                             0.76                  0.89
p-value: with vs. without test augmentation             0.03                  0.06

Figure 5: Improvement of observer 2 once shown the CAD system detections while rating the scans.

5.3. Feasibility Study on Improvement of Human Observers Using CAD

Figure 5 shows that a trained human observer can improve considerably once aided by our CAD system. This can be explained by the fact that, in contrast to computer systems, humans require substantial effort to perform an exhaustive search. Showing the markers detected by the CAD system to the human observer therefore eases the task and reduces the probability of missing a lacune.

5.4. Comparison to Other Methods

As mentioned in the introduction, a number of algorithms exist, based either on rule-based methods or on supervised learning with hand-crafted features. However, it is not possible to objectively compare the different methodologies on a unified dataset, as no implementations of these methods are publicly available, and neither are the datasets they were applied to. Since the majority of the other methods also use FROC analysis, we mention here the results reported on their exclusive datasets, just to provide a general idea of the performance of the other methods.

Figure 6: Contribution of the different method components, considering agreement of at least 3 out of 4 observers as the reference standard.

Yokoyama et al. [10] report a sensitivity of 90.1% with 1.7 false positives per slice on average. The three later methods by Uchiyama et al., using different false positive reduction approaches, were all reported to have a sensitivity of 0.968, with 0.76 false positives per slice for the method that used a rule-based classifier and a support vector machine [11], 0.3 false positives for the rule-based classifier, neural network and modular classifier [12], and 0.71 for the eigenspace template matching method [16]. At an average of 0.13 false positives per slice, our method detects 97.4% of the lacunes that the majority of the four observers agreed on. We should emphasize, however, that since the underlying disorders of the test population, the MR imaging protocols and the reference standard can all influence the results, this does not provide a fair comparison between the different methods. Therefore, in our study we chose to compare our automated method to trained human observers who rated the same set of images.



Figure 7: FLAIR (left) and T1 (right) crops for sample cases of true positives ((a)-(c)), false positives ((d)-(f)) and false negatives ((g)-(i)), with the reference standard formed as the majority of the four observers (at least three out of four) and a threshold of 0.6 (0.7 sensitivity and 0.02 false positives per slice).

6. Conclusion

In this study, we proposed an automated deep learning based method that was able to detect 97.4% of the lacunes that the majority of the four trained observers agreed on, with 0.13 false positives per slice. We showed that integrating contextual information and test-time augmentation are effective components of this methodology. We also showed in a feasibility study that a trained observer potentially improves when using the presented CAD system.

7. Acknowledgments

This work was supported by a VIDI innovational grant from the Netherlands Organisation for Scientific Research (NWO, grant 016.126.351). The authors would also like to acknowledge Inge van Uden, Renate Arntz, Valerie Lohner and Steffen van den Broek for their valuable contributions to this study.

References

[1] S. E. Vermeer, M. Hollander, E. J. van Dijk, A. Hofman, P. J. Koudstaal, M. M. Breteler, Silent brain infarcts and white matter lesions increase stroke risk in the general population: the Rotterdam Scan Study, Stroke 34 (5) (2003) 1126–1129.

[2] M. Santos, G. Gold, E. Kövari, F. R. Herrmann, V. P. Bozikas, C. Bouras, P. Giannakopoulos, Differential impact of lacunes and microvascular lesions on poststroke depression, Stroke 40 (11) (2009) 3557–3562.

[3] P. Choi, M. Ren, T. G. Phan, M. Callisaya, J. V. Ly, R. Beare, W. Chong, V. Srikanth, Silent infarcts and cerebral microbleeds modify the associations of white matter lesions with gait and postural stability: a population-based study, Stroke 43 (6) (2012) 1505–1510.

[4] D. A. Snowdon, L. H. Greiner, J. A. Mortimer, K. P. Riley, P. A. Greiner, W. R. Markesbery, Brain infarction and the clinical expression of Alzheimer disease: the Nun Study, JAMA 277 (10) (1997) 813–817.

[5] C. Franke, J. Van Swieten, J. Van Gijn, Residual lesions on computed tomography after intracerebral hemorrhage, Stroke 22 (12) (1991) 1530–1533.

[6] J. M. Wardlaw, What is a lacune?, Stroke 39 (11) (2008) 2921–2922.

[7] J. M. Wardlaw, E. E. Smith, G. J. Biessels, C. Cordonnier, F. Fazekas, R. Frayne, R. I. Lindley, J. T. O'Brien, F. Barkhof, O. R. Benavente, et al., Neuroimaging standards for research into small vessel disease and its contribution to ageing and neurodegeneration, The Lancet Neurology 12 (8) (2013) 822–838.

[8] F. Moreau, S. Patel, M. L. Lauzon, C. R. McCreary, M. Goyal, R. Frayne, A. M. Demchuk, S. B. Coutts, E. E. Smith, Cavitation after acute symptomatic lacunar stroke depends on time, location, and MRI sequence, Stroke 43 (7) (2012) 1837–1842.

[9] I. A. Awad, P. C. Johnson, R. Spetzler, J. Hodak, Incidental subcortical lesions identified on magnetic resonance imaging in the elderly. II. Postmortem pathological correlations, Stroke 17 (6) (1986) 1090–1097.

[10] R. Yokoyama, X. Zhang, Y. Uchiyama, H. Fujita, Z. Xiangrong, M. Kanematsu, T. Asano, H. Kondo, S. Goshima, H. Hoshi, et al., Development of an automated method for the detection of chronic lacunar infarct regions in brain MR images, IEICE Transactions on Information and Systems 90 (6) (2007) 943–954.

[11] Y. Uchiyama, R. Yokoyama, H. Ando, T. Asano, H. Kato, H. Yamakawa, H. Yamakawa, T. Hara, T. Iwama, H. Hoshi, et al., Computer-aided diagnosis scheme for detection of lacunar infarcts on MR images, Academic Radiology 14 (12) (2007) 1554–1561.

[12] Y. Uchiyama, R. Yokoyama, H. Ando, T. Asano, H. Kato, H. Yamakawa, H. Yamakawa, T. Hara, T. Iwama, H. Hoshi, et al., Improvement of automated detection method of lacunar infarcts in brain MR images, in: 2007 29th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, IEEE, 2007, pp. 1599–1602.

[13] Y. Uchiyama, T. Asano, T. Hara, H. Fujita, H. Hoshi, T. Iwama, Y. Kinosada, CAD scheme for differential diagnosis of lacunar infarcts and normal Virchow-Robin spaces on brain MR images, in: World Congress on Medical Physics and Biomedical Engineering, September 7-12, 2009, Munich, Germany, Springer, 2009, pp. 126–128.

[14] Y. Uchiyama, T. Kunieda, T. Asano, H. Kato, T. Hara, M. Kanematsu, T. Iwama, H. Hoshi, Y. Kinosada, H. Fujita, Computer-aided diagnosis scheme for classification of lacunar infarcts and enlarged Virchow-Robin spaces in brain MR images, in: 2008 30th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, IEEE, 2008, pp. 3908–3911.

[15] Y. Uchiyama, T. Asano, H. Kato, T. Hara, M. Kanematsu, H. Hoshi, T. Iwama, H. Fujita, Computer-aided diagnosis for detection of lacunar infarcts on MR images: ROC analysis of radiologists' performance, Journal of Digital Imaging 25 (4) (2012) 497–503.

[16] Y. Uchiyama, A. Abe, C. Muramatsu, T. Hara, J. Shiraishi, H. Fujita, Eigenspace template matching for detection of lacunar infarcts on MR images, Journal of Digital Imaging 28 (1) (2015) 116–122.

[17] Y. Wang, J. A. Catindig, S. Hilal, H. W. Soon, E. Ting, T. Y. Wong, N. Venketasubramanian, C. Chen, A. Qiu, Multi-stage segmentation of white matter hyperintensity, cortical and lacunar infarcts, NeuroImage 60 (4) (2012) 2379–2388.

[18] Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature 521 (7553) (2015) 436–444.

[19] J. Schmidhuber, Deep learning in neural networks: An overview, Neural Networks 61 (2015) 85–117.

[20] D. Cireşan, U. Meier, J. Masci, J. Schmidhuber, Multi-column deep neural network for traffic sign classification, Neural Networks 32 (2012) 333–338.

[21] K. He, X. Zhang, S. Ren, J. Sun, Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification, arXiv preprint arXiv:1502.01852.

[22] Y. Taigman, M. Yang, M. Ranzato, L. Wolf, DeepFace: Closing the gap to human-level performance in face verification, in: Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, 2014, pp. 1701–1708.

[23] D. Ciresan, A. Giusti, L. M. Gambardella, J. Schmidhuber, Deep neural networks segment neuronal membranes in electron microscopy images, in: Advances in Neural Information Processing Systems, 2012, pp. 2843–2851.

[24] D. Cireşan, J. Schmidhuber, Multi-column deep neural networks for offline handwritten Chinese character classification, arXiv preprint arXiv:1309.0261.

[25] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE 86 (11) (1998) 2278–2324.

[26] H. Greenspan, B. van Ginneken, R. M. Summers, Guest editorial: Deep learning in medical imaging: Overview and future promise of an exciting new technique, IEEE Transactions on Medical Imaging 35 (5) (2016) 1153–1159.

[27] J. Kleesiek, G. Urban, A. Hubert, D. Schwarz, K. Maier-Hein, M. Bendszus, A. Biller, Deep MRI brain extraction: a 3D convolutional neural network for skull stripping, NeuroImage 129 (2016) 460–469.

[28] W. Zhang, R. Li, H. Deng, L. Wang, W. Lin, S. Ji, D. Shen, Deep convolutional neural networks for multi-modality isointense infant brain image segmentation, NeuroImage 108 (2015) 214–224.

[29] P. Moeskops, M. A. Viergever, A. M. Mendrik, L. S. de Vries, M. J. Benders, I. Išgum, Automatic segmentation of MR brain images with a convolutional neural network, IEEE Transactions on Medical Imaging 35 (5) (2016) 1252–1261.

[30] S. Pereira, A. Pinto, V. Alves, C. A. Silva, Brain tumor segmentation using convolutional neural networks in MRI images, IEEE Transactions on Medical Imaging 35 (5) (2016) 1240–1251.

[31] M. Havaei, A. Davy, D. Warde-Farley, A. Biard, A. Courville, Y. Bengio, C. Pal, P.-M. Jodoin, H. Larochelle, Brain tumor segmentation with deep neural networks, Medical Image Analysis.

[32] Q. Dou, H. Chen, L. Yu, L. Zhao, J. Qin, D. Wang, V. C. Mok, L. Shi, P.-A. Heng, Automatic detection of cerebral microbleeds from MR images via 3D convolutional neural networks, IEEE Transactions on Medical Imaging 35 (5) (2016) 1182–1195.

[33] M. Ghafoorian, N. Karssemeijer, T. Heskes, I. van Uden, C. Sanchez, G. Litjens, F.-E. de Leeuw, B. van Ginneken, E. Marchiori, B. Platel, Location sensitive deep convolutional neural networks for segmentation of white matter hyperintensities, arXiv preprint arXiv:1610.04834.

[34] T. Brosch, L. Y. Tang, Y. Yoo, D. K. Li, A. Traboulsee, R. Tam, Deep 3D convolutional encoder networks with shortcuts for multiscale feature integration applied to multiple sclerosis lesion segmentation, IEEE Transactions on Medical Imaging 35 (5) (2016) 1229–1239.

[35] T. Brosch, Y. Yoo, L. Y. Tang, D. K. Li, A. Traboulsee, R. Tam, Deep convolutional encoder networks for multiple sclerosis lesion segmentation, in: Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015, Springer, 2015, pp. 3–11.

[36] K. Kamnitsas, C. Ledig, V. F. Newcombe, J. P. Simpson, A. D. Kane, D. K. Menon, D. Rueckert, B. Glocker, Efficient multi-scale 3D CNN with fully connected CRF for accurate brain lesion segmentation, arXiv preprint arXiv:1603.05959.

[37] M. Ghafoorian, N. Karssemeijer, T. Heskes, I. van Uden, F.-E. de Leeuw, E. Marchiori, B. van Ginneken, B. Platel, Non-uniform patch sampling with deep convolutional neural networks for white matter hyperintensity segmentation, in: International Symposium on Biomedical Imaging (ISBI), IEEE, 2016, pp. 1414–1417.

[38] J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.

[39] M. Ghafoorian, N. Karssemeijer, I. van Uden, F. E. de Leeuw, T. Heskes, E. Marchiori, B. Platel, Small white matter lesion detection in cerebral small vessel disease, in: SPIE Medical Imaging, International Society for Optics and Photonics, 2015, pp. 941411–941411.

[40] A. G. van Norden, K. F. de Laat, R. A. Gons, I. W. van Uden, E. J. van Dijk, L. J. van Oudheusden, R. A. Esselink, B. R. Bloem, B. G. van Engelen, M. J. Zwarts, I. Tendolkar, M. G. Olde-Rikkert, M. J. van der Vlugt, M. P. Zwiers, D. G. Norris, F. E. de Leeuw, Causes and consequences of cerebral small vessel disease. The RUN DMC study: a prospective cohort study. Study rationale and protocol, BMC Neurology 11 (2011) 29.

[41] L. C. Rutten-Jacobs, N. A. Maaijwee, R. M. Arntz, M. E. Van Alebeek, P. Schaapsmeerders, H. C. Schoonderwaldt, L. D. Dorresteijn, S. Overeem, G. Drost, M. C. Janssen, et al., Risk factors and prognosis of young stroke. The FUTURE study: a prospective cohort study. Study rationale and protocol, BMC Neurology 11 (1) (2011) 1.

[42] M. Jenkinson, S. Smith, A global optimisation method for robust affine registration of brain images, Medical Image Analysis 5 (2) (2001) 143–156.

[43] J. Mazziotta, A. Toga, A. Evans, P. Fox, J. Lancaster, K. Zilles, R. Woods, T. Paus, G. Simpson, B. Pike, C. Holmes, L. Collins, P. Thompson, D. MacDonald, M. Iacoboni, T. Schormann, K. Amunts, N. Palomero-Gallagher, S. Geyer, L. Parsons, K. Narr, N. Kabani, G. Le Goualher, J. Feidler, K. Smith, D. Boomsma, H. Hulshoff Pol, T. Cannon, R. Kawashima, B. Mazoyer, A four-dimensional probabilistic atlas of the human brain, Journal of the American Medical Informatics Association 8 (5) (2001) 401–430.

[44] M. Jenkinson, C. F. Beckmann, T. E. Behrens, M. W. Woolrich, S. M. Smith, FSL, NeuroImage 62 (2) (2012) 782–790.

[45] S. M. Smith, Fast robust automated brain extraction, Human Brain Mapping 17 (3) (2002) 143–155.

[46] Y. Zhang, M. Brady, S. Smith, Segmentation of brain MR images through a hidden Markov random field model and the expectation-maximization algorithm, IEEE Transactions on Medical Imaging 20 (1) (2001) 45–57.

[47] D. Scherer, A. Müller, S. Behnke, Evaluation of pooling operations in convolutional architectures for object recognition, in: Artificial Neural Networks–ICANN 2010, Lecture Notes in Computer Science (LNCS 6354), Springer, 2010, pp. 92–101.

[48] S. Ioffe, C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, in: Proceedings of The 32nd International Conference on Machine Learning, 2015, pp. 448–456.

[49] L. Bottou, Large-scale machine learning with stochastic gradient descent, in: Proceedings of COMPSTAT'2010, Springer, 2010, pp. 177–186.

[50] D. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980.

[51] A. L. Maas, A. Y. Hannun, A. Y. Ng, Rectifier nonlinearities improve neural network acoustic models, in: Proc. ICML, Vol. 30, 2013.

[52] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: A simple way to prevent neural networks from overfitting, The Journal of Machine Learning Research 15 (1) (2014) 1929–1958.

[53] J. M. Wolterink, T. Leiner, B. D. de Vos, R. W. van Hamersvelt, M. A. Viergever, I. Išgum, Automatic coronary artery calcium scoring in cardiac CT angiography using paired convolutional neural networks, Medical Image Analysis 34 (2016) 123–136, Special Issue on the 2015 Conference on Medical Image Computing and Computer Assisted Intervention. doi:10.1016/j.media.2016.04.004.

[54] J. M. Wolterink, T. Leiner, M. A. Viergever, I. Išgum, Automatic coronary calcium scoring in cardiac CT angiography using convolutional neural networks, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2015, pp. 589–596.

[55] I. Sato, H. Nishimura, K. Yokoi, APAC: Augmented pattern classification with neural networks, arXiv preprint arXiv:1505.03229.
