
Chapter 6

Applications of the ESPNet architecture in medical imaging

Sachin Mehta1

Nicholas Nuechterlein1

Ezgi Mercan2

Beibin Li1

Shima Nofallah1

Wenjun Wu1

Ximing Lu1

Anat Caspi1

Mohammad Rastegari1

Joann Elmore3

Hannaneh Hajishirzi1

Linda Shapiro1

1 University of Washington, Seattle, WA, United States

2 Seattle Children's Hospital, Seattle, WA, United States

3 University of California, Los Angeles, CA, United States

6.1 Introduction

Medical imaging creates visual representations of the human body, including organs and tissues, to aid in diagnosis and treatment. These visual representations are effective in a variety of medical settings and have become integral to clinical decision making. A common example is X-ray-based radiography examinations that are used to capture visual representations of the interior structure of the body. Other examples include Papanicolaou (PAP) smear analysis for cervical cancer screening and whole slide biopsy analysis for multiple kinds of cancer, including breast cancer and melanoma. Today's human physicians are no longer able to interpret unaided the vast amount of information now available in medical imaging. For example, there are hundreds of thousands of cells in some of the whole slide images (WSIs) of a biopsy.

Machine learning is an effective tool that enables machines (or computers) to learn meaningful patterns from medical imaging data, which can be used to build computer-aided diagnostic systems [1,2]. With the recent advancements in hardware technology and the availability of large amounts of medical imaging data, deep learning-based methods, especially convolutional neural networks (CNNs) [3], are gaining attention in medical image analysis [4,5]. Researchers have applied deep learning-based methods to a variety of medical data, such as WSIs [6], magnetic resonance (MR) tumor scans [7], and electron microscopic recordings [8], and to tasks such as cancer diagnosis [2,9] and cellular-level entity detection [8,10].

In this chapter, we describe the ESPNet architecture [11], which has been successfully applied across a variety of visual recognition tasks, including image classification, object detection, semantic segmentation, and medical image analysis [6,12,13]. To demonstrate the modeling power of the ESPNet architecture, we study its application to two different medical imaging tasks: (1) tissue-level segmentation of breast biopsy WSIs and (2) tumor segmentation in 3D brain MR images. Our results show that the ESPNet architecture efficiently learns meaningful representations that allow it to deliver good accuracy on different tasks.

The rest of the chapter is organized as follows. Section 6.2 reviews the different types of convolutions that are used in the ESPNet architecture, which is described in Section 6.3. Experimental results on two different medical imaging datasets are provided in Section 6.4. Section 6.5 concludes the chapter.

6.2 Background

The convolution operation lies at the heart of many computer vision algorithms, including CNNs. In this section, we briefly review the two types of convolutions, standard and dilated, that are used in the ESPNet architecture.

6.2.1 Standard convolution

For a given 2D input image X, the standard convolution moves a kernel K of size n × n (footnote 1) over every spatial location of the input X to produce the output Y. Mathematically, it can be defined as:

Y(i, j) = (X * K)(i, j) = Σ_a Σ_b X(i + a, j + b) · K(a, b),  a, b ∈ {−⌊n/2⌋, …, ⌊n/2⌋}

where * denotes the convolution operation.
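For readers who want to see the operation concretely, here is a minimal NumPy sketch of this "same"-padded standard convolution (the function name is ours, not part of the chapter):

```python
import numpy as np

def conv2d_same(x, k):
    """Standard 2D convolution (cross-correlation form, as in most CNN
    libraries) of a single-channel input x with an n-by-n kernel k,
    zero-padded so the output has the same spatial size as the input."""
    n = k.shape[0]
    p = n // 2                      # n is assumed odd (see footnote 1)
    xp = np.pad(x, p)               # zero padding (the white cells in Fig. 6.1)
    y = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            y[i, j] = np.sum(xp[i:i + n, j:j + n] * k)
    return y

# A 3x3 averaging kernel applied to a constant image leaves it unchanged
# away from the border, where the zero padding lowers the response.
y = conv2d_same(np.ones((5, 5)), np.full((3, 3), 1.0 / 9.0))
```

The corner response is 4/9 rather than 1.0 because five of the nine kernel taps fall on the zero padding.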

6.2.2 Dilated convolution

Dilated convolutions (footnote 2) are a special form of standard convolutions in which holes are inserted between kernel elements to increase the effective receptive field. These convolutions have been widely used in semantic segmentation networks [14,15]. For a given 2D input image X, the dilated convolution moves a kernel K of size n × n with a dilation rate of r over every spatial location of the input X to produce the output Y. Mathematically, it can be defined as:

Y(i, j) = (X *_r K)(i, j) = Σ_a Σ_b X(i + r·a, j + r·b) · K(a, b)

where *_r denotes the dilated convolution operation with a dilation rate of r. The effective receptive field of a dilated convolutional kernel of size n × n with a dilation rate of r is [n + (n − 1)(r − 1)] × [n + (n − 1)(r − 1)]. Note that dilated convolutions are the same as standard convolutions when the dilation rate r is 1. An example comparing dilated and standard convolutions is shown in Fig. 6.1.
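A plain NumPy sketch of the dilated convolution (names are ours; r = 1 reproduces the standard convolution):

```python
import numpy as np

def dilated_conv2d_same(x, k, r):
    """2D dilated convolution with dilation rate r: the n x n kernel taps
    are spaced r pixels apart, giving an effective receptive field of
    n + (n - 1)(r - 1) pixels per side; r = 1 is the standard convolution."""
    n = k.shape[0]
    n_eff = n + (n - 1) * (r - 1)          # effective receptive field per side
    p = n_eff // 2                         # zero padding for a 'same'-sized output
    xp = np.pad(x, p)
    y = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            y[i, j] = sum(xp[i + a * r, j + b * r] * k[a, b]
                          for a in range(n) for b in range(n))
    return y

# A single active pixel convolved with a 3x3 kernel at dilation 2 touches
# only every other output pixel: the gridding pattern of Fig. 6.3.
x = np.zeros((5, 5))
x[2, 2] = 1.0
y = dilated_conv2d_same(x, np.ones((3, 3)), r=2)
```

With r = 2, the nine responses land on a sparse checkerboard; with r = 1 they form a dense 3 × 3 block around the active pixel.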

6.3 The ESPNet architecture

In this section, we elaborate on the details of the ESPNet architecture. We first describe the ESP module, the core building block of the ESPNet architecture, and then describe different variants of the ESPNet architecture for analyzing different types of medical images, such as whole slide biopsy images and 3D brain tumor images.

6.3.1 Efficient spatial pyramid unit

Figure 6.1 A comparison between standard and dilated convolution operations. Input X is padded with zeros (white cells) so that the resulting output after the convolution operation is of the same size as that of the input.

The core building block of the ESPNet architecture is the efficient spatial pyramid (ESP) unit. To be efficient, the ESP unit decomposes the standard convolution into a point-wise convolution and a spatial pyramid of dilated convolutions using the RSTM (reduce, split and transform, and merge) principle:

• Reduce: The ESP unit applies point-wise (footnote 3) (or 1 × 1) convolutions to an input tensor of size W × H × M to produce an output tensor of size W × H × d, where W and H represent the width and height of the tensor, whereas M and d represent the number of channels in the input and output tensors.

• Split and Transform: To learn spatial representations from a large receptive field efficiently, the ESP unit applies K n × n dilated convolutions simultaneously, with different dilation rates, to the reduced tensor to produce outputs Y_k, k = 1, …, K. The ESP unit uses a different dilation rate, 2^(k−1), in each branch. This allows the ESP unit to learn spatial representations from an effective receptive field of [(n − 1)2^(K−1) + 1] × [(n − 1)2^(K−1) + 1], where n denotes the kernel size.

• Merge: The ESP unit concatenates these K d-dimensional feature maps to produce an M-dimensional feature map Y = concat(Y_1, …, Y_K), where concat represents the concatenation operation and M = Kd.

Fig. 6.2 compares the ESP unit with the standard convolution. The ESP unit learns Md + Kn²d² parameters and performs WH(Md + Kn²d²) operations. Compared to the n²M² parameters and WHn²M² operations of the standard convolution, the RSTM principle reduces the total number of parameters and operations by a factor of n²K/(1 + n²) (roughly K for a 3 × 3 kernel), while simultaneously increasing the effective receptive field to approximately (n − 1)2^(K−1) + 1 pixels per side.
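With d = M/K reduced channels and dilation rates 2^(k−1), this bookkeeping can be checked in a few lines (a sketch under those assumptions; the function name is ours):

```python
def esp_costs(M, K, n=3):
    """Parameters of a standard n x n convolution (M -> M channels) versus
    the ESP decomposition: a point-wise reduction M -> d = M / K followed
    by K parallel n x n dilated convolutions (d -> d each), whose outputs
    are concatenated back to M channels. Also returns the effective
    receptive field per side for dilation rates 2**(k-1), k = 1..K."""
    assert M % K == 0
    d = M // K
    standard = n * n * M * M               # standard convolution
    esp = M * d + K * n * n * d * d        # reduce + split-and-transform
    rf = (n - 1) * 2 ** (K - 1) + 1        # effective receptive field per side
    return standard, esp, rf

# 147,456 vs 40,960 parameters (a factor of n^2 K / (1 + n^2) = 3.6 for
# n = 3, K = 4), while the per-side receptive field grows from 3 to 17.
standard, esp, rf = esp_costs(M=128, K=4)
```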

6.3.1.1 Hierarchical feature fusion for degridding in the efficient spatial pyramid unit

Dilated convolutions insert zeros, controlled by the dilation rate, between kernel elements to increase the receptive field of the kernel. However, this introduces unwanted checkerboard or gridding artifacts in the output, as shown in Fig. 6.3. The ESP unit introduces a hierarchical feature fusion (HFF) method to remove these artifacts in a computationally efficient manner (footnote 4). HFF hierarchically adds the branch outputs Y_k (i.e., Y_k = Y_k ⊕ Y_(k−1) for k = 2, …, K) and then concatenates them to produce a high-dimensional feature map, where ⊕ represents the element-wise addition operation. Fig. 6.4 visualizes the ESP unit with and without HFF.
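A minimal NumPy sketch of the HFF merge (channel-first feature maps; names are ours):

```python
import numpy as np

def hff_merge(branch_outputs):
    """Hierarchical feature fusion: each branch output is added to the
    running sum of the smaller-dilation branches before the channel-wise
    concatenation, so the sparse responses of the large dilation rates
    are filled in by the denser ones (degridding)."""
    fused = [branch_outputs[0]]
    for y in branch_outputs[1:]:
        fused.append(fused[-1] + y)        # the element-wise addition (⊕)
    return np.concatenate(fused, axis=0)   # channel-wise concatenation

# Three single-channel 2x2 branch outputs with constant values 1, 2, 3
# fuse into channels with values 1, 1 + 2 = 3, and 3 + 3 = 6.
merged = hff_merge([np.full((1, 2, 2), float(v)) for v in (1, 2, 3)])
```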

Figure 6.2 A comparison between the standard convolution and the ESP unit. In the ESP unit, the standard convolution is decomposed into a point-wise convolution and a spatial pyramid of dilated convolutions using the RSTM principle. Note that in dilated convolutions, zeros (represented in white) are inserted between the kernel elements to increase the receptive field.

Figure 6.3 This figure illustrates a gridding artifact example, where a 5 × 5 input with a single active pixel is convolved with a 3 × 3 dilated convolution kernel to produce an output with a gridding artifact. Active input, output, and kernel elements are shown in color.

6.3.2 Segmentation architecture

Most image segmentation architectures follow an encoder–decoder structure [16]. The encoder network is a stack of encoding units, such as the bottleneck unit in ResNet [17], and downsampling units that help the network learn multiscale representations. Spatial information is lost during filtering and downsampling operations in the encoder; the decoder tries to recover this lost information. The decoder can be viewed as an inverse of the encoder that stacks upsampling and decoding units to learn representations that help produce either binary or multiclass segmentation masks. It is important to note that a vanilla encoder–decoder network does not share information between encoding and decoding units at each spatial level. To share such information, U-Net [8] introduces a skip connection between the encoding and decoding units at each spatial level. This connection establishes a direct link between the encoding and decoding units and improves gradient flow, thus improving segmentation performance. Fig. 6.5 compares the vanilla encoder–decoder and U-Net-style encoder–decoder architectures.

The ESPNet architecture for semantic segmentation extends the U-Net. In contrast to using computationally expensive VGG-style [18] blocks for encoding and decoding the information, the ESPNet architecture uses computationally efficient ESP units.

6.4 Experimental results

In this section, we provide results of the ESPNet architecture on two different medical imaging datasets: (1) breast biopsy whole slide imaging and (2) brain tumor segmentation.

6.4.1 Breast biopsy whole slide image dataset

6.4.1.1 Dataset

The breast biopsy dataset consists of 240 WSIs with haematoxylin and eosin (H&E) staining [19,20]. Three expert pathologists independently interpreted each of the 240 cases and then met to review each case together and provide a consensus reference diagnosis label, which we use as the gold standard ground truth for each case. Additionally, the expert pathologists marked 428 regions of interest (ROIs) on these 240 WSIs that were representative of their diagnoses. Of these 428 ROIs, 58 were manually segmented by an expert into eight different tissue-level segmentation labels: background, benign epithelium, malignant epithelium, normal stroma, desmoplastic stroma, secretion, blood, and necrosis. Fig. 6.6 visualizes some ROIs along with their tissue-level segmentation labels. Furthermore, we split these 58 ROIs into two roughly equally sized subsets, a training set (30 ROIs) and a test set (28 ROIs), while keeping the distribution of diagnostic categories similar (see Table 6.1). A total of 87 pathologists who were actively interpreting breast biopsies in their own clinical practices participated in the study and provided one of four diagnostic labels (benign, atypia, ductal carcinoma in situ, and invasive cancer) per WSI of a case. Each pathologist provided diagnostic labels for a randomly assigned subset of 60 patients' WSIs, producing an average of 22 diagnostic labels per case.

Figure 6.4 Block-level visualization of the ESP unit without and with hierarchical feature fusion (HFF).

Figure 6.5 Encoder–decoder architectures for tissue-level semantic segmentation. Colors represent different types of units: encoding unit in green, downsampling unit in red, upsampling unit in orange, and decoding unit in blue.

Table 6.1 Diagnostic category-wise distribution of the 58 regions of interest (ROIs) in the segmentation subset.

| Expert consensus reference diagnostic category | #ROIs (Training) | #ROIs (Test) | #ROIs (Total) |
| Benign | 4 | 5 | 9 |
| Atypia | 11 | 11 | 22 |
| DCIS | 12 | 10 | 22 |
| Invasive | 3 | 2 | 5 |
| Total | 30 | 28 | 58 |

6.4.1.2 Training

The breast biopsy ROIs (and WSIs) have huge spatial dimensions, often on the order of gigapixels. Due to the high spatial dimensionality of these images and the limited computational capability of existing hardware, it is difficult to train conventional CNNs directly on them. Therefore we follow a patch-based approach that splits an ROI (or WSI) into small patches of size 384 × 384 with an overlap of 56 pixels between consecutive patches. We use standard augmentation strategies such as random flipping, cropping, and resizing to prevent overfitting. For training, we split the training set of 30 ROIs into training and validation subsets with a 90:10 ratio. We train our network for 100 epochs using stochastic gradient descent (SGD) with an initial learning rate of 0.0001, which is decayed by a factor of 0.5 after every 30 epochs. To evaluate the tissue-level segmentation performance, we use mean intersection over union (mIOU), a widely used metric for evaluating segmentation performance.
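mIOU can be computed directly from label masks; a small sketch (our helper, not from the chapter):

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean intersection over union: per-class IoU = |pred ∩ gt| / |pred ∪ gt|,
    averaged over the classes present in either mask."""
    ious = []
    for c in range(num_classes):
        p, g = pred == c, gt == c
        union = np.logical_or(p, g).sum()
        if union == 0:                     # class absent everywhere: skip it
            continue
        ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious))

# One of four pixels disagrees: IoU is 1/2 for class 0 and 2/3 for class 1.
score = mean_iou(np.array([[0, 0], [1, 1]]), np.array([[0, 1], [1, 1]]), 2)
```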

The shapes and structures of objects of interest in a breast biopsy vary in size, and splitting such structures into fixed-size patches limits the available contextual information; CNNs therefore tend to make segmentation errors, especially at the borders of patches (see Fig. 6.7). To reduce such errors, we centrally crop each 384 × 384 prediction to a size of 256 × 256, as shown in Fig. 6.8.
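The patch pipeline described above (384 × 384 patches with a 56-pixel overlap, predictions center-cropped to 256 × 256) can be sketched as follows; the helper names and the border-clamping behavior are our assumptions:

```python
import numpy as np

def _starts(length, patch, stride):
    """Patch start offsets along one axis, clamping the last patch to the
    border so that every pixel is covered."""
    s = list(range(0, length - patch + 1, stride))
    if s[-1] != length - patch:
        s.append(length - patch)
    return s

def extract_patches(img, patch=384, overlap=56):
    """Split a large ROI into overlapping square patches; the stride is
    patch - overlap. Returns (top, left, patch_view) tuples."""
    stride = patch - overlap
    h, w = img.shape[:2]
    return [(t, l, img[t:t + patch, l:l + patch])
            for t in _starts(h, patch, stride)
            for l in _starts(w, patch, stride)]

def center_crop(pred, size=256):
    """Keep only the central size x size region of a patch prediction,
    discarding the border pixels where errors concentrate (Fig. 6.8)."""
    m = (pred.shape[0] - size) // 2
    return pred[m:m + size, m:m + size]

# Toy sizes: a 10x10 image, 4x4 patches with a 2-pixel overlap.
patches = extract_patches(np.arange(100).reshape(10, 10), patch=4, overlap=2)
```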

Figure 6.6 A set of ROIs (top row) along with their tissue-level segmentation labels (bottom row) from the dataset.

Figure 6.7 Segmentation errors along the border of the patch.

6.4.1.3 Segmentation results

Tables 6.2 and 6.3 summarize the performance of different segmentation architectures.

Table 6.2 Impact of skip connections and different decoding units.

| Encoding–decoding units | Skip connection (Add) | Skip connection (Concat) | Network parameters | mIOU |
| ESP–ESP | | | 1.95 M | 35.23 |
| ESP–ESP | ✓ | | 1.95 M | 36.19 |
| ESP–ESP | | ✓ | 2.25 M | 38.03 |
| ESP–PSP | | ✓ | 2.75 M | 44.03 |

Table 6.3 Comparison with state-of-the-art methods.

| Method | Network parameters | mIOU |
| Superpixel + SVM | NA | 25.8 |
| SegNet | 12.80 M | 37.6 |
| SegNet + additive skip connections | 12.80 M | 38.1 |
| Multiresolution | 26.03 M | 44.20 |
| Y-Net (ESP–PSP) | 2.75 M | 44.03 |

Notes: The results in the first four rows are taken from Refs [6,11,21].

6.4.1.4 Skip connections

Skip connections between the encoder and the decoder in Fig. 6.5 can be constructed using two operations: (1) element-wise addition and (2) concatenation. The impact of these connections is studied in Table 6.2. We can see that encoder–decoder networks with skip connections constructed using concatenation improve the accuracy of the vanilla encoder–decoder by about 4%, whereas skip connections constructed using element-wise addition improve it by about 2%.
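The parameter gap between the two skip variants in Table 6.2 comes from the channel arithmetic of the first decoder convolution at each spatial level; a sketch (the function name and the single-convolution framing are ours):

```python
def decoder_conv_params(c_enc, c_dec, c_out, n=3, skip="add"):
    """Weights of an n x n decoder convolution fed by a skip connection.
    Element-wise addition keeps the channel count (it requires
    c_enc == c_dec and adds no parameters), while concatenation stacks
    encoder and decoder channels and enlarges the convolution."""
    if skip == "add":
        assert c_enc == c_dec
        c_in = c_dec
    else:                                  # "concat"
        c_in = c_enc + c_dec
    return n * n * c_in * c_out

# Concatenation doubles the input channels of the decoder convolution,
# which is why the concat model in Table 6.2 carries more parameters.
added = decoder_conv_params(64, 64, 64, skip="add")      # 9 * 64 * 64
concat = decoder_conv_params(64, 64, 64, skip="concat")  # 9 * 128 * 64
```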

6.4.1.5 Pyramidal spatial pooling as a decoding unit

Traditionally, the decoding unit in encoder–decoder architectures, such as SegNet [16] and U-Net [8], is the same as the encoding unit. In our recent work [6,11,21], we introduced a general encoder–decoder network, sketched in Fig. 6.5, that allows the use of different encoding and decoding units. When we replaced the ESP unit with the pyramidal spatial pooling (PSP) unit [22] in the decoder, the performance of our network improved by about 6%. This is likely because the pooling operations in the PSP unit allow it to capture better global contextual representations (Fig. 6.9).

Figure 6.8 Center cropping to minimize segmentation errors along patch borders.

6.4.1.6 Comparison with state-of-the-art methods

Table 6.3 compares the performance of different methods. We can see that the asymmetric encoder–decoder structure, with the ESP as the encoding unit and the PSP as the decoding unit, allows us to build a light-weight network that has fewer parameters than the networks of Refs [6,11,21] while delivering similar segmentation performance.

6.4.1.7 Tissue-level segmentation masks for computer-aided diagnosis

Tissue-level segmentation masks provide a powerful abstraction for diagnostic classification. To demonstrate the descriptive power of tissue-level segmentation masks, we extract and study the impact of two features, tissue-distribution and structural features [2], on a four-class breast cancer diagnostic task. For this analysis, we use the 428 ROIs. We use a support-vector machine classifier with a polynomial kernel in a leave-one-out cross-validation strategy. We measure the performance in terms of sensitivity, specificity, and accuracy. Table 6.4 summarizes the diagnostic classification results. We can see that simple features extracted from tissue-level segmentation masks are powerful, and we are able to attain accuracies similar to those of pathologists.

Table 6.4 Impact of different features on diagnostic classification from Ref. [2].

| Diagnostic features | Accuracy | Sensitivity | Specificity |
| Invasive versus noninvasive | | | |
| Tissue-distribution feature | 0.94 | 0.70 | 0.95 |
| Structural feature | 0.91 | 0.49 | 0.96 |
| Pathologists | 0.98 | 0.84 | 0.99 |
| Atypia and DCIS versus benign | | | |
| Tissue-distribution feature | 0.70 | 0.79 | 0.41 |
| Structural feature | 0.70 | 0.85 | 0.45 |
| Pathologists | 0.81 | 0.72 | 0.62 |
| DCIS versus atypia | | | |
| Tissue-distribution feature | 0.83 | 0.88 | 0.78 |
| Structural feature | 0.85 | 0.89 | 0.80 |
| Pathologists | 0.80 | 0.70 | 0.82 |

Notes: The best numbers are indicated in bold.

Figure 6.9 ROI-wise predictions. The first row shows different breast biopsy ROIs, the second row shows tissue-level ground-truth labels, and the third row shows predictions. ROI, region of interest.

6.4.2 Brain tumor segmentation

6.4.2.1 Dataset

The Multimodal Brain Tumor Segmentation Challenge (BraTS) 2018 training set consists of 285 multi-institutional preoperative multimodal MR tumor scans, each consisting of T1, postcontrast T1-weighted (T1ce), T2, and fluid-attenuated inversion recovery (FLAIR) volumes [23]. Each volume is annotated with voxel labels corresponding to the following tumor compartments: enhancing tumor, peritumoral edema, background, and necrotic core and nonenhancing tumor. Necrotic core and nonenhancing tumor share a single label. These data are coregistered to the standard MNI anatomical template, interpolated to the same resolution, and skull-stripped. Ground-truth segmentations are manually drawn by radiologists. Fig. 6.10 visualizes some MR modalities along with their voxel-level segmentation labels.

6.4.2.2 Training

MR scans are high-dimensional: each scan comprises four MR sequence volumes with large spatial dimensions. We use standard augmentation strategies such as random flipping, cropping, and resizing to prevent overfitting. For training, we split the training set of 285 MR scans into train and validation subsets with an 80:20 ratio (228:57). We train our network for 300 epochs using SGD with an initial learning rate of 0.0001, decreased to 0.00001 after 200 epochs. We evaluate the segmentation performance using an online server that measures performance in terms of the Dice score. Furthermore, we adapt the ESP module for volume-wise segmentation by replacing the spatial dilated convolutions in Fig. 6.4 with volume-wise dilated convolutions. For more details, please see Ref. [13].
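The Dice score used by the evaluation server is, for each tumor compartment, a binary overlap measure; a minimal sketch:

```python
import numpy as np

def dice_score(pred, gt):
    """Dice similarity coefficient between two binary masks:
    2 |P ∩ G| / (|P| + |G|), equal to 1.0 for a perfect overlap."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    denom = pred.sum() + gt.sum()
    if denom == 0:
        return 1.0                         # both masks empty: define as perfect
    return 2.0 * np.logical_and(pred, gt).sum() / denom

# Half the predicted voxels overlap the ground truth: 2*1 / (2+2) = 0.5.
d = dice_score(np.array([1, 1, 0, 0]), np.array([1, 0, 1, 0]))
```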

6.4.2.3 Results

Table 6.5 summarizes the results on the BraTS 2018 online test and validation sets. Unlike the winning entries from the competition, which often use specific normalization techniques, model ensembling, and postprocessing methods such as conditional random fields (CRFs) to boost performance [7,24], our network was able to attain good segmentation performance while learning merely 3.8 million parameters, an order of magnitude fewer than existing methods. Furthermore, our visual inspection reveals convincing performance; we display segmentation results overlaid on different modalities in Fig. 6.11. It is important to note that the predictions made by our network are smooth and lack some of the granularity present in the ground-truth segmentations. This is likely because our model is not able to learn such granular details with a limited amount of training data. We believe that postprocessing techniques, such as CRFs, would help our network capture such granular details, and we will study such methods in the future.

Table 6.5 Results obtained by our method [13] on the BraTS 2018 online test and validation sets.

| Networks | Whole tumor | Enhancing tumor | Tumor core |
| Ours (validation) | 0.883 | 0.737 | 0.814 |
| Ours (test) | 0.850 | 0.665 | 0.782 |
| Myronenko [24] | 0.884 | 0.766 | 0.815 |

Notes: We also compare with the winning entry on the test set [24], which uses an ensemble of 10 models and special normalization techniques.

Figure 6.10 A set of MR modalities (left) along with their voxel-level segmentation labels (right). MR, magnetic resonance.

6.4.3 Other applications

The ESPNet architecture is a general-purpose architecture and can be used across a wide variety of applications, ranging from image classification to language modeling. In our work, we have used ESPNets for image classification [12], object detection [12], semantic segmentation [6,11,21], language modeling [12], and autism spectrum disorder prediction [25].

6.5 Conclusion

This chapter describes the ESPNet architecture, which is built using a novel convolutional unit, the ESP unit, introduced in Refs [6,11,21]. To demonstrate the modeling power of the ESPNet architecture on medical imaging, we study it on two different datasets: (1) a breast biopsy WSI dataset and (2) a brain tumor segmentation MR imaging dataset. Our analysis shows that ESPNet can efficiently learn meaningful representations for different medical imaging data.

Acknowledgment

Research reported in this chapter was supported by Washington State Department of Transportation research grant T1461-47, NSF III (1703166), National Cancer Institute awards (R01 CA172343, R01 CA140560, R01 CA200690, and U01 CA231782), the NSF IGERT Program (award no. 1258485), the Allen Distinguished Investigator Award, a Samsung GRO award, and gifts from Google, Amazon, and Bloomberg. The authors would also like to thank Dr. Jamen Bartlett and Dr. Donald L. Weaver for helpful discussions and feedback. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing endorsements, either expressed or implied, of the National Cancer Institute, the National Institutes of Health, or the US Government.

Figure 6.11 This figure visualizes the segmentation results produced by our network overlaid on each of the four MR modalities in the BraTS dataset. MR, magnetic resonance.

References

[1] C.J. Lynch, C. Liston, New machine-learning technologies for computer-aided diagnosis, Nat. Med. 24, 2018, 1304–1305.

[2] E. Mercan, et al., Assessment of machine learning of breast pathology structures for automated differentiation of breast cancer and high-risk proliferative lesions, JAMA Netw. Open 2, 2019, e198777.

[3] Y. LeCun, Y. Bengio, Convolutional networks for images, speech, and time series, in: The Handbook of Brain Theory and Neural Networks, 1995.

[4] G. Litjens, et al., A survey on deep learning in medical image analysis, Med. Image Anal. 42, 2017, 60–88.

[5] D. Shen, G. Wu, H.-I. Suk, Deep learning in medical image analysis, Annu. Rev. Biomed. Eng., 2017, pp. 221–248.

[6] S. Mehta, et al., Y-Net: joint segmentation and classification for diagnosis of breast biopsy images, International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2018, pp. 893–901.

[7] K. Kamnitsas, et al., Ensembles of multiple models and architectures for robust brain tumour segmentation, International MICCAI Brainlesion Workshop, 2017.

[8] O. Ronneberger, P. Fischer, T. Brox, U-Net: convolutional networks for biomedical image segmentation, International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2015, pp. 234–241.

[9] A. Esteva, et al., Dermatologist-level classification of skin cancer with deep neural networks, Nature 542, 2017, 115–118.

[10] D.C. Cireşan, A. Giusti, L.M. Gambardella, J. Schmidhuber, Mitosis detection in breast cancer histology images with deep neural networks, International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2013.

[11] S. Mehta, et al., ESPNet: efficient spatial pyramid of dilated convolutions for semantic segmentation, Proceedings of the European Conference on Computer Vision (ECCV), 2018.

[12] S. Mehta, M. Rastegari, L. Shapiro, H. Hajishirzi, ESPNetv2: a light-weight, power efficient, and general purpose convolutional neural network, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

[13] N. Nuechterlein, S. Mehta, 3D-ESPNet with pyramidal refinement for volumetric brain tumor image segmentation, International MICCAI Brainlesion Workshop, 2018.

[14] L.-C. Chen, et al., DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs, IEEE Trans. Pattern Anal. Mach. Intell., 2017, 834–848.

[15] F. Yu, V. Koltun, Multi-scale context aggregation by dilated convolutions, International Conference on Learning Representations (ICLR), 2016.

[16] V. Badrinarayanan, A. Kendall, R. Cipolla, SegNet: a deep convolutional encoder-decoder architecture for image segmentation, IEEE Trans. Pattern Anal. Mach. Intell., 2017, 2481–2495.

[17] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.

[18] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, International Conference on Learning Representations (ICLR), 2015.

[19] J.G. Elmore, et al., Diagnostic concordance among pathologists interpreting breast biopsy specimens, JAMA 313, 2015, 1122–1132.

[20] J.G. Elmore, et al., A randomized study comparing digital imaging to traditional glass slide microscopy for breast biopsy and cancer diagnosis, J. Pathol. Inform. 8, 2017, 12.

[21] S. Mehta, et al., Learning to segment breast biopsy whole slide images, 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), 2018, pp. 663–672.

[22] H. Zhao, et al., Pyramid scene parsing network, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 2881–2890.

[23] B.H. Menze, et al., The multimodal brain tumor image segmentation benchmark (BRATS), IEEE Trans. Med. Imaging 34, 2015, 1993–2024.

[24] A. Myronenko, 3D MRI brain tumor segmentation using autoencoder regularization, International MICCAI Brainlesion Workshop, 2018.

[25] B. Li, et al., A facial affect analysis system for autism spectrum disorder, IEEE International Conference on Image Processing (ICIP), 2019.

Footnotes

1 For simplicity, we assume that n is an odd number, such as 3, 5, 7, and so on.

2 Dilated convolutions are sometimes also referred to as atrous convolutions, where "trous" means holes in French.


3 Point-wise convolutions are a special case of standard convolutions in which the kernel size n is 1 (i.e., a 1 × 1 kernel). These convolutions produce an output plane by learning linear combinations of the input channels. They are widely used for either channel expansion or channel reduction operations.

4 Gridding artifacts can be removed by adding another convolution layer with no dilation after the concatenation operation in Fig. 6.4 (left); however, this would increase the computational complexity of the block.

Abstract

Medical imaging is a fundamental part of clinical care that creates informative, noninvasive, visual representations of the structure and function of the interior of the body. With advancements in technology and the availability of massive amounts of imaging data, data-driven methods, such as machine learning and data mining, have become popular in medical imaging analysis. In particular, deep learning-based methods, such as convolutional neural networks, now have the requisite volume of data and computational power to be considered practical clinical tools. We describe the architecture of the ESPNet network and provide experimental results for the task of semantic segmentation on two different types of medical images: (1) tissue-level segmentation of breast biopsy whole slide images and (2) 3D tumor segmentation in brain magnetic resonance images. Our results show that the ESPNet architecture is efficient and learns meaningful representations for different types of medical images, which allows ESPNet to perform well on these images.

Keywords: ESPNet; whole slide images; magnetic resonance images; medical imaging; tumor segmentation