
Attention-Based Transformers for Instance Segmentation of Cells in Microstructures

    Tim Prangemeier, Christoph Reich, Heinz Koeppl‡

Centre for Synthetic Biology, Department of Electrical Engineering and Information Technology, Department of Biology,

Technische Universität Darmstadt
‡[email protected]

Abstract—Detecting and segmenting object instances is a common task in biomedical applications. Examples range from detecting lesions on functional magnetic resonance images, to the detection of tumours in histopathological images and extracting quantitative single-cell information from microscopy imagery, where cell segmentation is a major bottleneck. Attention-based transformers are state-of-the-art in a range of deep learning fields. They have recently been proposed for segmentation tasks, where they are beginning to outperform other methods. We present a novel attention-based cell detection transformer (Cell-DETR) for direct end-to-end instance segmentation. While the segmentation performance is on par with a state-of-the-art instance segmentation method, Cell-DETR is simpler and faster. We showcase the method's contribution in the typical use case of segmenting yeast in microstructured environments, commonly employed in systems or synthetic biology. For the specific use case, the proposed method surpasses the state-of-the-art tools for semantic segmentation and additionally predicts the individual object instances. The fast and accurate instance segmentation performance increases the experimental information yield for a posteriori data processing and makes online monitoring of experiments and closed-loop optimal experimental design feasible. Code and data will be made available for the conference at https://git.rwth-aachen.de/bcs/projects/cell-detr.git.

Index Terms—attention, instance segmentation, transformers, single-cell analysis, synthetic biology, microfluidics, deep learning

    I. INTRODUCTION

Instance segmentation is a common task in biomedical applications. It is comprised of both detecting individual object instances and segmenting them [1], [2]. For example, the detection of individual entities such as tumours and cells, as well as the measurement of their shape, are important in healthcare and life sciences. Recent advances in automated single-cell image processing, such as instance segmentation, have contributed to early tumour detection, personalised medicine, biological signal transduction and insight into the mechanisms behind cell population heterogeneity, amongst others [3]–[6]. An example of instance segmentation is shown in Fig. 1, with four separate cell and two trap microstructures detected and segmented individually.

Object detection and panoptic segmentation are closely related to instance segmentation [2], [7]. Carion et al. recently proposed a novel attention-based detection transformer DETR for panoptic segmentation [8]. DETR achieves state-of-the-art panoptic segmentation performance, while exhibiting a comparatively simple architecture that is easier to implement and is computationally more efficient than previous approaches [8]. Its simplicity in comparison to previous approaches promises to be beneficial for its adoption in real-world applications.

Time-lapse fluorescence microscopy (TLFM) is a powerful technique for studying cellular processes in living cells [4], [9], [10]. The vast amount of quantitative data that TLFM yields promises to constitute the backbone for the rational design of de novo biomolecular functionality with quantitatively predictive modelling [10], [11]. Ideally in synthetic biology, well characterised parts and modules are combined in silico in a bottom-up approach [10], [12], for example, to detect and kill cancer cells [13], [14].

Quantitative TLFM with high-throughput microfluidics is an essential technique for concurrently studying the heterogeneity and dynamics of synthetic circuitry on the single-cell level [4], [9], [10]. A typical TLFM experiment yields thousands of specimen images (e.g. Fig. 1) requiring automated segmentation; examples include [5], [15], [16]. Segmenting each individual cell enables pertinent information about each cell's properties to be extracted quantitatively. For example, the abundance of a fluorescent reporter molecule can be measured, giving insight into the inner workings of the cell.

Fig. 1. Schematic of Cell-DETR direct instance segmentation discerning individual cell (colour) and trap microstructure (grey) object instances.

Instance segmentation is a major bottleneck in quantifying single-cell microscopy data and manual analysis is prohibitively labour intensive [9], [10], [15], [17], [18]. The vast majority of single-cell segmentation methods are designed for a posteriori data processing and often require post-processing for instance detection or manual input [9]. This not only restricts the number of experiments that can be performed, but also limits the type of experiments [17], [19]. For example, harnessing the potential of advanced closed-loop optimal experimental design techniques [12], [20] requires online monitoring with fast instance segmentation capabilities. Attention-based methods, such as the recently proposed detection transformer DETR [8], are increasingly outperforming other methods [8], [21]. For the yeast-trap configuration (Fig. 1), instance segmentation methods are yet to be employed and, to the best of our knowledge, no attention-based transformers for instance segmentation exist in general.

In this study, we present Cell-DETR, a novel attention-based detection transformer for instance segmentation of biomedical samples based on DETR [8]. We address the automated cell instance segmentation bottleneck for yeast cells in microstructured environments (Fig. 1) and showcase Cell-DETR on this application. Section II introduces the previous segmentation approaches and the microstructured environment. Our experimental setup for fluorescence microscopy, the tested architectures and our approach to training and evaluation are presented in Section III. We analyse the proposed method's performance in Section IV and compare it to the application-specific state-of-the-art, as well as to a general instance segmentation baseline. After interpreting the results and highlighting the method's future potential in Section V, we summarise and conclude the study in Section VI. Our model surpasses the previous application baseline and is on par with a general state-of-the-art instance segmentation method. The relatively short inference runtimes enable both higher-throughput a posteriori data processing and make online monitoring of experiments with approximately 1000 cell traps feasible.

    II. BACKGROUND

An extensive body of research into the automated processing of microscopy imagery dates back to the middle of the 20th century. Comprehensive reviews of the many methods to segment yeast on microscopy imagery are available elsewhere [3], [19]. Here we focus on dedicated tools for segmenting cells in trap microstructures. U-Net convolutional neural networks (CNNs) with an encoder-decoder architecture bridged by skip connections have been shown to perform semantic segmentation well for E. coli mother machines [9], [18] and yeast in microstructured environments [6]. In the case of trapped yeast, the previous state-of-the-art tool DISCO [15] was based on conventional methods (template matching, support vector machine, active contours), until recently being superseded by U-Nets [6]. The current baseline for semantic segmentation of yeast in microstructured environments, as measured by the cell class intersection-over-union, is 0.82 [6]. Additional post-processing of the segmentation maps is required to attain each individual cell instance [6].

For instance segmentation in general, more recent state-of-the-art methods are available, for example Mask R-CNN [1]. It is a proposal-based instance segmentation model, which combines a CNN backbone, region proposals with non-maximum suppression, region-of-interest (ROI) pooling, and separate prediction heads for instance segmentation [1]. Attention-based methods are increasingly outperforming convolutional methods and are currently state-of-the-art in natural language processing, for example, the GPT-2 model [21], [22]. Beyond natural language processing, GPT-2 also demonstrated promising results for auto-regressive pixel prediction [22]. Axial-attention modules can extract image features and are an alternative to convolutions, which they can outperform [23]. Recently, the first transformer-based method (DETR [8]) for object detection and panoptic segmentation was reported. DETR achieves state-of-the-art results on par with Faster R-CNN and constitutes a promising approach for further improvements in automated object detection and segmentation performance.

Fig. 2. Single-cell fluorescence measurement setup. Microfluidic chip on microscope table (top right), microscope images and trap design of the yeast trap microstructures. The trap chamber (green rectangle) contains an array of approximately 1000 traps. Single specimen images show a pair of microstructures and fluorescent cells; violet contours indicate segmentation of two separate cell instances with corresponding fluorescence measurements F1 and F2; black scale bar 1 mm, white scale bar 10 µm.

The microfluidic trap microstructures we consider here are designed for long-term culture of yeast cells (Saccharomyces cerevisiae) within the focal plane of a microscope for optical access [16]. The environment inside the microchip is tightly controlled and conducive to yeast growth. Examples of its routine use include Fig. 2 and [4]–[6], [10], [15]. A constant flow of yeast growth media hydrodynamically traps the cells in the microstructures and allows the introduction of chemical perturbations. An automated microscope records an entire trap chamber of up to 1000 traps by imaging both the brightfield and fluorescent channels at approximately 20 neighbouring positions. Typical experiments each produce hundreds of GB of image data. Time-lapse recordings allow individual cells to be tracked through time. Robust instance segmentation facilitates tracking [9], [19], which itself can be a limiting factor with regard to the data yield of an experiment [4], [9], [18], [19].

III. METHODOLOGY

A. Live-cell microscopy dataset and annotations

The individual specimen images each contain a single microfluidic trap and some yeast cells, as depicted in Fig. 3. These are extracted from larger microscope recordings, whereby each exposure contains up to 50 traps (Fig. 2 middle). Ideally, a single mother cell persists in each trap, with subsequent daughter cells being removed by the constant flow of yeast growth media. In practice, multiple cells accumulate around some traps, while other traps remain empty (Fig. 3).

We distinguish between three classes on the specimen image annotations, as depicted in Fig. 3. The yeast cells in violet are the most important class for biological applications. To counteract traps being segmented as cells, we employ a distinct class for them (dark grey). The background (light grey) is annotated for semantic segmentation training, for example of U-Nets. For instance segmentation training we introduce a no-object class ∅ in place of the background class.

Fig. 3. Example of class and instance annotations for a specimen image; brightfield image (left), background label in light grey, instances of the trap class in shades of dark grey and instances of the cell class in shades of violet (left to right respectively); scale bar 10 µm.

Each instance of cells or trap structures is annotated individually with a bounding box, class specification and separate pixel-wise segmentation map. Here we omit the bounding boxes to enable an unobscured view of the contours. Instead, the distinct cell instances and their individual segmentation maps are indicated by different shades of violet in Fig. 3.
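For illustration, one way such a per-image annotation could be represented is sketched below; the dictionary keys, the class-id convention and the normalised (cx, cy, w, h) box format are our own assumptions for this example, not a description of the released data format.

```python
import torch

# Hypothetical annotation for one specimen image with two object instances;
# class ids chosen here for illustration: 0 = no-object, 1 = trap, 2 = cell
annotation = {
    "classes": torch.tensor([1, 2]),                      # one trap, one cell
    "boxes": torch.tensor([[0.50, 0.50, 0.30, 0.70],      # assumed (cx, cy, w, h), normalised
                           [0.55, 0.45, 0.15, 0.20]]),
    "masks": torch.zeros(2, 128, 128, dtype=torch.bool),  # one binary segmentation map per instance
}
```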

The annotated set of 419 specimen images from a selection of various experiments was randomly assigned for network training, validation and testing (76 %, 12 % and 12 % respectively). Examples are shown in Fig. 4. Images include a balance of the common yeast-trap configurations: 1) empty traps, 2) single cells (with daughter) and 3) multiple cells. Slight variations in trap fabrication, debris and contamination, focal shift, illumination levels and yeast morphology were included. Further possible scenarios or strong variations were omitted, such as other trap design geometries, other model organisms and significant focal shift.

    B. The Cell-DETR instance segmentation architecture

The proposed Cell-DETR models A and B are based on the DETR panoptic segmentation architecture [8]. We adapted the architecture for non-overlapping instance segmentation and reduced it in size for faster inference. The main differences between DETR and our variants Cell-DETR A and B are summarised in Table I. The Cell-DETR variants have approximately one order of magnitude fewer parameters than the original (∼40 × 10⁶ reduced to ∼5 × 10⁶ parameters). The main building blocks of the Cell-DETR model are detailed in Fig. 5. They are the backbone CNN encoder, the transformer encoder-decoder, the bounding box and class prediction heads, and the segmentation head.

The CNN encoder (left in Fig. 5) extracts image features of the brightfield specimen image input. It is based on four ResNet-like [24] blocks with 64, 128, 256 and 256 convolutional filters. After each block, a 2 × 2 average pooling layer is utilised to downsample the intermediate feature maps. The Cell-DETR variants employ different activations and convolutions, as detailed in Table I.
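As an illustration, the following PyTorch sketch stacks ResNet-like blocks with the stated filter counts and 2 × 2 average pooling; the exact layer composition (normalisation, activation placement, skip projection) is an assumption, since the text only specifies the filter counts and the pooling.

```python
import torch
import torch.nn as nn

class BackboneBlock(nn.Module):
    """ResNet-like block followed by 2x2 average pooling (illustrative sketch)."""
    def __init__(self, in_channels: int, out_channels: int) -> None:
        super().__init__()
        self.residual = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.LeakyReLU(),
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
        )
        # 1x1 convolution to match channels on the skip path
        self.skip = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)
        self.activation = nn.LeakyReLU()
        self.pool = nn.AvgPool2d(kernel_size=2)  # 2x2 average pooling after each block

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pool(self.activation(self.residual(x) + self.skip(x)))

# Four blocks with 64, 128, 256 and 256 filters, as stated in the text
backbone = nn.Sequential(
    BackboneBlock(1, 64), BackboneBlock(64, 128),
    BackboneBlock(128, 256), BackboneBlock(256, 256),
)
features = backbone(torch.randn(1, 1, 128, 128))  # shape (1, 256, 8, 8)
```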

Fig. 4. Characteristic selection of specimen images and corresponding annotations, including empty or single trap structures, trapped single cells (with single daughter adjacent) and multiple trapped cells; trap instances in shades of dark grey, cell instances in shades of violet and transparent background; scale bar 10 µm.

The transformer encoder determines the attention between image features. The transformer decoder predicts the attention regions for each of the N = 20 object queries. They are based on the DETR architecture [8]. We reduced the number of transformer encoder blocks to three and decoder blocks to two, each with 512 hidden features in the feed-forward neural network (FFNN). The 128 backbone features are flattened before being fed into the transformer. In contrast to the original DETR, we employed learned positional encodings. While Cell-DETR A employs leaky ReLU [25] activations, Padé activation units [26] are utilised for Cell-DETR B.
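A rough sketch of how flattened, projected backbone features, learned positional encodings and the N = 20 object queries could be wired into such a reduced transformer is given below, using the standard PyTorch nn.Transformer as a stand-in for the DETR-style encoder-decoder; the 1 × 1 projection, the 8 × 8 feature-map size and the head count are assumptions.

```python
import torch
import torch.nn as nn

N_QUERIES, D_MODEL, FFN_DIM = 20, 128, 512

# 1x1 projection of the backbone output (assumed 256 channels) to the transformer width
input_proj = nn.Conv2d(256, D_MODEL, kernel_size=1)

# Reduced transformer: three encoder and two decoder blocks, 512 hidden FFNN features
transformer = nn.Transformer(
    d_model=D_MODEL, nhead=8, num_encoder_layers=3,
    num_decoder_layers=2, dim_feedforward=FFN_DIM,
)

# Learned positional encodings for an assumed 8x8 feature map and learned object queries
pos_embed = nn.Parameter(torch.randn(8 * 8, 1, D_MODEL))
query_embed = nn.Parameter(torch.randn(N_QUERIES, 1, D_MODEL))

features = torch.randn(1, 256, 8, 8)                     # backbone output (batch size 1)
src = input_proj(features).flatten(2).permute(2, 0, 1)   # (sequence, batch, feature) = (64, 1, 128)
decoded = transformer(src + pos_embed, query_embed)      # decoder output, shape (20, 1, 128)
```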

TABLE I
OVERVIEW OF DIFFERENCES BETWEEN DETR, CELL-DETR A AND B.

Model       Activation functions   Convolutions           Feature fusion           Param. ×10⁶
DETR [8]    ReLU                   standard spatial       addition                 ∼40
C-DETR A    leaky ReLU [25]        standard spatial       addition                 4.3
C-DETR B    Padé [26]              deformable (v2) [27]   pix.-adapt. conv. [28]   5.0

Fig. 5. Architecture of the end-to-end instance segmentation network, with brightfield specimen image input and an instance segmentation prediction as output. The backbone CNN encoder extracts image features that feed into both the transformer encoder-decoder for class and bounding box prediction, as well as into the CNN decoder for segmentation. The transformer encoded features, as well as the transformer decoded features, are fed into a multi-head-attention module and, together with the image features from the CNN backbone, feed into the CNN decoder for segmentation. Skip connections additionally bridge between the backbone CNN encoder and the CNN decoder. Input and output resolution is 128 × 128.

The prediction heads for the bounding box and classification are each a FFNN, mapping the transformer encoder-decoder output to the bounding box and classification predictions. These FFNNs process each query in parallel and share parameters over all queries. In addition to the cell and trap classes, the classification head can also predict the no-object class ∅.
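The following sketch illustrates prediction heads of this kind, applied with shared weights to every decoded query; the hidden layer size and the sigmoid box parameterisation are assumptions, only the three classes and four box coordinates follow from the text.

```python
import torch
import torch.nn as nn

n_queries, d_model, n_classes = 20, 128, 3   # classes: no-object, trap, cell

# Shared heads: the same weights are applied to every query in parallel
class_head = nn.Linear(d_model, n_classes)
bbox_head = nn.Sequential(
    nn.Linear(d_model, d_model), nn.ReLU(),
    nn.Linear(d_model, 4), nn.Sigmoid(),     # four normalised box coordinates
)

decoded_queries = torch.randn(1, n_queries, d_model)  # transformer decoder output
class_logits = class_head(decoded_queries)            # shape (1, 20, 3)
boxes = bbox_head(decoded_queries)                    # shape (1, 20, 4)
```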

The segmentation head is composed of a multi-head attention mechanism and a CNN decoder to predict the segmentation maps for each object instance. We employ the original DETR [8] two-dimensional multi-head attention mechanism between the transformer encoder and decoder features. The resulting attention maps are concatenated channel-wise onto the image features and fed into the CNN decoder. The three ResNet-like decoder blocks decrease the feature channel size while increasing the spatial dimensions. Long skip connections bridge between the CNN encoder and CNN decoder blocks' respective outputs. The features are fused by addition for Cell-DETR A and by pixel-adaptive convolutions for Cell-DETR B. A fourth convolutional block incorporates the queries in the feature dimension and returns the original input's spatial dimension (128 × 128 pixels) for each query. Non-overlapping segmentation is ensured by applying a softmax over all queries.
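As a minimal sketch of the non-overlapping property, assuming the segmentation head outputs one logit map per query, a softmax across the query dimension makes the per-pixel scores over all queries sum to one, so every pixel is assigned to exactly one query:

```python
import torch

batch, n_queries, height, width = 1, 20, 128, 128

# Per-query segmentation logits from the CNN decoder (illustrative random values)
logits = torch.randn(batch, n_queries, height, width)

# Softmax over the query dimension: the resulting instance maps cannot overlap
masks = torch.softmax(logits, dim=1)

# Hard assignment of each pixel to a single query / object instance
instance_map = masks.argmax(dim=1)   # shape (1, 128, 128)
```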

    C. Training Cell-DETR

We employ a combined loss function and a direct set prediction to train our Cell-DETR networks end-to-end. The set prediction $\hat{y} = \{\hat{y}_i = \{\hat{p}_i, \hat{b}_i, \hat{s}_i\}\}_{i=1}^{N=20}$ is comprised of the respective predictions for class probability $\hat{p}_i \in \mathbb{R}^K$ (here $K = 3$ classes: no-object, trap, cell), bounding box $\hat{b}_i \in \mathbb{R}^4$ and segmentation $\hat{s}_i \in \mathbb{R}^{128 \times 128}$ for each of the $N$ queries. We assigned each instance set label $y_{\sigma(i)}$ to the corresponding query set prediction $\hat{y}_i$ with the Hungarian algorithm [8], [29]. The indices $\sigma(i)$ denote the best matching permutation of labels.
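A sketch of the instance-to-query assignment with the Hungarian algorithm is shown below, using SciPy's linear_sum_assignment; the pairwise cost here is reduced to a classification term for brevity, whereas DETR-style matching costs also include bounding box terms.

```python
import torch
from scipy.optimize import linear_sum_assignment

n_queries, n_instances, n_classes = 20, 4, 3

# Predicted class probabilities per query and ground-truth class per instance
pred_probs = torch.softmax(torch.randn(n_queries, n_classes), dim=-1)
target_classes = torch.tensor([2, 2, 1, 1])   # e.g. two cells and two traps

# Cost of assigning instance j to query i: negative predicted probability of the
# instance's true class (box and mask cost terms omitted for brevity)
cost = -pred_probs[:, target_classes]          # shape (20, 4)

query_idx, instance_idx = linear_sum_assignment(cost.numpy())
# query_idx[k] is matched to ground-truth instance instance_idx[k]
```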

The combined loss $\mathcal{L}$ is comprised of a classification loss $\mathcal{L}_p$, a bounding box loss $\mathcal{L}_b$, and a segmentation loss $\mathcal{L}_s$

$$\mathcal{L} = \sum_{i=1}^{N} \left( \mathcal{L}_p + \mathbb{1}_{\{p_i \neq \varnothing\}} \mathcal{L}_b + \mathbb{1}_{\{p_i \neq \varnothing\}} \mathcal{L}_s \right),$$

with $N = 20$ object instance queries in this case. We employ a class-wise weighted cross entropy for the classification loss

$$\mathcal{L}_p\left(p_{\sigma(i)}, \hat{p}_i\right) = -\sum_{k=1}^{K} \beta_k \, p_{\sigma(i),k} \log\left(\hat{p}_{i,k}\right),$$

with weights $\beta = [0.5, 0.5, 1.5]$ for the $K = 3$ no-object, trap and cell classes respectively. The bounding box loss itself is composed of two weighted loss terms. These are a generalised intersection-over-union $\mathcal{L}_J$ [30] and an $L_1$ loss, with respective weights $\lambda_J = 0.4$ and $\lambda_{L1} = 0.6$

$$\mathcal{L}_b\left(b_{\sigma(i)}, \hat{b}_i\right) = \lambda_J \, \mathcal{L}_J\left(b_{\sigma(i)}, \hat{b}_i\right) + \lambda_{L1} \left\lVert b_{\sigma(i)} - \hat{b}_i \right\rVert_1.$$

The segmentation loss $\mathcal{L}_s$ is a weighted sum of the focal loss $\mathcal{L}_F$ [31] and the Sørensen-Dice loss $\mathcal{L}_D$ [6], [8]

$$\mathcal{L}_s\left(s_{\sigma(i)}, \hat{s}_i\right) = \lambda_F \, \mathcal{L}_F\left(s_{\sigma(i)}, \hat{s}_i; \gamma\right) + \lambda_D \, \mathcal{L}_D\left(s_{\sigma(i)}, \hat{s}_i; \epsilon\right).$$

The respective weights are $\lambda_F = 0.05$ and $\lambda_D = 1$, with focusing parameter $\gamma = 2$ and $\epsilon = 1$ for numerical stability.
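A minimal sketch of this weighted focal plus Sørensen-Dice segmentation loss for a single matched query is given below; the exact binary formulation (per-pixel probabilities and mask shapes) is an assumption for illustration, while the weights and parameters follow the text.

```python
import torch

def segmentation_loss(pred: torch.Tensor, target: torch.Tensor,
                      lambda_f: float = 0.05, lambda_d: float = 1.0,
                      gamma: float = 2.0, eps: float = 1.0) -> torch.Tensor:
    """Weighted sum of a binary focal loss and a Sørensen-Dice loss.

    pred:   predicted foreground probabilities in [0, 1], shape (128, 128)
    target: binary ground-truth mask, shape (128, 128)
    """
    # Binary focal loss with focusing parameter gamma
    p_t = torch.where(target > 0.5, pred, 1.0 - pred).clamp(min=1e-6)
    focal = (-(1.0 - p_t) ** gamma * torch.log(p_t)).mean()

    # Sørensen-Dice loss with eps for numerical stability
    intersection = (pred * target).sum()
    dice = 1.0 - (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)

    return lambda_f * focal + lambda_d * dice

loss = segmentation_loss(torch.rand(128, 128), (torch.rand(128, 128) > 0.5).float())
```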

    D. Evaluation and implementation

We employ a number of metrics to quantitatively analyse the performance of the trained networks with regard to classification, bounding box and segmentation performance. Given the ground truth $\mathcal{Y}$ and the prediction $\hat{\mathcal{Y}}$ (in the corresponding instance-matched permutation), we evaluate the segmentation performance with variants of the Jaccard index $J$ and the Sørensen-Dice coefficient $D$ [6], [8], omitting the background

$$D(\mathcal{Y}, \hat{\mathcal{Y}}) = \frac{2\,\lvert\mathcal{Y} \cap \hat{\mathcal{Y}}\rvert}{\lvert\mathcal{Y}\rvert + \lvert\hat{\mathcal{Y}}\rvert}; \qquad J_k(\mathcal{Y}_k, \hat{\mathcal{Y}}_k) = \frac{\lvert\mathcal{Y}_k \cap \hat{\mathcal{Y}}_k\rvert}{\lvert\mathcal{Y}_k \cup \hat{\mathcal{Y}}_k\rvert}, \qquad (1)$$

with $J_k$ intuitively the intersection-over-union for each class $k$. With respect to the metrological application in image cytometry, the cell class is of most importance; therefore, we consider the Jaccard index for the cell class alone ($J_c$). Similarly, in the case of instance segmentation, we compute $J_i$ for each instance $i$ and average over all $I$ object instances to compute the mean instance Jaccard index $\bar{J}_I = \frac{1}{I}\sum_{i=1}^{I} J_i$.
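For completeness, a short sketch of these segmentation metrics on binary masks, as they might be computed in PyTorch (function names are ours, not from the released code):

```python
import torch

def dice_coefficient(pred: torch.Tensor, target: torch.Tensor) -> float:
    """Sørensen-Dice coefficient D for binary masks (Eqn. 1, left)."""
    pred, target = pred.bool(), target.bool()
    intersection = (pred & target).sum().item()
    return 2.0 * intersection / (pred.sum().item() + target.sum().item())

def jaccard_index(pred: torch.Tensor, target: torch.Tensor) -> float:
    """Jaccard index / intersection-over-union J_k for one class or instance."""
    pred, target = pred.bool(), target.bool()
    intersection = (pred & target).sum().item()
    union = (pred | target).sum().item()
    return intersection / union

def mean_instance_jaccard(pred_masks, target_masks) -> float:
    """Mean instance Jaccard index: average J_i over all matched object instances."""
    scores = [jaccard_index(p, t) for p, t in zip(pred_masks, target_masks)]
    return sum(scores) / len(scores)
```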

We utilise the accuracy as the proportion of correct predictions for classification. The bounding boxes are evaluated with the Jaccard index J̄b. It is defined analogously to the object instance Jaccard index (compare Eqn. 1), yet computed implicitly with the bounding box coordinates.

We compare the proposed method with our own implementations of both the state-of-the-art for the trapped yeast application (U-Net [6]), as well as more generally with a state-of-the-art instance segmentation meta-algorithm (Mask R-CNN [1]). The multiclass U-Net for semantic segmentation was implemented in PyTorch, with the architecture, pre- and post-processing described in [6]. We implemented a Mask R-CNN [1] with Torchvision (PyTorch) and a ResNet-18 [24] backbone, pre-trained for image classification.

We implemented the proposed Cell-DETR A and B architectures with PyTorch. We used the MMDetection toolkit [32] for deformable convolutions and the PyTorch/CUDA implementation for the Padé activation units [26]. We trained the models using AdamW [33] for optimisation with a weight decay of 10⁻⁶. The learning rate was 10⁻⁵ for the backbone and 10⁻⁴ for the rest of the model. The learning rates were decreased by an order of magnitude after 50 and again after 100 epochs of the total 200 epochs. The additional first and second-order momentum moving average factors were 0.9 and 0.999 respectively. We selected the best performing model based on the cell class Jaccard index Jc, typically after 80 to 140 epochs, with mini-batch size 8. The training data was randomly augmented by elastic deformation [6], [34], horizontal flipping or by the addition of noise, with a probability of 0.6. Inference runtimes for one forward pass were averaged over 1000 runs on an Nvidia RTX 2080 Ti for all three methods (U-Net, Mask R-CNN and Cell-DETR).
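The optimiser and schedule described above could be set up roughly as follows; the placeholder model, the parameter-group split and the MultiStepLR call are our assumptions, while the learning rates, weight decay, momentum factors and milestones follow the text.

```python
import torch
import torch.nn as nn

class DummyCellDETR(nn.Module):
    """Placeholder model standing in for the full Cell-DETR network."""
    def __init__(self) -> None:
        super().__init__()
        self.backbone = nn.Conv2d(1, 64, kernel_size=3, padding=1)
        self.head = nn.Linear(64, 3)

model = DummyCellDETR()

# Separate learning rates: 1e-5 for the backbone, 1e-4 for the rest of the model
param_groups = [
    {"params": model.backbone.parameters(), "lr": 1e-5},
    {"params": model.head.parameters(), "lr": 1e-4},
]

# AdamW with weight decay 1e-6 and momentum moving-average factors (0.9, 0.999)
optimizer = torch.optim.AdamW(param_groups, betas=(0.9, 0.999), weight_decay=1e-6)

# Decrease both learning rates by an order of magnitude after 50 and again after
# 100 of the 200 training epochs
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[50, 100], gamma=0.1)

for epoch in range(200):
    # ... one training epoch over mini-batches of size 8 would run here ...
    scheduler.step()
```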

    E. Data acquisition setup

Yeast cells were cultured in a tightly controlled microfluidic environment that was conducive to continued growth. The microfluidic chips confined the cells to the focal plane of the microscope. Continuous media flow hydrodynamically traps the living cells in the microstructures. The polydimethylsiloxane (PDMS) microstructures constrain the cells in XY, while axial constraints in Z are provided by the cover slip and PDMS ceiling. The space between the cover slip and the PDMS ceiling is on the order of a cell diameter to facilitate continuously uniform focus of the cells. A temperature of 30 °C and the flow of yeast growth media enable yeast to grow for prolonged periods and over multiple cell cycles.

We recorded time-lapse brightfield (transmitted light) and fluorescent channel imagery of the budding yeast cells every 10 min with a computer-controlled microscope (Nikon Eclipse Ti with XYZ stage; µManager; 60x objective). A CoolLED pE-100 and a Lumencor SpectraX light engine illuminated the respective channels, which were captured with an ORCA Flash 4.0 (Hamamatsu) camera. Multiple lateral and axial positions were recorded sequentially at each timestep (Fig. 2).

    IV. RESULTS

    A. Cell-DETR variant results

A sample of segmentation results for the two Cell-DETR variants is shown in Fig. 6. The cell and trap instances are all detected and classified correctly, with slight variations in segmentation contours. Separate instances of cells and traps are indicated by the shades of violet and grey respectively. Variant B demonstrates slightly better segmentation performance. A qualitative example of this is shown in Fig. 6, where Cell-DETR A, in contrast to B, excludes a small section of one cell.

Fig. 6. Qualitative comparison of Cell-DETR A and B segmentation examples for a selected test image (left) and label (right); panels from left to right: brightfield, C-DETR A, C-DETR B, label; trap instances in shades of dark grey, cell instances in shades of violet; scale bar 10 µm.

The quantitative comparison between the segmentation performance of the Cell-DETR variants is summarised in Table II. We modified model B for better performance on our application, as described in Section III-B. The mean Jaccard index over all object instances increased from J̄I = 0.84 for model A to J̄I = 0.85 for model B, while the cell class Jaccard index increased by a similar margin from Jc = 0.83 to Jc = 0.84. Taking the background into account, an accuracy of 0.96 is achieved. Both Cell-DETR variants surpass the segmentation performance (Jc) of the previous state-of-the-art methods for the trapped yeast application [6], [15], in addition to directly attaining the instances.

TABLE II
SEGMENTATION PERFORMANCE OF CELL-DETR A AND B.

Model       Sørensen-Dice D   Mean instance J̄I   Cell class Jc   Seg. accuracy
C-DETR A    0.92              0.84                0.83            0.96
C-DETR B    0.92              0.85                0.84            0.96

The bounding box and classification performance is summarised in Table III. Again, both models perform similarly well. They correctly classify the object instances (classification accuracy of 1.0) and detect the correct number of instances for each class. They also perform similarly well at localising the instances, achieving a bounding box intersection-over-union of Jb = 0.81 for the standard formulation as well as the generalised form employed for training.

TABLE III
BOUNDING BOX AND CLASSIFICATION PERFORMANCE OF CELL-DETR A AND B.

Model       Bounding box Jaccard Jb   Classification accuracy
C-DETR A    0.81                      1.0
C-DETR B    0.81                      1.0

The slight increase in segmentation performance that model B yields is a trade-off with increased computational cost. The number of parameters increases from approximately 4 × 10⁶ to over 5 × 10⁶ (Table I). This leads to an increase in runtime from 9.0 ms for model A to 21.2 ms for model B. These times are orders of magnitude faster than the previous state-of-the-art method DISCO [15] and on the same order of magnitude as the currently fastest reported network for this application [6]. Runtimes on this order of magnitude suffice for in-the-loop experimental techniques.

We select model B for further analysis, based on the improved performance and sufficiently fast runtimes. A selection of segmentation predictions for the three most typical scenarios in the test dataset is given in Fig. 7. The detection of cell and trap instances, without any overlap between instances, is successful for single cells (middle row) and multiple cells (bottom row), and empty traps are correctly identified. The introduction of multiple classes (traps, cells), as well as individual object instances, facilitated individually segmenting each cell and discerning these from both the traps and other cells.

Fig. 7. Example of different scenarios from the test dataset segmented with Cell-DETR B: an empty trap (top row), a single trapped cell (middle row) and multiple cells (bottom row); columns are brightfield, an overlay of the prediction, the prediction mask and the ground truth label (left to right respectively). Colours indicate traps in shades of grey and cell instances in shades of violet; scale bar 10 µm.

The intended application of our method is to deliver segmentation masks for each cell instance for subsequent single-cell fluorescence measurements. We trialled this application on unlabelled and unseen data as depicted in Fig. 8. The cell instances are detected based on the brightfield image (left) and the resulting object segmentation predictions are used as masks to measure the individual cell fluorescence on the fluorescent channel (right). An overlay of the brightfield and fluorescent images with the segmentation contours is depicted in the middle, along with the green fluorescent protein (GFP) channel. The individual cell area (A1 and A2) is measured as the number of pixels in the instance segmentation mask and indicated on the GFP channel. The cell instance fluorescence (F1 and F2) is summed over the mask area and indicated on the right for each individual cell in arbitrary fluorescence units.
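A short sketch of this masked readout, assuming a predicted binary instance mask and a co-registered GFP-channel image of the same resolution (values here are synthetic):

```python
import torch

height, width = 128, 128

# Binary instance segmentation mask predicted from the brightfield image
cell_mask = torch.zeros(height, width, dtype=torch.bool)
cell_mask[40:70, 50:80] = True                          # illustrative cell region

# Co-registered fluorescence (GFP) channel in arbitrary units
gfp_image = torch.rand(height, width) * 1e4

cell_area = int(cell_mask.sum())                        # cell area A in pixels
cell_fluorescence = float(gfp_image[cell_mask].sum())   # total fluorescence F in a.u.

print(f"A = {cell_area} pix., F = {cell_fluorescence:.2e} a.u.")
```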

Fig. 8. Example of individual cell fluorescence measurement application with a segmentation mask contour for each individual cell (violet contours) based on the brightfield image (left); panels from left to right: brightfield, brightfield + GFP overlay, GFP, and GFP with masks; measured values A1 = 929 pix. with F1 = 6.1×10⁶ a.u. and A2 = 241 pix. with F2 = 2.5×10⁶ a.u.; scale bar 10 µm.

    B. Comparison with state-of-the-art methods

We compare our proposed method with the state-of-the-art for the trapped yeast application (DISCO [15], U-Net [6]), as well as with a general state-of-the-art method for instance segmentation (Mask R-CNN [1]). We implemented both the U-Net and Mask R-CNN methods in this study (Section III-D). A characteristic qualitative example of the results is given in Fig. 9, with the ground truth on the left, followed by Cell-DETR B, Mask R-CNN and U-Net segmentation results. All three methods segment two trap microstructures and all four cells in separate classes, without any overlap or touching cells. Cell-DETR B and Mask R-CNN additionally segment each cell or trap object as an individual instance. The contours are slightly smaller for the U-Net, which is deemed a result of the emphasis on avoiding touching cells and the associated difficulty of discerning these in subsequent post-processing.

Fig. 9. Example segmentation for our implementations of Cell-DETR B, Mask R-CNN and U-Net; panels from left to right: label, C-DETR B, M. R-CNN, U-Net. Trap instances in shades of grey and cell instances in shades of violet (no instance detection for U-Net); scale bar 10 µm.

Accurate segmentation of the cells is particularly important for the measurement of cell morphology or fluorescence. We compare the cell class Jaccard index Jc of our proposed methods Cell-DETR A and B with the application-specific state-of-the-art methods DISCO and U-Net, as well as with Mask R-CNN. The comparison is summarised in Table IV. U-Net recently superseded DISCO [15] (Jc ∼ 0.7) as the state-of-the-art trapped yeast segmentation method, achieving Jc = 0.82. Our Cell-DETR variants both further improve on this result, with model B achieving Jc = 0.84, on par with our Mask R-CNN implementation. Cell-DETR and Mask R-CNN additionally provide each cell object instance.

TABLE IV
COMPARISON OF CELL-DETR PERFORMANCE WITH THE STATE-OF-THE-ART METHODS FOR THE TRAPPED YEAST APPLICATION (DISCO, U-NET) AND INSTANCE SEGMENTATION (MASK R-CNN).

Model          Cell class Jc   Inference runtime¹   Instances
DISCO [15]²    ∼0.70           ∼1300 ms             ✗
U-Net          0.82            1.8 ms               ✗
Mask R-CNN     0.84            29.8 ms              ✓
Cell-DETR A    0.83            9.0 ms               ✓
Cell-DETR B    0.84            21.2 ms              ✓

¹ Runtimes for U-Net, Mask R-CNN, and Cell-DETR averaged over 1000 runs (∼300 different images) on an Nvidia RTX 2080 Ti; measurement uncertainty is below ±5 %.
² Reported literature values [15].

We measured the average runtime of a forward pass of each method on a single specimen image (Table IV). For DISCO [15] we consider the reported values that include some pre- and post-processing steps to detect cells individually. The deep methods are significantly faster than DISCO, making online monitoring of live experiments feasible. The U-Net is the fastest, taking 1.8 ms for a forward pass, in contrast to 29.8 ms for the Mask R-CNN [1]. However, the U-Net requires further post-processing steps to detect the object instances and has been reported to take approximately 20 ms in conjunction with watershed post-processing [6]. The Cell-DETR variants take the middle ground with 9.0 ms and 21.2 ms.

    V. DISCUSSION

    A. Analysis of the instance segmentation performance

Cell-DETR has some benefits in comparison to state-of-the-art methods such as Mask R-CNN. The Cell-DETR architecture is comparatively simple and avoids common hand-designed components of Mask R-CNNs, such as non-maximum suppression and ROI pooling. This reduces Cell-DETR's reliance on hyperparameters and facilitates end-to-end training with a single combined loss function. In contrast, Mask R-CNNs require additional supervision to train the region proposal network. As a result of these differences, Cell-DETR is easier to implement, has fewer parameters and is faster than Mask R-CNN for the same segmentation performance.

While Cell-DETR does not rely on explicit region proposals, it does utilise attention maps that highlight the pertinent features in the latent space. The mapping of these is learnt during the end-to-end training. The loss curves of the individual prediction tasks are shown in Fig. 10. The classification loss Lp (blue) converges first, indicating that the network first learns how many objects are present in an image and to which class they belong. The bounding box loss Lb (red) converges next, with the network learning the approximate location of each object. Finally, the model learns to refine the pixel-wise segmentation maps, with the segmentation loss Ls (green) converging last.

Fig. 10. Loss curves of the classification (Lp), bounding box (Lb) and segmentation (Ls) losses for Cell-DETR B over training steps; thick lines are running averages (window size 30).

With respect to the specific single-cell measurement application, Cell-DETR offers robust and repeatable instance segmentation of yeast cells in microstructures. The key cell class segmentation performance surpasses the previous state-of-the-art semantic segmentation methods [6], [15] with a cell class Jaccard index of 0.84. Additionally, the proposed technique directly detects individual object instances and classifies the objects robustly (near 100 % accuracy). The robust instance segmentation performance promises to facilitate cell tracking, increase the experimental information yield and enable Cell-DETR to be employed without human intervention.

    B. Limitations, outlook and future potential

The presented models are trained for a specific microfluidic configuration and trap geometry. While they are relatively robust and fulfil their intended purpose, their utility could be broadened by expanding the dataset to include more classes, for example different trap geometries. More generally, as an instance segmentation method, Cell-DETR offers a platform for incorporating future advances in attention mechanisms, as these are increasingly outperforming convolutional approaches. For example, replacing the convolutional elements in the backbone and segmentation head with axial-attention [23] may lead to further improved performance. Currently, Cell-DETR achieves state-of-the-art performance and, as an instance segmentation method, is generally suitable for and readily adaptable to a wide range of biomedical imaging applications.

The presented Cell-DETR methods can be harnessed for high-content quantitative single-cell TLFM. Cell-DETR, Mask R-CNN and U-Net achieve runtimes orders of magnitude faster than the previous state-of-the-art trapped yeast method (DISCO [15]). These runtimes, coupled with Cell-DETR's robust instance segmentation, make both online monitoring and closed-loop optimal experimental design of typical experiments with approximately 1000 traps feasible. Harnessing this potential promises to provide increased experimental information yields and greater biological insights in the future.

VI. CONCLUSION

In summary, we present Cell-DETR, an attention-based transformer method for instance segmentation and showcase it on a typical application. To the best of our knowledge, this is not only the first application of a DETR adaptation to biomedical data but, more generally, the first attention-based transformer method for direct instance segmentation. The proposed method has fewer parameters and is 30 % faster while matching the segmentation performance of a state-of-the-art Mask R-CNN. A simpler Cell-DETR variant exhibits slightly lower segmentation performance (Jc = 0.83 instead of 0.84) while requiring a third of a Mask R-CNN's runtime. As a general instance segmentation model, Cell-DETR achieves state-of-the-art performance and is deemed suitable and readily adaptable for a range of biomedical imaging applications.

Showcased on a typical systems or synthetic biology application, the proposed Cell-DETR robustly detects each cell instance and directly provides instance-wise segmentation maps suitable for cell morphology and fluorescence measurements. In comparison to the previous semantic segmentation trapped yeast baselines, Cell-DETR provides better segmentation performance with a cell class Jaccard index Jc = 0.84, while additionally detecting each individual cell instance and maintaining comparable runtimes. This promises to reduce measurement uncertainty, improve cell tracking efficacy and increase the experimental data yield in future applications. The resulting runtimes and accurate instance segmentation make future online monitoring feasible, for example for closed-loop optimal experimental control.

ACKNOWLEDGEMENTS

We thank Christian Wildner for insightful discussions, André O. Françani and Jan Basrawi for contributing to labelling and Markus Baier for aid with the computational setup.

This work was supported by the Landesoffensive für wissenschaftliche Exzellenz as part of the LOEWE Schwerpunkt CompuGene. H.K. acknowledges support from the European Research Council (ERC) with the consolidator grant CONSYN (nr. 773196).

REFERENCES

[1] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in IEEE ICCV, 2017, pp. 2961–2969.
[2] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, "The cityscapes dataset for semantic urban scene understanding," in IEEE/CVF CVPR, 2016.
[3] J. Sun, A. Tárnok, and X. Su, "Deep Learning-Based Single-Cell Optical Image Studies," Cytom. Part A, vol. 97, no. 3, pp. 226–240, 2020.
[4] M. Leygeber, D. Lindemann, C. C. Sachs, E. Kaganovitch, W. Wiechert, K. Nöh, and D. Kohlheyer, "Analyzing Microbial Population Heterogeneity - Expanding the Toolbox of Microfluidic Single-Cell Cultivations," J. Mol. Biol., 2019.
[5] A. Hofmann, J. Falk, T. Prangemeier, D. Happel, A. Köber, A. Christmann, H. Koeppl, and H. Kolmar, "A tightly regulated and adjustable CRISPR-dCas9 based AND gate in yeast," Nucleic Acids Res., vol. 47, no. 1, pp. 509–520, 2019.
[6] T. Prangemeier, C. Wildner, A. O. Françani, C. Reich, and H. Koeppl, "Multiclass yeast segmentation in microstructured environments with deep learning," IEEE CIBCB (in press), 2020.
[7] A. Kirillov, K. He, R. Girshick, C. Rother, and P. Dollár, "Panoptic segmentation," in IEEE/CVF CVPR, 2019, pp. 9404–9413.
[8] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, "End-to-end object detection with transformers," arXiv:2005.12872, 2020.
[9] J.-B. Lugagne, H. Lin, and M. J. Dunlop, "DeLTA: Automated cell segmentation, tracking, and lineage reconstruction using deep learning," PLoS Comput Biol, vol. 16, no. 4, 2020.
[10] T. Prangemeier, F. X. Lehr, R. M. Schoeman, and H. Koeppl, "Microfluidic platforms for the dynamic characterisation of synthetic circuitry," Curr. Opin. Biotechnol., vol. 63, pp. 167–176, 2020.
[11] R. Pepperkok and J. Ellenberg, "High-throughput fluorescence microscopy for systems biology," Nat. Rev. Mol. Cell Biol., vol. 7, no. 9, pp. 690–696, 2006.
[12] D. G. Cabeza, L. Bandiera, E. Balsa-Canto, and F. Menolascina, "Information content analysis reveals desirable aspects of in vivo experiments of a synthetic circuit," in 2019 IEEE CIBCB, 2019, pp. 1–8.
[13] Z. Xie, L. Wroblewska, L. Prochazka, R. Weiss, and Y. Benenson, "Multi-input RNAi-based logic circuit for identification of specific cancer cells," Science, vol. 333, pp. 1307–1312, 2011.
[14] W. Si, C. Li, and P. Wei, "Synthetic immunology: T-cell engineering and adoptive immunotherapy," Synth. Syst. Biotechnol., vol. 3, no. 3, pp. 179–185, 2018.
[15] E. Bakker, P. S. Swain, and M. M. Crane, "Morphologically constrained and data informed cell segmentation of budding yeast," Bioinformatics, vol. 34, no. 1, pp. 88–96, 2018.
[16] M. M. Crane, I. B. N. Clark, E. Bakker, S. Smith, and P. S. Swain, "A Microfluidic System for Studying Ageing and Dynamic Single-Cell Responses in Budding Yeast," PLoS One, vol. 9, p. e100042, 2014.
[17] D. A. Van Valen, T. Kudo, K. M. Lane, D. N. Macklin, N. T. Quach, M. M. DeFelice, I. Maayan, Y. Tanouchi, E. A. Ashley, and M. W. Covert, "Deep Learning Automates the Quantitative Analysis of Individual Cells in Live-Cell Imaging Experiments," PLoS Comput Biol, vol. 12, no. 11, pp. 1–24, 2016.
[18] J. Sauls, J. Schroeder, S. Brown, G. Treut, F. Si, D. Li, J. Wang, and S. Jun, "Mother machine image analysis with MM3," bioRxiv, 2019.
[19] E. Moen, D. Bannon, T. Kudo, W. Graf, M. Covert, and D. Van Valen, "Deep learning for cellular image analysis," Nat. Methods, vol. 16, no. 12, pp. 1233–1246, 2019.
[20] T. Prangemeier, C. Wildner, M. Hanst, and H. Koeppl, "Maximizing information gain for the characterization of biomolecular circuits," in Proc. 5th ACM/IEEE NanoCom, 2018, pp. 1–6.
[21] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in NeurIPS, 2017, pp. 5998–6008.
[22] M. Chen, A. Radford, R. Child, J. Wu, and H. Jun, "Generative pretraining from pixels," in ICML, 2020, pp. 10456–10468.
[23] H. Wang, Y. Zhu, B. Green, H. Adam, A. Yuille, and L.-C. Chen, "Axial-DeepLab: Stand-alone axial-attention for panoptic segmentation," arXiv:2003.07853, 2020.
[24] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in IEEE/CVF CVPR, 2016, pp. 770–778.
[25] A. L. Maas, A. Y. Hannun, and A. Y. Ng, "Rectifier nonlinearities improve neural network acoustic models," in ICML, 2013, p. 3.
[26] A. Molina, P. Schramowski, and K. Kersting, "Padé activation units: End-to-end learning of flexible activation functions in deep networks," arXiv:1907.06732, 2019.
[27] X. Zhu, H. Hu, S. Lin, and J. Dai, "Deformable ConvNets v2: More deformable, better results," in IEEE/CVF CVPR, 2019, pp. 9308–9316.
[28] H. Su, V. Jampani, D. Sun, O. Gallo, E. Learned-Miller, and J. Kautz, "Pixel-adaptive convolutional neural networks," in IEEE/CVF CVPR, 2019, pp. 11166–11175.
[29] H. W. Kuhn, "The Hungarian method for the assignment problem," Naval Research Logistics Quarterly, vol. 2, no. 1-2, pp. 83–97, 1955.
[30] H. Rezatofighi, N. Tsoi, J. Gwak, A. Sadeghian, I. Reid, and S. Savarese, "Generalized intersection over union: A metric and a loss for bounding box regression," in IEEE/CVF CVPR, 2019, pp. 658–666.
[31] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, "Focal loss for dense object detection," in IEEE ICCV, 2017, pp. 2980–2988.
[32] K. Chen, J. Wang, J. Pang et al., "MMDetection: Open MMLab detection toolbox and benchmark," arXiv:1906.07155, 2019.
[33] I. Loshchilov and F. Hutter, "Decoupled weight decay regularization," in ICLR, 2019.
[34] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional Networks for Biomedical Image Segmentation," in MICCAI, 2015, pp. 234–241.