BMC Bioinformatics (BioMed Central)
Open Access, Methodology article

Using iterative cluster merging with improved gap statistics to perform online phenotype discovery in the context of high-throughput RNAi screens

Zheng Yin†1,3, Xiaobo Zhou†1, Chris Bakal†2, Fuhai Li1, Youxian Sun3, Norbert Perrimon2 and Stephen TC Wong*1

Address: 1 Center for Bioinformatics, The Methodist Hospital Research Institute and Weill Cornell College of Medicine, 6565 Fannin Street, Houston, TX, 77030, USA, 2 Department of Genetics and Howard Hughes Medical Institute, Harvard Medical School, 77 Avenue Louis Pasteur, Boston, MA, 02115, USA and 3 State Key Laboratory of Industrial Control Technology, Zhejiang University, 38 Zheda Road, Hangzhou, Zhejiang Province, 310027, PR China

Email: Zheng Yin - [email protected]; Xiaobo Zhou - [email protected]; Chris Bakal - [email protected]; Fuhai Li - [email protected]; Youxian Sun - [email protected]; Norbert Perrimon - [email protected]; Stephen TC Wong* - [email protected]

* Corresponding author †Equal contributors

Published: 5 June 2008
BMC Bioinformatics 2008, 9:264 doi:10.1186/1471-2105-9-264
Received: 30 November 2007
Accepted: 5 June 2008

This article is available from: http://www.biomedcentral.com/1471-2105/9/264

© 2008 Yin et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background: The recent emergence of high-throughput automated image acquisition technologies has forever changed how cell biologists collect and analyze data. Historically, the interpretation of cellular phenotypes in different experimental conditions has depended upon the expert opinions of well-trained biologists. Such qualitative analysis is particularly effective in detecting subtle, but important, deviations in phenotypes. However, while the rapid and continuing development of automated microscope-based technologies now facilitates the acquisition of trillions of cells in thousands of diverse experimental conditions, such as in the context of RNA interference (RNAi) or small-molecule screens, the massive size of these datasets precludes human analysis. Thus, the development of automated methods that identify novel and biologically relevant phenotypes online is one of the major challenges in high-throughput image-based screening. Ideally, phenotype discovery methods should be designed to utilize prior/existing information and tackle three challenging tasks: restoring pre-defined biologically meaningful phenotypes, differentiating novel phenotypes from known ones, and separating novel phenotypes from each other. Arbitrarily extracted information causes biased analysis, while combining the complete existing datasets with each new image is intractable in high-throughput screens.

Results: Here we present the design and implementation of a novel and robust online phenotype discovery method with broad applicability that can be used in diverse experimental contexts, especially high-throughput RNAi screens. This method features phenotype modelling and iterative cluster merging using improved gap statistics. A Gaussian Mixture Model (GMM) is employed to estimate the distribution of each existing phenotype, and is then used as the reference distribution in gap statistics. This method is broadly applicable to a number of different types of image-based datasets derived from a wide spectrum of experimental conditions and is suitable for adaptively processing new images which are continuously added to existing datasets. Validations were carried out on different datasets, including a published RNAi screen using Drosophila embryos [Additional files 1, 2], a dataset for cell cycle phase identification using HeLa cells [Additional files 1, 3, 4] and a synthetic dataset using


polygons; our method tackled the three aforementioned tasks effectively, with accuracies in the range of 85%–90%. When our method is implemented in the context of a Drosophila genome-scale, image-based RNAi screen of cultured cells aimed at identifying the contribution of individual genes towards the regulation of cell shape, it efficiently discovers meaningful new phenotypes and provides novel biological insight. We also propose a two-step procedure to modify the novelty detection method based on one-class SVM, so that it can be used for online phenotype discovery. We compared the SVM based method with our method under different conditions using various datasets, and our method consistently outperformed the SVM based method in at least two of the three tasks by 2% to 5%. These results demonstrate that our method can be used to better identify novel phenotypes in image-based datasets from a wide range of conditions and organisms.

Conclusion: We demonstrate that our method can detect various novel phenotypes effectively in complex datasets. Experimental results also validate that our method performs consistently under different orders of image input, under variation of starting conditions (including the number and composition of existing phenotypes), and on datasets from different screens. Based on our findings, the proposed method is suitable for online phenotype discovery in diverse high-throughput image-based genetic and chemical screens.

Background

Metazoan cells have the ability to adopt an extraordinarily diverse spectrum of cell shapes. For example, the cuboidal, polarized morphology of epithelial cells differs markedly from that of neuronal cells, which extend long, thin, and highly branched projections. The shape of an individual cell is the result of a complex interplay between the activity of thousands of genes and the cell's environment. Understanding this interplay is a fundamental challenge in developmental and cell biology. Currently, there are two key aspects to deciphering cellular morphogenesis on a genome scale. The first is determining the individual functional contribution of every gene towards the regulation of cell shape, and the second is describing how complex relationships between cell shape genes affect morphology. With the advent of high-throughput RNA interference (RNAi) screening technologies, particularly in model systems such as Drosophila melanogaster [1], it is now possible to systematically query the involvement of genes in the regulation of different cellular processes and functions. Typically, RNAi-based genetic screens involve the acquisition of relatively low-content, single-dimensional data which is easily analyzed using conventional and unbiased means, and which is thus feasible to perform on genome or multi-genome scales [1,2]. In order to facilitate similar analysis of image-based screens, we and other researchers have recently developed novel image segmentation algorithms to rapidly quantitate hundreds of different parameters at a single-cell level in an automated fashion [3-6], and we have demonstrated that such image segmentation algorithms can be used in the context of genetic screens [7]. Notably, however, this and other similar screens [8] have been 50–100 fold smaller in scale than typical low-dimensional screens and are not yet genome-scale. The reduced scale of these screens is due, in large part, to the fact that the expert opinion of cell biologists is still an essential and rate-limiting aspect in the analysis of many image-based datasets.

Human intervention is not required in screens where the potential phenotypic outcomes are few or binary (e.g. an image-based screen where a particular marker is determined to be nuclear or non-nuclear). Such intervention is currently necessary, however, to identify novel or subtle phenotypes in image-based datasets of genetic or chemical perturbations where the dynamic range of cellular phenotypes cannot be predicted before the data is collected. For example, in genome-scale screens for regulators of cell shape, it is impossible to predict a priori the diversity of morphologies that will ultimately be present in the dataset. The failure to accurately measure this phenotypic variation will lead to concomitant classification errors, especially false negatives, and to misleading results. Current methodologies usually employ a two-step procedure to maximize the amount of variation that is captured in a particular image-based analysis. First, 100–600 phenotypic features are measured on a single-cell level (automatically but somewhat exhaustively), and second, supervised techniques assisted by biologists are used both to reduce the dimensionality of the feature space and to carry out classification on the images. The biologist has to at least perform preliminary qualitative visual scoring of a small part of the dataset in order to gain a crude assessment of the phenotypic variance that is present in this subset. Unfortunately, it is impossible to perform such analysis in the course of screens where millions of images are acquired, so the ability of these screens to identify new phenotypes is greatly limited.

The issues of defining meaningful phenotypes and describing them using informative feature subsets are closely related. Automated feature space reduction schemes have been implemented in the context of high-content screens, including the feature extraction methods examined in [9], factor analysis in [10] and the SVM-RFE method in [11]. These methods allow more effective modelling of existing phenotypes, and also highlight the need to update informative feature sets so that they can not only model existing phenotypes but also discover novel ones.

Cluster analysis is widely used to reveal the structure of unlabeled datasets. Specifically, a number of methods have been developed to estimate the number of clusters in a dataset, such as using a series of internal indices [12], jump methods [13], and weighted gap statistics [14]. Moreover, supervised approaches to cluster validation, such as using a re-sampling strategy [15], prediction strength [16], methods based on mixture models and inference of Bayesian factors [17,18], or application-specific strategies [19], have also been implemented previously. Nevertheless, most existing methods rest on certain hypotheses about a fixed dataset, and cannot be used directly for online phenotype discovery, where new images continuously extend the dataset and millions of cells are involved. Improper assumptions on data structure may cause incorrect division or merging of biologically meaningful phenotypes. To avoid this problem, such methods must combine each new image with the whole existing dataset (regardless of the large difference in cell numbers) and be re-run frequently from the very beginning.

Methods for online phenotype discovery should be sensitive and flexible to various phenotypes and should avoid frequent re-modelling involving complete existing datasets. As a kernel machine based novelty detection method, one-class SVM has been used for "off-line" phenotype discovery [20]. However, two major points limit its application to high-throughput image-based screens, especially screens for cell shape regulators. First, in one-class SVM all the test samples are classified into two classes, "novel" and "known"; however, many high-throughput RNAi datasets may contain multiple diverse and unique novel phenotypes which should not necessarily be grouped together. Subsequent cluster analysis would be needed to identify and model different novel phenotypes following the use of one-class SVM. Second, each time a novel phenotype is discovered using one-class SVM, the support vectors need to be modified so that the newly discovered phenotype is included as "known" in the following loops; otherwise it will continue to be identified as novel in the future. As mentioned earlier, in a typical RNAi screen on thousands to tens of thousands of genes, with dozens of images for each RNAi and hundreds of cells in each image, such updating would involve millions of cells and is intractable.

Here we describe the development of an online phenotype discovery pipeline that we implemented in the context of a high-throughput image-based RNAi screen for regulators of cell shape. A simplified scheme of online phenotype discovery is shown in Figure 1. Online phenotype discovery demands adaptively identifying various novel phenotypes based on multiple existing phenotypes (e.g. those identified a priori by biologists), being sensitive and flexible to various new phenotypes, and avoiding frequent re-modelling using the large existing dataset. Our method includes two key components: phenotype modelling and iterative cluster merging. First, a Gaussian Mixture Model (GMM) is estimated for each existing phenotype following [21]. Second, iterative cluster merging is performed based on gap statistics. When a new image is incorporated, we sample the GMM of each existing phenotype and start a series of merging loops. In each loop, the image is combined with the sample set for one existing phenotype, and we estimate the cluster number in this combined dataset using gap statistics, with the GMM of the existing phenotype used as part of the reference distribution. If some cells in the new image are clustered together with samples from the existing phenotype, they are merged into the existing phenotype, i.e. they are included in the dataset of the existing phenotype and deleted from the new image. The iterations continue until the sample set from each existing phenotype has been combined with the new image and has merged with its counterpart (if any exists). Upon completion of all loops, the remaining cell groups in the new image are identified as candidate new phenotypes. By sampling the reference dataset from the new image and the existing phenotype separately, utilizing the GMM for existing phenotypes as (part of) the reference distribution, and involving existing clusters one by one, our method improves on the ideas in [12] and becomes more effective. Experimental results show that the proposed method is robust and efficient for online phenotype modelling and discovery in the context of diverse image-based screens, especially RNAi screens on Drosophila.
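The control flow of the merging loops described above can be sketched in a few lines of Python. This is only an illustration of the loop structure, not the authors' implementation: `merge_new_image`, `find_merged` and `near_centroid` are hypothetical names, and the callback `find_merged` stands in for the whole "cluster the combined set, estimate the cluster number with gap statistics, and report which new cells share a cluster with the phenotype samples" step. The toy callback used here is a plain distance threshold, chosen only so that the sketch runs on its own.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Point = Tuple[float, ...]


def merge_new_image(
    new_cells: Sequence[Point],
    phenotype_samples: Dict[str, List[Point]],
    find_merged: Callable[[Sequence[Point], Sequence[Point]], List[int]],
) -> Tuple[Dict[str, List[Point]], List[Point]]:
    """Run one merging loop per existing phenotype.

    find_merged(samples, remaining) is a hypothetical stand-in for the
    gap-statistics step; it returns indices into `remaining` of the cells
    that cluster together with the phenotype samples.
    """
    remaining = list(new_cells)
    merged: Dict[str, List[Point]] = {name: [] for name in phenotype_samples}
    for name, samples in phenotype_samples.items():
        if not remaining:
            break
        hit = set(find_merged(samples, remaining))
        # Merged cells join the existing phenotype and leave the new image.
        merged[name] = [c for i, c in enumerate(remaining) if i in hit]
        remaining = [c for i, c in enumerate(remaining) if i not in hit]
    # Whatever is left after all loops is a candidate novel phenotype.
    return merged, remaining


def near_centroid(samples, cells, radius=1.5):
    """Toy callback (assumption): merge cells close to the sample centroid."""
    dim = len(samples[0])
    centre = [sum(p[d] for p in samples) / len(samples) for d in range(dim)]
    return [
        i for i, c in enumerate(cells)
        if sum((c[d] - centre[d]) ** 2 for d in range(dim)) ** 0.5 <= radius
    ]


existing = {"round": [(0.0, 0.0), (0.2, -0.1)], "long": [(5.0, 5.0), (5.1, 4.9)]}
image = [(0.1, 0.1), (5.0, 5.2), (10.0, 10.0)]
merged, novel = merge_new_image(image, existing, near_centroid)
```

In this toy run, one cell is merged into each existing phenotype and the third cell is left over as a candidate novel phenotype. In the method described in the text, the callback would instead sample the phenotype's GMM and apply gap statistics to the combined set.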

Results

Synthetic dataset

Overcoming large sample size difference between two clusters

A difference between the sample numbers of distinct clusters could bias cluster number estimation. We propose to tackle this problem by using GMMs as reference distributions for existing phenotypes in gap statistics, and we validate our method using simulation.

Figure 1: Tasks and simple scheme of online phenotype discovery.

Each simulated dataset consists of observations from two populations, P1 and P2. Each population is sampled from a two-dimensional Gaussian distribution, with mean (0,0) for P1 and (0,3) for P2, and an identity covariance for both groups. Gap statistics [12] are used to estimate the number of clusters in the experiment dataset P1 ∪ P2. This method uniformly samples different reference datasets from the support of P1 ∪ P2; here the number of reference datasets is set to 20. The experiment dataset P1 ∪ P2 and the 20 reference datasets are then each clustered into k candidate clusters, for k = 1, ..., K, and we set K = 10. For each clustering result, we measure the compactness of the obtained clusters using the "within cluster dispersion". For each cluster number k, this dispersion is measured separately on the experiment dataset and on each reference dataset. The gap statistic for k, denoted gap(k), is defined as the average difference between the dispersion on the experiment dataset and that on each reference dataset; the standard deviation of this difference across the 20 reference datasets is denoted s_k.
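As a rough illustration of the two quantities just defined, the sketch below computes a within-cluster dispersion and then gap(k) and s_k from precomputed dispersions. It rests on assumptions that the text above does not state: dispersion is taken as the sum of squared distances to each cluster centroid, the comparison is made on a log scale with the sign chosen so that a larger gap means a more compact clustering, and s_k is the plain standard deviation (the commonly cited formulation of the gap statistic additionally applies a small correction factor). The function names are hypothetical.

```python
import math
from statistics import mean, pstdev


def within_cluster_dispersion(points, labels):
    """Sum of squared distances from every point to its cluster centroid."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    total = 0.0
    for members in clusters.values():
        dim = len(members[0])
        centroid = [mean(m[d] for m in members) for d in range(dim)]
        total += sum(
            (m[d] - centroid[d]) ** 2 for m in members for d in range(dim)
        )
    return total


def gap_and_spread(w_data, w_refs):
    """Return (gap(k), s_k) for one candidate cluster number k.

    w_data: dispersion of the experiment dataset clustered into k groups.
    w_refs: dispersions of the reference datasets clustered into k groups.
    Assumption: log-scale differences, reference minus experiment.
    """
    diffs = [math.log(w) - math.log(w_data) for w in w_refs]
    return mean(diffs), pstdev(diffs)


square = [(0, 0), (0, 2), (10, 0), (10, 2)]
w_two = within_cluster_dispersion(square, [0, 0, 1, 1])  # two tight pairs
w_one = within_cluster_dispersion(square, [0, 0, 0, 0])  # one loose cluster
gap, spread = gap_and_spread(4.0, [8.0, 8.0])
```

Splitting the four points into their two natural pairs lowers the dispersion sharply, which is the effect the gap statistic compares against the reference datasets.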

Some typical gap statistic curves are summarized in Figure 2 to illustrate the problem and validate our method. The X axis in Figure 2 indicates the candidate cluster number k; each data point denotes gap(k), and the error bar indicates s_k. A larger value of gap(k) means better compactness when the dataset is clustered into k clusters (compared with the reference datasets, which simulate homogeneous data), so an increase from gap(k-1) to gap(k) means that the clustering performance improves from k-1 to k. We take s_k into consideration when estimating cluster numbers following [12], and take the estimated cluster number as the first k with gap(k) > gap(k+1) - s_{k+1} (the candidate number whose data point is higher than the bottom of the error bar of its immediate right neighbor). Details of the gap statistics method are discussed in the Methods section.
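The selection rule can be written as a short function. This is a minimal sketch with a hypothetical name; the fallback of returning the largest candidate when no k satisfies the rule is an assumption, and the numbers in the example are illustrative rather than taken from the paper.

```python
def estimate_cluster_number(gap, s):
    """Pick the first k with gap(k) > gap(k+1) - s_{k+1}.

    gap[i] and s[i] hold gap(k) and s_k for k = i + 1.
    Assumption: if no k satisfies the rule, return the largest k tested.
    """
    for i in range(len(gap) - 1):
        if gap[i] > gap[i + 1] - s[i + 1]:
            return i + 1
    return len(gap)


# Illustrative values for k = 1..4 (not from the paper).
gap_values = [0.10, 0.40, 0.45, 0.30]
spreads = [0.02, 0.03, 0.08, 0.05]
k_hat = estimate_cluster_number(gap_values, spreads)
```

Here k = 1 is rejected because 0.10 is below 0.40 - 0.03, while k = 2 is accepted because 0.40 exceeds 0.45 - 0.08, even though gap(3) is larger than gap(2).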

When P1 and P2 each contribute 100 samples, gap statistics correctly estimates the cluster number as 2. We then set the sample number from P1 to 200, 300, 500, 700, 900 and 1000, while the sample number from P2 is fixed at 100. Figure 2, left and Figure 2, middle show that when the difference in sample number is over six-fold, gap statistics does not work correctly.
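The simulated datasets described above are simple to regenerate. The sketch below draws the two populations with the means and identity covariance given in the text (identity covariance means the two coordinates are independent with unit variance); the function name and the seeding scheme are assumptions made for reproducibility.

```python
import random


def simulate_two_populations(n1, n2, seed=0):
    """Draw n1 points around (0, 0) and n2 points around (0, 3).

    Both populations use an identity covariance, so each coordinate is an
    independent unit-variance Gaussian.
    """
    rng = random.Random(seed)
    first = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(n1)]
    second = [(rng.gauss(0.0, 1.0), rng.gauss(3.0, 1.0)) for _ in range(n2)]
    return first, second


# One dataset per size of the larger population; the smaller stays at 100.
datasets = {
    n1: simulate_two_populations(n1, 100, seed=n1)
    for n1 in (100, 200, 300, 500, 700, 900, 1000)
}
```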

A ten-fold difference in cell numbers between existing phenotypes and new images is not the worst situation we would face in online phenotype discovery. Based on the results given in Figure 2, middle, samples from two obviously different populations would be merged together. We propose to remove this bias by fitting a GMM for each existing phenotype and using that GMM as the reference distribution for the existing cluster in gap statistics.

Next, we consider P1 as an existing cluster with a known distribution model. If we use the Gaussian model for P1 to generate the reference dataset for gap statistics, the method gives the correct result in 87.4% of 500 experiments, even with a ten-fold difference in sample number. Figure 2, right shows one gap curve with a ten-fold difference, where Gap(1) - (Gap(2) - s_2) = -0.0006 while Gap(2) > Gap(3) - s_3. Thus, the estimated cluster number is 2 rather than 3 (although Gap(3) > Gap(2)), because the data point for k = 2 is higher than the bottom of the error bar for k = 3.
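One way to picture the modified reference distribution is the sketch below, which builds a single reference dataset in two parts: the part standing in for the existing cluster is drawn from its model, and the part standing in for the new data is drawn uniformly from the bounding box of the new points. Several details are assumptions rather than the authors' procedure: the model is reduced to one Gaussian with identity covariance (as in the simulation), the uniform support is taken to be an axis-aligned bounding box, and the function name is hypothetical.

```python
import random


def sample_mixed_reference(existing_mean, n_existing, new_points, rng):
    """Build one reference dataset from two separate sources.

    Existing cluster: n_existing draws from a Gaussian around existing_mean
    (identity covariance assumed; a fitted GMM would first pick a component).
    New data: one uniform draw per new point, inside their bounding box.
    """
    dim = len(existing_mean)
    from_model = [
        tuple(rng.gauss(existing_mean[d], 1.0) for d in range(dim))
        for _ in range(n_existing)
    ]
    lows = [min(p[d] for p in new_points) for d in range(dim)]
    highs = [max(p[d] for p in new_points) for d in range(dim)]
    from_uniform = [
        tuple(rng.uniform(lows[d], highs[d]) for d in range(dim))
        for _ in new_points
    ]
    return from_model + from_uniform


rng = random.Random(1)
new_cells = [(0.0, 2.0), (1.0, 4.0), (-1.0, 3.0)]
reference = sample_mixed_reference((0.0, 0.0), 10, new_cells, rng)
```

Because the larger cluster is represented by its own model rather than by a uniform box around everything, its sheer size no longer dominates the reference dispersion, which is the intuition behind the improvement reported above.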

Figure 2: Gap statistic curves for datasets with different sample numbers. Each curve represents an experiment on one real dataset, and twenty reference datasets are defined from this real dataset. For each data point, the value on the X-axis indicates how many clusters are defined on both the reference datasets and the real dataset, and the value on the Y-axis indicates the gap statistic for this cluster number, which is defined as the average difference of within cluster dispersions between the clustering results on the reference datasets and the real dataset. The error bars around the data points show the variation across different reference datasets. The estimated cluster number is defined as the X value of the first data point with a higher Y value than the bottom of the error bar of its immediate right neighbor. During the experiments, the "real" dataset consists of two clusters and different reference datasets are used. Left: a uniform reference distribution is used; sample number differences are equal, 2-fold, 3-fold and 5-fold from bottom to top; gap statistics works. Middle: a uniform reference is used; sample number differences are 7-fold, 9-fold and 10-fold from bottom to top; gap statistics fails. Right: two clusters with a 10-fold difference in sample number; a Gaussian distribution is used as the reference distribution for the cluster with the larger sample number, and the cluster number is estimated accurately.


The results are summarized in Table 1. Generally speaking, when two distinct clusters have a ten-fold difference in sample numbers, the original gap statistics method using a uniform reference fails to estimate the cluster number correctly, while our method of using an accurate model as the reference distribution overcomes this issue and gives a correct estimate of the cluster number.

Simulating typical cells using seven types of polygons

Seven types of polygons are defined based on the fluorescent cell images from a real genome-wide RNAi screen (protocol discussed later), and a simulation dataset is constructed from these different types of polygons. Information about these seven polygons is shown in Figure 3, and some details on generating these polygons are given in [Additional file 1].

2000 polygons of each of the seven types were generated and used as the training dataset, i.e. the set of existing phenotypes, and another 2000 polygons were generated for each type to serve as the testing dataset. In each experiment, we started from a certain set of existing phenotypes and built GMMs from the training samples. Meanwhile, we iteratively chose two of the seven polygon types and selected 100 polygons apiece from the testing dataset to form a synthetic test image; altogether 70 images can be formed for one experiment. Using the models estimated from the training set, we can identify existing and novel types of polygons in these synthetic images, and observe the performance of our method under different numbers of novel phenotypes and different orders of image input.
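The arithmetic behind the 70 images is 7 types × 2000 test polygons / (2 types × 100 polygons per image). The text does not say how the two types are paired for each image, so the sketch below uses one possible schedule, purely as an assumption: in each of ten rounds, type a is paired with type (a + offset) mod 7, and the offset cycles through 1, 2, 3 so that every unordered pair of types occurs. The function name and the slice representation are hypothetical.

```python
def build_synthetic_images(n_types=7, per_type=2000, chunk=100):
    """Plan synthetic test images, two polygon types per image.

    Each image is a list of two (type_index, start, stop) slices into that
    type's test polygons. The pairing schedule is an assumption: the source
    only states that two types are chosen iteratively.
    """
    next_start = [0] * n_types
    images = []
    rounds = per_type // (2 * chunk)  # each round consumes two chunks per type
    for r in range(rounds):
        offset = 1 + r % (n_types // 2)
        for a in range(n_types):
            b = (a + offset) % n_types
            image = []
            for t in (a, b):
                image.append((t, next_start[t], next_start[t] + chunk))
                next_start[t] += chunk
            images.append(image)
    return images


images = build_synthetic_images()
```

With the default arguments this yields 70 image plans, and every type contributes exactly 20 disjoint chunks of 100 polygons.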

Performance under different sets of existing phenotypes

Figure 4 shows the general performance of our method under different sets of existing phenotypes. We changed the number and composition of existing phenotypes for different experiments (different composition means that different phenotypes are considered as "existing"; GMMs for the existing phenotypes were estimated from the training dataset to start each experiment). For each set of existing phenotypes, we divided the testing set into 70 synthetic images and shuffled the order of image input 50 times. Whatever set of existing phenotypes we use, the synthetic images always contain all seven phenotypes. We define the accuracy for each polygon type as "the proportion of test samples restored into their original cluster". If a phenotype is used as an existing phenotype in an experiment, its accuracy is defined as the proportion of testing cells of this phenotype that are merged into the original existing cluster; if the phenotype is novel, its accuracy equals the proportion of testing cells of this phenotype that are left alone in a separate cluster after all the merging loops. The accuracies are then averaged to report the general performance of our method.
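The accuracy definition can be made concrete with a small function. This is a sketch under one interpretive assumption: for a novel phenotype, "left alone in a separate cluster" is read as membership in the single novel cluster that holds most of that phenotype's cells. The function name and the label convention (existing phenotype names versus arbitrary novel cluster ids) are hypothetical.

```python
from collections import Counter


def phenotype_accuracies(true_types, assigned, existing):
    """Per-phenotype accuracy and its unweighted average.

    true_types[i]: the type that generated sample i.
    assigned[i]:   name of the existing phenotype the sample was merged into,
                   or an id of a leftover (novel) cluster.
    existing:      set of phenotype names treated as existing.
    """
    per_type = {}
    for t in set(true_types):
        labels = [a for tt, a in zip(true_types, assigned) if tt == t]
        if t in existing:
            correct = sum(1 for a in labels if a == t)
        else:
            # Assumption: credit the largest novel cluster for this type.
            novel = Counter(a for a in labels if a not in existing)
            correct = novel.most_common(1)[0][1] if novel else 0
        per_type[t] = correct / len(labels)
    return per_type, sum(per_type.values()) / len(per_type)


truth = ["ellipse"] * 4 + ["star"] * 4
result = ["ellipse", "ellipse", "ellipse", "new-0",
          "new-0", "new-0", "new-1", "ellipse"]
per_type, average = phenotype_accuracies(truth, result, {"ellipse"})
```

In this toy case the existing phenotype scores 0.75 and the novel one 0.5, so the reported average is 0.625.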

Figure 4 shows that our method has a consistent accuracy of around 85% for different polygons under different conditions, and that the best performance is seen when the number of existing phenotypes is 3 or 4. When the number of existing phenotypes is 6, more false negatives appear as novel samples are merged into existing phenotypes, which underlines the importance of cluster validation and of more refined multiple hypothesis tests.

Box and whisker plots for performance under different conditions

As indicated earlier, the test datasets always consist of seven phenotypes. We can therefore change the conditions of the experiments (the number and composition of existing phenotypes and the order of image input) and observe the performance for a given phenotype to test the robustness of our method. Across experiments with different input orders of images and compositions of existing phenotypes, the performance values for a given phenotype are ranked, and the distribution of performance is illustrated using a box and whisker plot. We show the box plots for ellipses and 16-point stars in Figure 5 and explain the meaning of such plots in its caption. We observed that a larger variation of accuracy values correlates with a lower number of existing phenotypes. This is because the models of novel phenotypes are estimated from test samples input in the earlier stage of the experiment and are updated as new images are included. Thus, the accuracy has a larger variance across different image input orders when the number of existing phenotypes is low. This issue indicates the importance of a more robust strategy for model updating. However, the overall performance of our method is robust across different experimental conditions.

Table 1: Cross validation results on overcoming the bias of cluster size difference. By using distribution models as reference distributions, gap statistics can give correct results even under a 10-fold difference. Both accuracy columns report the average cluster number estimation accuracy in %.

Difference between sample number of P1 and P2 | Uniform reference distribution | GMM as reference distribution for P1, uniform reference for P2
Equal | 100 | 100
2-fold | 88.5 | 98.1
3-fold | 81.8 | 93.3
5-fold | 69.2 | 91.0
7-fold | <20 | 89.5
9-fold | <20 | 88.9
10-fold | <15 | 87.4

Figure 3: Information on seven polygon phenotypes used in simulation.

Performance comparison with SVM based methods

One-class SVM [20] tackles the novelty detection problem of differentiating novel phenotypes from known ones by estimating a distribution from the core structure of the existing dataset and modelling this distribution using a series of support vectors. It then labels each testing sample as "known" or "novel" using the model built upon the support vectors. Compared with the SVM used in classification, one-class SVM involves a parameter ν ∈ (0, 1) (denoted as 'Nu' in the figures). This parameter is used to define the core structure of the existing dataset, and it has two roles: it is the asymptotic upper bound on the fraction of training data labelled as outliers, and the lower bound on the fraction of support vectors among the training samples. However, one-class SVM by itself cannot handle problems such as the restoration of multiple existing phenotypes. We therefore modify one-class SVM to fit the scenario of online phenotype discovery. Each new image is combined with the support vectors trained from the existing samples, and novelty detection is carried out using one-class SVM with Gaussian kernels of width 0.5 and various values of ν to define the scale of support vectors and outliers. After novelty detection, each test sample labelled as "known" is passed to multiple linear SVM classifiers (each trained on one pair of existing phenotypes) and assigned to one of the existing phenotypes according to a majority vote among the classifiers. We detail one-class SVM and our modification in [Additional file 1].

We compared our method with the SVM based method, and Figure 6 shows the comparison results on two occasions. Four sets of parameters are used for one-class SVM. Figure 6 (left) shows the performance when ellipses are considered as the novel phenotype while the other six phenotypes serve as the existing dataset; Figure 6 (right) shows the performance when 16-point stars are the only novel phenotype. Accuracies for all seven phenotypes were measured and averaged across 100 experiments with different orders of image input. In both cases, with ν = 0.1 the SVM based method merged most of the new samples into existing phenotypes and gave the best performance on existing phenotypes (especially for 16-point stars), while with ν = 0.5 it left out most of the new samples as outliers and gave the best accuracy for the novel phenotype. Meanwhile, the difference between the best and worst accuracy of the SVM based method in one experiment could be as large as 25%. Our method outperformed most SVM based methods on the accuracies for at least 3 of 7 phenotypes, and the difference between its best and worst accuracy across the seven types is never greater than 10%.
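The two-stage SVM baseline described in this section (novelty detection followed by a majority vote among pairwise classifiers) can be sketched as follows. This is a minimal illustration in plain Python, not the implementation used in the comparison: `is_known`, `make_pair_classifier` and the one-dimensional `centres` are hypothetical stand-ins for the one-class SVM and the linear SVMs trained on pairs of existing phenotypes.

```python
from collections import Counter
from itertools import combinations

def assign_phenotype(sample, is_known, pairwise_classifiers):
    """Return 'novel', or the existing phenotype that wins the majority vote."""
    if not is_known(sample):
        return "novel"
    votes = Counter(clf(sample) for clf in pairwise_classifiers.values())
    # most_common(1) returns the label with the highest vote count
    return votes.most_common(1)[0][0]

# Hypothetical 1-D phenotype centres standing in for trained models.
centres = {"A": 0.0, "B": 5.0, "C": 10.0}

def make_pair_classifier(a, b):
    # Stand-in for a linear SVM trained on phenotypes a and b:
    # vote for whichever centre is closer to the sample.
    def clf(x):
        return a if abs(x - centres[a]) <= abs(x - centres[b]) else b
    return clf

pairwise = {(a, b): make_pair_classifier(a, b)
            for a, b in combinations(centres, 2)}

def is_known(x):
    # Stand-in for the one-class SVM decision: within 2 units of some centre.
    return min(abs(x - c) for c in centres.values()) <= 2.0
```

With three existing phenotypes there are three pairwise classifiers, so a sample close to one centre collects two of the three votes for that phenotype, while a sample far from all centres is labelled novel before any vote takes place.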

Figure 4. Performance of our method on synthetic datasets with different sets of existing phenotypes. For different numbers of "existing phenotypes" (X-axis), the performance on all seven types of polygons is summarized. Accuracy (Y-axis) indicates the ratio of test samples restored into their original clusters. All accuracy values are averaged across experiments with 50 different orders of image input and different compositions of existing phenotypes (for 1 to 6 existing phenotypes, we have 7, 21, 35, 35, 21 and 7 different compositions, respectively).


Figure 5. Box and whisker plots indicating the robustness of performance under different conditions. The accuracies of the experiments are sorted in descending order and plotted on the Y-axis; the two horizontal edges of each box indicate the upper and lower quartiles of the accuracy values, while the red line in the box body shows the median value. The whiskers extending from the ends of the boxes show the extent of the rest of the data, and red crosses (+) are outliers with accuracy values beyond 1.5 times the interquartile range. The performances on two polygon types are shown. Accuracy values of experiments with different image input orders but the same number of existing phenotypes are summarized in each box and whisker plot. Upper, performance on the ellipse phenotype; Lower, performance on the 16-point star phenotype.


Cell culture, image segmentation, morphological feature extraction and selection

Cell culture and image acquisition

As a next step, we implemented our methods in the context of a high-throughput image-based screen. In particular, we focused on a novel dataset of images acquired in the course of a genome-scale RNAi screen for regulators of the shape of Drosophila Kc167 cells, which have hemocyte-like properties (Bakal et al, unpublished). By using dsRNA to target and inhibit the activity of specific genes/proteins, the role of individual genes in regulating morphology can be systematically determined. Briefly, Kc167 cells are bathed in individual dsRNAs targeting all known Drosophila protein kinases and phosphatases in 384-well plates (detailed protocols are available at [22]). Following a 5-day incubation period, the cells are fixed and stained with reagents in order to visualize the nuclear DNA (blue channel in all images), polymerized F-actin (green), and α-tubulin (red). For each well, sixteen images from each of the three channels (blue, green and red) were acquired in an automated fashion using an Evotec spinning-disk confocal with a 60× water objective. Auto-focusing is performed in a two-step fashion by first focusing on the bottom of the well at each individual site, and then moving the objective by the same Z-distance (in this case to 3 μm above the bottom of the well) at each site. The images were captured at a binning of 2 and have a resolution of 661 × 481 pixels.

Image segmentation

To analyze the morphology of single cells, it is necessary to first delineate the boundaries of individual cells. Direct segmentation of the cell bodies in the F-actin and α-tubulin channels is difficult due to the complex morphology of cellular boundaries. Segmentation of nuclei in the DNA channel is relatively easier, and its results provide rough position information for the cell bodies. Herein, we utilize a two-step segmentation procedure [5,23] consisting of nuclei segmentation on the DNA channel, and cell body segmentation on images derived by combining the DNA, F-actin and α-tubulin channels.

In nuclei segmentation, the nuclei are first separated from the background by using a background correction based adaptive thresholding method [24]. However, clustered nuclei cannot be separated by the adaptive thresholding method. To separate them, the centres of the nuclei are first detected using a gradient vector field (GVF) based detection method [24]. Specifically, we filter the nuclei image using a Gaussian filter, which suppresses the noise and generates local maxima inside cells; these local maxima correspond to the nuclei centres. However, some local maxima are still due to noise. To further eliminate the noisy local maxima, we detect the true cell centres using the GVF method. It is a well-known fact that in an electric field, the electric field lines point to the positive electrodes, and the free negative electrons move along the electric field lines and stop at these electrodes. In GVF, the gradient-vector lines likewise point to the local maxima. Analogous to the electrons moving inside the electric field, we put one particle on each detected cell pixel and push it along the gradient vector lines. Consequently, these particles stop at the local maxima. Since no or very few particles stop at non-maxima and noisy local maxima, the true cell centres can be identified by choosing the points at which many particles stop [24]. After the centres of nuclei are detected, the nuclei are segmented using the marker-controlled watershed algorithm.
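The particle-counting idea can be illustrated with a small sketch. Note the simplification: instead of a true gradient vector field, each particle here follows steepest ascent over the 8-neighbourhood of an already smoothed intensity grid. The function names `climb` and `detect_centres` and the parameters `foreground_threshold` and `min_particles` are assumptions made for this illustration and do not come from [24].

```python
from collections import Counter

def climb(img, r, c):
    """Move uphill (steepest ascent, 8-neighbourhood) until no neighbour is brighter."""
    rows, cols = len(img), len(img[0])
    while True:
        best = (r, c)
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                inside = 0 <= nr < rows and 0 <= nc < cols
                if inside and img[nr][nc] > img[best[0]][best[1]]:
                    best = (nr, nc)
        if best == (r, c):
            return best  # the particle has stopped at a local maximum
        r, c = best

def detect_centres(img, foreground_threshold, min_particles):
    """Release one particle per foreground pixel; keep maxima that collect many particles."""
    counts = Counter()
    for r, row in enumerate(img):
        for c, value in enumerate(row):
            if value > foreground_threshold:
                counts[climb(img, r, c)] += 1
    return sorted(p for p, n in counts.items() if n >= min_particles)
```

A local maximum caused by an isolated noisy pixel collects only its own particle and is therefore filtered out by `min_particles`, whereas each real nucleus attracts the particles of all pixels on its slopes.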

Figure 6. Performance comparison between our methods and SVM based methods on two occasions. In each experiment, six polygon types are used as existing phenotypes. Accuracy denotes the ratio of samples restored to their original phenotypes. All the accuracy values are averaged across 100 tests having different orders of image input. Four different sets of parameters are used for the SVM based method. Left: ellipses serve as the novel phenotype and the other six serve as existing phenotypes; Right: 16-point stars serve as the novel phenotype.

To use the information in both the F-actin and α-tubulin channels, we combine the two channels' signals as $I = (I_{\text{F-actin}} + I_{\alpha\text{-tubulin}})/2$, where $I$, $I_{\text{F-actin}}$ and $I_{\alpha\text{-tubulin}}$ denote the combined image, the F-actin channel image and the α-tubulin channel image, respectively. We then segment the cell bodies using the combined image. First, the cell bodies are separated from the background using the aforementioned adaptive thresholding algorithm. The nuclei segmentation results facilitate the segmentation of cell bodies by providing rough position information for the cell bodies. Herein, we employ the marker-controlled watershed and the nuclei segmentation results to segment the individual cell bodies. To reduce the over-segmentation of cell bodies, a feedback system proposed in [5] is employed. Three scoring models, which measure the morphological appearance, gradient and edge intensity of cell pairs respectively, are built to identify the over-segmented cell bodies and guide the merging procedure [5]. Detailed shape and boundary information of nuclei and cell bodies is obtained after the two-step segmentation procedure.
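The channel combination used before cell body segmentation is a pixel-wise average. A minimal sketch, assuming both channels are given as equally sized nested lists of grey values (the function name `combine_channels` is hypothetical):

```python
def combine_channels(f_actin, alpha_tubulin):
    """Average two equally sized grayscale images pixel by pixel."""
    if len(f_actin) != len(alpha_tubulin):
        raise ValueError("images must have the same number of rows")
    return [
        [(a + t) / 2 for a, t in zip(row_a, row_t)]
        for row_a, row_t in zip(f_actin, alpha_tubulin)
    ]
```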

Morphological feature extraction and feature selection

Cellular phenotype identification depends on choosing a rich set of descriptive features, which is one of the most critical steps in pattern recognition problems. To capture the geometric and appearance properties, 211 morphology features belonging to five categories are extracted following [23]. The extracted features include a total of 85 wavelet features (70 of them from the Gabor wavelet transformation [25] and 15 from the 3-level CDF97 wavelet transformation [26]), 10 geometric region features describing the shape and texture characteristics of cells [23], 48 Zernike moment features with a selected order of 12 [27], 14 Haralick texture features [28] and a total of 54 phenotype shape descriptor features (36 features of the ratio length of the central axis projection and 18 features of the area distribution over equal sectors) [23]. A feature selection procedure is necessary to de-noise the dataset and describe it in the most informative way. As the datasets and phenotype models are updated adaptively, an unsupervised feature selection that does not rely on phenotype labels is used to supply a stable feature subset. It is based on iterative feature elimination using k nearest neighbour features following [29]. In this study, an informative subset of fifteen features is selected to quantify the segmented cells.

Online phenotypes discovery in the context of RNAi high-throughput screenings

Fitting GMM model for existing phenotypes

We first performed a visual examination of a subset of the dataset in order to define images and cellular morphologies of typical normal cells, as well as cells in three distinct cellular phenotypes. We termed these categories "Long Punctuate Actin (LPA)", "Cell Cycle Arrest (CCA)" and "Rho1 (Rho)", collected images in these categories and combined them with images of 1583 normal cells to form our cell database of existing phenotypes. We estimated a GMM for each existing phenotype using the EM algorithm. A single Gaussian term is obtained for the normal and LPA phenotypes, while the CCA and Rho phenotypes end up with 2 and 3 Gaussian terms respectively; the covariance matrices are set to be diagonal. A brief introduction of each existing phenotype is shown in Figure 7.

Figure 7. Information of four existing phenotypes in training dataset.

Case 1: merging cells in existing phenotypes

To validate our method's ability to restore three existing phenotypes (Normal, LPA and CCA), we carried out five-fold cross validation 100 times. In each experiment, 20% of the cells from each existing phenotype were taken out and combined to form a flow of new images; these cells were divided into seven groups, simulating seven new images. We defined the merging accuracy for each phenotype as the proportion of test samples merged into their original phenotype. The mean and standard deviation of the accuracies across 100 experiments are shown in Table 2.

Our method can identify and merge cells into their original phenotypes well. In the third column of Table 2, we list the typical mistakes made during the merging loops. Some cells with normal and LPA phenotypes are not merged correctly, and such mistakes suggest the existence of previously undefined phenotypes in such images.
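The evaluation protocol of Case 1 can be sketched as below. This is a simplified illustration rather than the evaluation code behind Table 2: it draws a single random 20% hold-out instead of rotating through five folds, and the names `holdout_groups` and `merging_accuracy` as well as the seed handling are assumptions.

```python
import random

def holdout_groups(cells_by_phenotype, holdout_fraction=0.2, n_groups=7, seed=0):
    """Hold out a fraction of each phenotype and split the held-out cells into groups."""
    rng = random.Random(seed)
    training, held_out = {}, []
    for phenotype, cells in cells_by_phenotype.items():
        shuffled = list(cells)
        rng.shuffle(shuffled)
        n_test = round(len(shuffled) * holdout_fraction)
        held_out += [(phenotype, cell) for cell in shuffled[:n_test]]
        training[phenotype] = shuffled[n_test:]
    rng.shuffle(held_out)
    # Each group simulates one incoming image containing mixed phenotypes.
    groups = [held_out[i::n_groups] for i in range(n_groups)]
    return training, groups

def merging_accuracy(true_labels, assigned_labels, phenotype):
    """Proportion of test cells of one phenotype that were merged back into it."""
    pairs = [(t, a) for t, a in zip(true_labels, assigned_labels) if t == phenotype]
    return sum(t == a for t, a in pairs) / len(pairs)
```

A cell that is "left alone" or merged into another phenotype counts as a mistake for its true phenotype, which matches the "Typical Mistakes" column of Table 2.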

Case 2: discovering new phenotypes: cross validation based on known phenotypes

To validate our method's ability to discover new phenotypes, we performed the following experiments after the a priori identification of four existing phenotypes (Normal, LPA, CCA and Rho1) by biologists. In each experiment, three of the four phenotypes were considered as existing phenotypes, and cells in the remaining phenotype were divided into groups of 100 cells. These cell groups represent an incoming dataset of new images containing "novel phenotypes". Given only one phenotype in the testing images, ideally in each experiment our method would identify the first of such cell groups as a new phenotype, model this phenotype, and assign all the other testing images to this "novel phenotype", and no testing cells would be merged by any of the other three phenotypes. Fifty experiments were performed to identify each phenotype; 1000 cells served as the "novel group" for the normal and LPA phenotypes, while 600 cells were used for the CCA and Rho1 phenotypes. The experimental results are summarized in Table 3. "Accuracy" is defined as the ratio between the number of testing cells assigned to a single cluster and the total number of testing cells, and this accuracy is calculated after all three merging loops for each experiment.

Case 2 shows our method's ability to identify novel phenotypes. We hypothesize that the relatively low accuracy for the CCA and Rho phenotypes can be attributed to the small number of samples and an incomplete understanding of which phenotype is the biological representative of the entire treatment class. High classification accuracies for normal cells in both case 1 and case 2 provide strong validation of the ability of our methods to identify wild-type cells. Meanwhile, the overlap between the normal and LPA phenotypes serves as a starting point for novel phenotype discovery.

Case 3: identifying multiple novel phenotypes from online image input and performance comparison with SVM based methods

In this case, we again used the test dataset from case 2, which included a total of 3,200 cells from four phenotypes. In each group of experiments we started from the models (available from the previous step) of two existing phenotypes; all 3,200 test cells were divided into 32 images with cells from two phenotypes in each image, and all images were input in 50 different orders. Altogether, three groups of experiments were carried out, and in each group the normal phenotype was paired with one of the other three phenotypes to serve as the set of "existing phenotypes". Both our method and SVM based methods were used in each experiment, and we can thus validate our method's ability to deal with multiple novel phenotypes as well as its performance under different orders of image input.

Figure 8 summarizes the average performance of our method and SVM based methods for each phenotype across 50 experiments. The results from different sets of existing phenotypes are shown separately, and "accuracy" is defined as the proportion of test samples restored into their original cluster. The performance of our method degraded with the reduced number of existing phenotypes compared with case 1 and case 2, especially for the Rho1 phenotype. However, it still performed consistently on existing and novel phenotypes, never failed to reach the 80% mark and outperformed the SVM based method in at least two of four phenotypes on all occasions.

Figure 9 shows the box and whisker plots for our method under three different sets of existing phenotypes. The variation of accuracy across different orders of image input can be as large as 8% (Figure 9 bottom, accuracy for the CCA phenotype), but the accuracy is never less than 80%. For the two experiments with the lowest accuracies, shown as outliers in the box and whisker plots (Figure 9 top, CCA and Figure 9 bottom, LPA), the order of image input was examined, and on both occasions we observed a small group of novel cells emerging at the beginning of the image flow with all their counterparts appearing very late. Thus, the model for the novel phenotype was built on only a small group of cells, and most test cells that appeared later were merged by similar phenotypes with more stable models. This indicates the necessity of better cluster validation and model updating solutions, which are proposed in our method.

Table 2: Cross validation results on merging cells into existing phenotypes

| Phenotype | Average merging accuracy (standard deviation), % | Typical mistakes |
|-----------|--------------------------------------------------|------------------|
| Normal    | 98.6 (1.7)                                       | Left alone |
| LPA       | 93.8 (2.9)                                       | Merged into Normal, left alone |
| CCA       | 92.4 (2.4)                                       | Left alone |

Having multiple phenotypes in a single image is a challenge in the analysis of image based high-throughput screens. Our method successfully tackles such cases and identifies multiple phenotypes from online image input, and can therefore provide a better perspective for further quantification of the whole image, en route to identifying the role of each gene.

Identifying "rl/tear-drop" phenotype in the context of Drosophila genome-scale RNAi screen

Typically, a new image incorporated into the dataset may have fewer than fifty cells, which can severely impact the ability to quantitatively determine and distinguish novel phenotypes. Such images can still be incorporated into our analysis by discarding images with fewer than 10 cells and pooling cells from multiple new images into a combined dataset of at least 100 cells. We collected cells from four existing phenotypes, assembled them as the training dataset, modelled the existing phenotypes and used these phenotype models to analyze images acquired as part of a Drosophila genome-scale RNAi screen. Implementation of our method revealed a phenotype that was undetected by visual inspection, which we termed "rl/tear-drop". These cells are small in size and have smooth boundaries and a non-round shape. Figure 10 supplies some information and typical images for this phenotype. The phenotype was initially discovered in wells where cells had been incubated in the presence of rl RNAi. Drosophila rl, or rolled, is the homolog of mammalian ERK kinase and is a central regulator of a host of cellular processes [30]. Importantly, the rl/tear-drop phenotype was not scored during human inspection of the images. The detection of cells with the rl/tear-drop phenotype in an image or well often correlates with the detection of cells with an LPA phenotype, and we hypothesize that rl/tear-drop represents a phenotype that occurs as cells transition from normal to LPA. We collected more than 500 cells with the rl/tear-drop phenotype through analysis of replicate experiments, modelled this phenotype with a GMM, and used it to analyze cells from the new images. Our method detects rl/tear-drop cells in experiments where CG10673, CG7236, Nipped-A and Pten have been targeted by RNAi. Although the nature of the relationship among these genes needs further investigation, we have previously identified Nipped-A and Pten as regulators of ERK activity following insulin stimulation [2], suggesting a relevant relationship between these genes. Altogether, these results demonstrate that our online phenotype discovery methods can be used to provide unexpected and novel biological insight.
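The pooling rule for sparse images (discard images with fewer than 10 cells, then accumulate cells until a combined set of at least 100 is available) can be sketched as a small generator. The name `pool_images` and the choice not to emit an incomplete final pool are assumptions made for this illustration.

```python
def pool_images(images, min_cells_per_image=10, min_pool_size=100):
    """Yield combined cell lists, each holding at least min_pool_size cells."""
    pool = []
    for cells in images:
        if len(cells) < min_cells_per_image:
            continue  # too few cells: the image is discarded
        pool.extend(cells)
        if len(pool) >= min_pool_size:
            yield pool
            pool = []
    # Cells remaining in an incomplete pool are not yielded in this sketch.
```

In an online setting the incomplete pool would simply wait for the next incoming image, so no cells from sufficiently populated images are lost.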

Table 3: Cross validation results on discovering new phenotypes

| Phenotype to be identified | # of cells to be identified | Average accuracy (standard deviation), % | Typical mistakes |
|----------------------------|-----------------------------|------------------------------------------|------------------|
| Normal                     | 1000                        | 95.2 (3.5)                               | Left alone |
| LPA                        | 1000                        | 90.3 (3.1)                               | Merged into Normal, left alone |
| CCA                        | 600                         | 89.6 (2.7)                               | Merged into Rho1 |
| Rho1                       | 600                         | 87.4 (4.2)                               | Left alone |

Figure 8. Performance comparison between our method and SVM based methods with multiple phenotypes in images. Given a certain group of existing phenotypes, and images including multiple phenotypes, the accuracy values for four phenotypes across 50 image input orders are shown. Four different sets of parameters are used for the SVM based method. Left: Normal and LPA as existing phenotypes; Middle: Normal and CCA as existing phenotypes; Right: Normal and Rho as existing phenotypes.

Figure 9. Box and whisker plots indicating the robustness of performance with multiple phenotypes in images. The accuracies of the experiments are sorted in descending order and plotted on the Y-axis; the two horizontal edges of each box indicate the upper and lower quartiles of the accuracy values, while the red line in the box body shows the median value. The whiskers extending from the ends of the boxes show the extent of the rest of the data, and red crosses (+) are outliers with accuracy values beyond 1.5 times the interquartile range. Given a certain group of existing phenotypes, and images including multiple phenotypes, the accuracy values for four phenotypes across 50 image input orders are shown. Top: Normal and LPA are existing phenotypes; Middle: Normal and CCA are existing phenotypes; Bottom: Normal and Rho are existing phenotypes.

We also tested our method using two published datasets, from a Drosophila RNAi screen [7] and from HeLa cell cycle phase detection [ref. s3 in Additional file 1], compared its performance with SVM based methods and validated our method's ability to handle datasets from various organisms. These experiments are described in [Additional file 1]; the results on the Drosophila dataset from [7] are reported in [Additional file 2] and the results on the HeLa dataset are reported in [Additional file 3, 4].

All other functions were developed in Matlab 7.0 and run on a PC with an Intel® Core™ 2 T7200 2.00 GHz CPU and 2.00 GB of RAM. Starting from four existing phenotypes, the average running time of our method based on improved gap statistics is 1.8 seconds on a group of 100 segmented cells, 10.2 seconds on a group of 600 cells and 19.4 seconds on a group of 1000 cells. Considering that the cell number in each image is seldom over 300 in the reported high content screen, our method is suitable for online application.

Discussion

Online identification and validation of novel morphological phenotypes are major challenges in specific high-throughput image-based screens. Manual phenotype labelling of high-throughput image-based data is a laborious and inordinately time-consuming process, while available automatic identification methods usually classify cells into a limited set of predefined phenotypes which may be determined through biased means and will not be updated according to the online image input. As millions of images are now generated during the course of a comprehensive genome-scale screen, new methods are needed to effectively identify novel phenotypes in such massive databases. Here we report the development of an online phenotype discovery method which models existing phenotypes, compares cells in new images with existing phenotype models through cluster analysis, assigns some new cells to existing phenotypes, and finally identifies and validates novel phenotypes online.

GMM is used for modelling existing phenotypes, and gap statistics, with the GMM as the reference distribution for the existing phenotype, play a key role in cluster analysis and merging. We built a GMM for each existing phenotype, sampled datasets from the model and used them as the reference distribution in the gap statistics method; following this pipeline, we can cover the complete properties of phenotypes more efficiently. Furthermore, gap statistics deal with only one existing phenotype plus a part of the new image in each merging loop, and the content of the new image is iteratively updated by the merging procedure. We present Additional file 5 to validate the idea of modelling existing phenotypes using GMM; the detailed information of the GMMs estimated from the four existing phenotypes in our real dataset is reported there along with histograms for some typical features.

Figure 10. Information for the rl/tear-drop phenotype. "Typical cells" summarizes the properties of the rl/tear-drop phenotype and "Typical image" shows an image with cells merged by the rl/tear-drop, LPA and Normal phenotypes respectively.

For the analysis of high content screen data, many researchers choose to summarize the information of single cells or objects to supply a normalized signature for higher level concepts (e.g. treatment conditions, genes, complexes, etc). Thus it is critical to identify the different phenotypes related to the same treatment condition. Our method can be used to identify multiple phenotypes in a single well and supply detailed insight into related questions. The performance of our methods relies on the quality of the image processing, feature selection, phenotype modelling, and cluster analysis methods. Using iterative cluster merging, our future goals are to build more reliable phenotype models and to construct complete pipelines of cluster analysis with detailed validation procedures to obtain more reliable definitions of clusters.

Conclusion

We propose an online phenotype discovery method for high-throughput RNAi screens, which can be used in the course of many image-based screens. This method is based on adaptive phenotype modelling and iterative cluster merging using improved gap statistics. Given datasets for existing phenotypes, the method can build a model of each existing phenotype, identify novel phenotypes in images obtained from ongoing screening and assign newly obtained cell images to different phenotypes. Compared with traditional novelty detection techniques, our approach avoids frequent re-modelling involving the huge existing dataset and can handle multiple existing phenotypes in a flexible manner. Implementation of our methods in the analysis of images acquired during a genetic screen for regulators of Drosophila cell morphology demonstrates the power of these computational tools in efficiently discovering meaningful new phenotypes.

Methods

Online cluster discovery: problem formulation

Suppose we have identified $K_0$ non-overlapping cellular phenotypes. The $i$-th cell in the $m$-th existing phenotype is denoted by the vector $\mathbf{s}_i^{(m)} = [s_{i,1}^{(m)}, s_{i,2}^{(m)}, \ldots, s_{i,p}^{(m)}]$, with each cell described by $p$ morphological features. Let $S_m = \{\mathbf{s}_i^{(m)}\}_{i=1}^{u_m}$ denote the dataset for the $m$-th phenotype, with $u_m$ indicating the number of cells for the $m$-th phenotype. We then denote the dataset of all available cells, $S$, as:

$$S = \bigcup_{m=1}^{K_0} S_m \quad \text{s.t.} \quad S_m \cap S_n = \emptyset, \ \forall\, m \neq n, \ m, n \in \{1, 2, \ldots, K_0\} \qquad (1)$$

and the total number of existing cells is $u = \sum_{m=1}^{K_0} u_m$. Similarly, when a new image $E$ is obtained, the $i$-th cell in this image is also described using $p$ features and denoted by $\mathbf{e}_i = [e_{i1}, e_{i2}, \ldots, e_{ip}]$, and $E = \{\mathbf{e}_i\}_{i=1}^{v}$, where $v$ is the number of cells in $E$. New images are continuously obtained, and each new image contains tens of cells while there are thousands of cells for each $S_m$, thus $v \ll u_m < u$.

Given a new image $E$, we need to adaptively determine the number of new phenotypes $K_{new}$, based on $K_0$, $S$ and $E$. Cells in $E$ that belong to some existing phenotype $S_m$ should be identified and used to update the model for $S_m$. It is unfeasible to involve every single cell in $S$ in cluster discovery, because the large scale of $S$ could bias cluster analysis towards existing phenotypes and add computational burden. On the other hand, a "new cluster" identified only according to $E$ is vulnerable to outliers. Thus an efficient method to utilize $S$ is necessary.

Outline of the proposed approach

We propose to discover new clusters through iterative cluster merging. The dataset of each existing phenotype $S_m$ is first fit to a GMM, and a sample dataset $S'_m$ is obtained from this model. Each $S'_m$ is combined with a new image $E$, one by one, to detect possible new phenotypes. Our online phenotype discovery method is outlined below:

(1) Phenotype modelling. A GMM is fit to each existing phenotype using the Expectation-Maximization (EM) algorithm following [21].

(2) Sampling an existing phenotype and combining existing information with the new image. We sample from the GMM of one existing cluster, say $S_m$, $m \in \{1, 2, \ldots, K_0\}$, get the sample set $S'_m$, and put $S'_m$ together with the new image $E$. We denote this combined set as $F$, thus

$$F = S'_m \cup E \qquad (2)$$

$S'_m$ should have a cell number comparable to $v$, the cell number in $E$, so that phenotype information would not be overwhelmed due to the limited cell number of $E$. We empirically set the sample number of $S'_m$ to between $v$ and $5v$.

(3) Estimating the cluster number in $F$. An improved gap statistics method is used, in which we take reference datasets from the range of feature values of $S'_m$ and $E$ separately, and use the GMM as the reference distribution for reference samples obtained from the support of $S'_m$.

(4) Defining clusters on $F$. Based on the estimated cluster number from step 3, a partition of $F$ is obtained using the Partitioning Around Medoids (PAM) [31] method.

(5) Merging samples from $E$ into existing phenotypes.

(a) If some samples from $E$ are assigned to the same cluster as at least 95% of the samples from $S'_m$, they are considered as candidates for merging.

(b) Validate the merging operation using a statistical test with Bonferroni correction. For each merging candidate, calculate its p value under the GMM for $S_m$; reject the merging operation and keep this sample in $E$ if the p value is smaller than $1/K_0$, or else merge this candidate into $S_m$ and delete it from $E$.

(6) Returning to step 2 to sample another existing phenotype and start a new merging loop.

(7) Updating phenotype models. After each existing phenotype $S_m$, $m \in \{1, 2, \ldots, K_0\}$, has merged its counterparts in $E$, define the clusters left in $E$ as new phenotypes and estimate GMMs for them.

Through modelling and re-sampling, $S$ becomes more flexible and re-useable, allowing us to cover the complete properties of phenotypes. Following data modelling and sampling, the information from existing phenotypes is combined with the new image one phenotype at a time in the loops from step 3 to 5. Thus in each single loop, the task of estimating the cluster number is simplified to identifying the difference between the new image and only one existing phenotype. After each $S_m$ merges its counterpart in $E$, the clusters left in $E$ are identified as new phenotypes.

Cluster modelling and sampling

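Recall from equation (1) that the dataset of existing phenotypes is $S = \bigcup_{m=1}^{K_0} S_m$ with $S_m \cap S_n = \emptyset$ for $m \neq n$, where $S_m = \{\mathbf{s}^{(m)}_i\}_{i=1}^{u_m}$, each cell $\mathbf{s}^{(m)}_i = [s^{(m)}_{i,1}, s^{(m)}_{i,2}, \ldots, s^{(m)}_{i,p}]$ is a vector of $p$ features, $u = \sum_{m=1}^{K_0} u_m$, and the new image is $E = \{\mathbf{e}_i\}_{i=1}^{\nu}$. Given the dataset $S$, we model each phenotype $S_m$ using a GMM:

$$S_m \sim \sum_{t=1}^{Q_m} \pi_{m,t}\, N(\mu_{m,t}, \Sigma_{m,t}), \quad m = 1, 2, \ldots, K_0, \qquad \text{s.t.}\ \sum_{t=1}^{Q_m} \pi_{m,t} = 1,\ \pi_{m,t} \ge 0 \qquad (3)$$

where N denotes the Gaussian distribution. We denote the number of Gaussian terms for phenotype $S_m$ as $Q_m$ and define the parameters for $S_m$ as

$$\boldsymbol{\pi}_m = \{\pi_{m,t}\}_{t=1}^{Q_m}, \quad \boldsymbol{\mu}_m = \{\mu_{m,t}\}_{t=1}^{Q_m}, \quad \boldsymbol{\Sigma}_m = \{\Sigma_{m,t}\}_{t=1}^{Q_m}.$$

Initially, the covariance matrix $\Sigma_{m,t}$ is set to be diagonal. We use the expectation-maximization (EM) algorithm to estimate $\{\boldsymbol{\pi}_m, \boldsymbol{\mu}_m, \boldsymbol{\Sigma}_m\}$ from $S_m$. In the initialization of the EM algorithm, $Q_m$ is set to four, $S_m$ is first partitioned into $Q_m$ clusters using the fuzzy C-means clustering method, and the initial parameters are then estimated using the standard vector quantization method. For each class, $Q_m$ is reduced to the minimum possible using the minimum description length (MDL) technique, following [21]. We obtain random samples from the GMM to form the set $S'_m$, whose elements are i.i.d. with $S_m$, and $S'_m$ is combined with the new image $E$ to form $F$. When estimating the cluster number in $F$ using gap statistics, the GMM is used as the reference distribution for $S'_m$.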
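As an illustration of the re-sampling step only, the following minimal Python sketch draws i.i.d. samples from a diagonal-covariance Gaussian mixture to form $S'_m$. It is not the authors' implementation: the function names `sample_gmm` and `sample_size` and the parameter layout are hypothetical, the mixture parameters are assumed to have been estimated already (EM, fuzzy C-means initialization and MDL pruning are not shown), and the sample-size helper simply encodes the empirical range of ν to 5ν stated in the sampling step.

```python
import random

def sample_gmm(weights, means, variances, n, rng):
    """Draw n i.i.d. points from a diagonal-covariance Gaussian mixture.

    weights[t] plays the role of pi_{m,t}; means[t] and variances[t] are
    per-feature lists (the diagonal of Sigma_{m,t}). Names are illustrative.
    """
    if abs(sum(weights) - 1.0) > 1e-9 or min(weights) < 0:
        raise ValueError("mixing weights must be non-negative and sum to 1")
    samples = []
    for _ in range(n):
        # Pick a Gaussian term t with probability pi_{m,t}.
        t = rng.choices(range(len(weights)), weights=weights)[0]
        # Diagonal covariance: every feature is drawn independently.
        samples.append([rng.gauss(mu, var ** 0.5)
                        for mu, var in zip(means[t], variances[t])])
    return samples

def sample_size(nu, factor=1.0):
    """Number of samples for S'_m, kept between nu and 5 * nu."""
    if not 1.0 <= factor <= 5.0:
        raise ValueError("factor outside the empirical range [1, 5]")
    return int(round(factor * nu))

# Example: a two-term mixture in two features, sampled for an image of 100 cells.
rng = random.Random(0)
s_prime = sample_gmm([0.3, 0.7], [[0.0, 0.0], [5.0, 5.0]],
                     [[1.0, 1.0], [1.0, 1.0]], sample_size(100, 2), rng)
```

Passing the random generator explicitly keeps the draw reproducible, which is convenient when the same phenotype model is re-sampled in every merging loop.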

Estimating cluster numbers using improved gap statistics

To estimate the number of clusters from an unlabeled dataset, many existing methods focus on the within-cluster dispersion $W_k$ that results from clustering a dataset (e.g. $F$) into k clusters $C_1, C_2, \ldots, C_k$, with $C_r$ denoting the indices of the samples in cluster r, $n_r$ the number of samples in cluster r, and $f_{i,j}$ the value of the j-th feature measured from the i-th data point. Based on

$$D_r = \sum_{i, i' \in C_r} \sum_{j=1}^{p} (f_{i,j} - f_{i',j})^2,$$

we have

$$W_k = \sum_{r=1}^{k} \frac{1}{2 n_r} D_r.$$

$W_k$ tends to decrease monotonically as the number of clusters k increases, but from some k on the decrease flattens markedly. Statistical folklore has it that an error measure based on $W_k$ should have an "elbow" at the desirable cluster number, and different criteria based on $W_k$ have been defined accordingly.

The gap statistics method [12] utilizes the output of any clustering algorithm under different k and compares the change of $W_k$ to the dispersion expected under a reference null distribution; the gap between the logarithms of these two dispersions is employed to detect the cluster number. Gap statistics can distinguish homogeneous non-clustered data from the alternative of clustered data [14]. This ability is critical when all cells in the new image belong to the same phenotype.
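The dispersion defined above can be checked on a toy example. The sketch below is illustrative only (the function names are ours, not part of the method description) and computes $D_r$ and $W_k$ literally from their definitions, summing over ordered pairs $(i, i')$ within each cluster.

```python
def pairwise_dispersion(points):
    """D_r: sum of squared Euclidean distances over ordered pairs (i, i')."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(p, q))
        for p in points for q in points
    )

def within_dispersion(points, labels):
    """W_k = sum over clusters r of D_r / (2 * n_r)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    return sum(pairwise_dispersion(c) / (2 * len(c)) for c in clusters.values())

# Four one-feature points forming two obvious groups.
points = [[0], [2], [10], [12]]
w1 = within_dispersion(points, [0, 0, 0, 0])  # one cluster:  104.0
w2 = within_dispersion(points, [0, 0, 1, 1])  # two clusters: 4.0
```

With squared Euclidean distances, $D_r / (2 n_r)$ equals the sum of squared distances to the cluster mean, which is why $W_1 = 36 + 16 + 16 + 36 = 104$ here and why the drop to $W_2 = 4$ is so sharp once the two groups are separated.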

To estimate the cluster number in $F$ (defined in equation (2)), we first select a number K that is larger than the expected cluster number, e.g. K = 5 in our case. For each k = 1, 2, ..., K, $F$ is divided into k clusters, which gives a series of $W_k$. Then we generate B reference datasets from $F$, cluster each of them into k clusters and obtain $W^*_{kb}$, b = 1, 2, ..., B, k = 1, 2, ..., K. We use PAM [31] for clustering. Considering the mean dispersion $\bar{l} = \frac{1}{B}\sum_b \log(W^*_{kb})$ across B = 20 reference datasets and their standard deviation $sd_k = \sqrt{\frac{1}{B}\sum_b (\log(W^*_{kb}) - \bar{l})^2}$, the term $s_k = sd_k \sqrt{1 + 1/B}$ is taken into consideration for better control over the rejection of the null model [12], and the estimated cluster number k' is:

$$\begin{aligned} Gap(k) &= \frac{1}{B}\sum_b \log(W^*_{kb}) - \log(W_k) \\ diff(k) &= Gap(k) - (Gap(k+1) - s_{k+1}) \\ k' &= \inf\{k \in \{1, 2, \ldots, K\} : diff(k) \ge 0\} \end{aligned} \qquad (4)$$

The reference distribution is a null model of the data structure. In [12], reference datasets are sampled uniformly either from the range of observed values for each feature, or from a box aligned with the principal components of the data. However, it is preferable to estimate the reference distribution from existing samples rather than simply using a uniform distribution, because the bounding box of the whole dataset always includes some "blank" area. The existing cluster definition can help us focus on where the data really lie, and avoid generating a reference dataset that violates the properties of the original one. Here, we sampled $S'_m$ from the GMM of each existing phenotype, and it makes sense to use this GMM as the reference distribution for $S'_m$.

The problem now is how to generate a reference dataset from the GMM, because $S'_m$ is combined with the new image $E$ to form the dataset $F$, and the model of $E$ is unavailable. We therefore have to deal with $S'_m$ and $E$ separately. We propose to solve this problem by generating reference data from $S'_m$ (GMM reference distribution) and $E$ (uniform reference distribution) separately and combining the two sets, i.e. substituting $Supp(F) = Supp(E \cup S'_m)$ with

$$P\_Supp(F) = Supp(E) \cup Supp(S'_m) \qquad (5)$$

We discuss this strategy in more detail in [Additional file 1], and supply a figure as [Additional file 6] to illustrate the motivation and innovation of our strategy.

The gap statistics method repeatedly carries out clustering using a set of candidate cluster numbers, and picks the number supplying the best within-cluster dispersion as the estimated cluster number. The GMM is an accurate model for $S'_m$, and using the GMM as the reference distribution avoids the risk of splitting biologically meaningful clusters and retains the biological properties of existing phenotypes. The selection of K and B controls the number of clustering operations to be carried out and greatly influences the complexity of the whole method. An effective way of utilizing existing phenotypes can greatly reduce K and make the gap statistics method more suitable for online phenotype discovery in high-throughput image-based screens.
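The procedure of this section can be sketched in a few dozen lines of Python. The sketch is a simplified illustration under stated assumptions rather than the implementation used in this study: `k_medoids` is a basic alternating k-medoids routine standing in for PAM [31], the uniform part of the reference set is drawn from the per-feature range of $E$, `draw_model` is a hypothetical callable that returns one sample from the GMM of the existing phenotype, and all function names are ours.

```python
import math
import random

def sqdist(p, q):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_medoids(points, k, rng, iters=20):
    """Alternating k-medoids: a simplified stand-in for PAM, not PAM itself."""
    medoids = rng.sample(points, k)
    for _ in range(iters):
        labels = [min(range(k), key=lambda r: sqdist(p, medoids[r])) for p in points]
        updated = []
        for r in range(k):
            members = [p for p, lab in zip(points, labels) if lab == r]
            if members:
                # New medoid: the member with the smallest total distance to the rest.
                updated.append(min(members, key=lambda c: sum(sqdist(c, p) for p in members)))
            else:
                updated.append(medoids[r])
        if updated == medoids:
            break
        medoids = updated
    return labels

def log_dispersion(points, labels):
    """log(W_k); D_r / (2 n_r) is computed as squared distance to the cluster mean."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    w = 0.0
    for members in clusters.values():
        mean = [sum(col) / len(members) for col in zip(*members)]
        w += sum(sqdist(p, mean) for p in members)
    return math.log(max(w, 1e-12))  # guard against log(0) on degenerate data

def mixed_reference(draw_model, n_model, new_cells, rng):
    """One reference set: model draws for S'_m plus uniform draws over the range of E."""
    lo = [min(col) for col in zip(*new_cells)]
    hi = [max(col) for col in zip(*new_cells)]
    uniform = [[rng.uniform(a, b) for a, b in zip(lo, hi)] for _ in new_cells]
    return [draw_model(rng) for _ in range(n_model)] + uniform

def select_k(gaps, s):
    """Smallest k (1-based) with Gap(k) - (Gap(k+1) - s_{k+1}) >= 0, else K."""
    for k in range(1, len(gaps)):
        if gaps[k - 1] - (gaps[k] - s[k]) >= 0:
            return k
    return len(gaps)

def estimate_k(data, make_reference, rng, k_max=5, n_ref=20):
    """Gap-statistic estimate of the cluster number in data (K = k_max, B = n_ref)."""
    refs = [make_reference(rng) for _ in range(n_ref)]
    gaps, s = [], []
    for k in range(1, k_max + 1):
        ref_logs = [log_dispersion(ref, k_medoids(ref, k, rng)) for ref in refs]
        mean = sum(ref_logs) / n_ref
        sd = math.sqrt(sum((x - mean) ** 2 for x in ref_logs) / n_ref)
        gaps.append(mean - log_dispersion(data, k_medoids(data, k, rng)))
        s.append(sd * math.sqrt(1 + 1 / n_ref))
    return select_k(gaps, s)
```

A call would look like `estimate_k(s_prime + new_cells, lambda r: mixed_reference(draw_model, len(s_prime), new_cells, r), random.Random(0))`. The cost is dominated by the K × (B + 1) clustering runs, which is why a small K matters for online use.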

Cluster definition and merging

We use Partitioning Around Medoids (PAM; also known as K-medoids) [31] to cluster the combined set $F$. PAM offers flexibility and robustness in choosing suitable dissimilarity measurements for different applications [32], and it is more efficient than fuzzy clustering methods, especially in our case, where clustering is carried out frequently.

After clustering, we get a non-overlapping partition of $F$,

$$F = \bigcup_{m=1}^{K'} F_m, \qquad \text{s.t.}\ \forall m, n \in \{1, 2, \ldots, K'\},\ m \neq n:\ F_m \cap F_n = \emptyset,$$

where the cluster number K' is determined through gap statistics, and we adjust the cluster labels to make sure that $F_1$ includes the largest ratio of samples from the existing phenotype $S'_m$. The merging operation is done according to this partition. The datasets $S'_m$, m ∈ {1, 2, ..., K0}, are combined with $E$ one by one, and in each loop the overlapping part of $S'_m$ and $E$ (if any) is located in $F_1$, deleted from $E$ and included as part of the existing cluster:

$$E \leftarrow E - (F_1 \cap E), \qquad S_m \leftarrow S_m \cup (F_1 \cap E) \qquad (6)$$

After all merging loops, the clusters left in $E$ are defined as new phenotypes.
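The set operations of one merging loop can be written down directly. The following Python sketch is illustrative only: the function name `merge_step` and the representation of the partition (a list of cluster labels for the samples of $S'_m$ and a dictionary from hypothetical cell identifiers in $E$ to cluster labels) are our assumptions, and the outlier safeguards are not included.

```python
from collections import Counter

def merge_step(model_labels, cell_labels):
    """One merging loop for phenotype m, following equation (6).

    model_labels: cluster label of each sample in S'_m
    cell_labels:  dict mapping each cell id in E to its cluster label
    Returns (f1, merged, remaining): the label chosen as F_1, the cells
    of F_1 that belong to E (to be added to S_m), and the cells left in E.
    """
    # F_1 is the cluster that holds the largest share of S'_m.
    f1 = Counter(model_labels).most_common(1)[0][0]
    merged = {cell for cell, lab in cell_labels.items() if lab == f1}
    remaining = set(cell_labels) - merged
    return f1, merged, remaining

# Three of four model samples fall in cluster 0, so F_1 is cluster 0:
# cell "a" is merged into S_m, cells "b" and "c" stay in E.
f1, merged, remaining = merge_step([0, 0, 0, 1], {"a": 0, "b": 1, "c": 1})
```

Computing $F_1 \cap E$ once and deriving both updates from it avoids the ambiguity of applying the two assignments of equation (6) one after the other.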

The above merging strategy is based on the theoretical case of $F_1 \supseteq S'_m$, but in reality, when we consider the random sample set $S'_m$ drawn from the GMM of an existing phenotype, it is possible that some samples lie far from the centre of the existing phenotype by chance and are thus assigned to different clusters from the majority of their counterparts in $S'_m$. We adopt two strategies to protect the merging operation from the influence of outliers.

(1) A merging operation only happens when some cells in $E$ are assigned to $F_1$ together with more than 95% of the samples in $S'_m$. Such cells are considered candidates for the merging operation.

(2) For each merging candidate in $E$, we carry out a statistical test with Bonferroni correction. We calculate the p value of each candidate with respect to the GMM for $S'_m$, i.e. the probability of obtaining a value at least as extreme as this candidate under the GMM. The corrected p value for each candidate is defined as its p value with respect to the GMM of $S'_m$ divided by the number of existing phenotypes K0. If the corrected p value is lower than 0.05/K0, the merging operation is rejected and we keep that candidate in $E$.

In the merging loops, we focus on identifying samples of existing phenotypes from $E$. As some cells of $E$ are merged into existing phenotypes, novel clusters gradually stand out.
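The two safeguards can be illustrated with a small Python sketch. Several choices here are assumptions rather than details given in the text: the p value is approximated by Monte Carlo as the share of model draws whose density does not exceed that of the candidate (one way to read "at least as extreme"), the mixture is one-dimensional for brevity, and the decision is coded as a standard Bonferroni test that rejects merging when the p value falls below `alpha / k0`, with `alpha` left as a parameter. All function names are hypothetical.

```python
import math
import random

def gmm_pdf(x, weights, means, sds):
    """Density of a one-dimensional Gaussian mixture at x."""
    return sum(
        w * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
        for w, m, s in zip(weights, means, sds)
    )

def mc_p_value(x, weights, means, sds, rng, n_draws=5000):
    """Monte Carlo p value: share of model draws whose density is <= that of x."""
    fx = gmm_pdf(x, weights, means, sds)
    hits = 0
    for _ in range(n_draws):
        t = rng.choices(range(len(weights)), weights=weights)[0]
        y = rng.gauss(means[t], sds[t])
        if gmm_pdf(y, weights, means, sds) <= fx:
            hits += 1
    return hits / n_draws

def enough_overlap(model_labels, f1, min_share=0.95):
    """Strategy (1): F_1 must hold at least min_share of the samples of S'_m."""
    return sum(lab == f1 for lab in model_labels) / len(model_labels) >= min_share

def accept_merge(p_value, k0, alpha=0.05):
    """Strategy (2), read as a Bonferroni test: reject merging when p < alpha / K0."""
    return p_value >= alpha / k0

# A candidate at the centre of a standard normal model is kept for merging;
# a candidate ten standard deviations away is rejected.
rng = random.Random(0)
p_centre = mc_p_value(0.0, [1.0], [0.0], [1.0], rng)
p_far = mc_p_value(10.0, [1.0], [0.0], [1.0], rng)
decisions = (accept_merge(p_centre, k0=4), accept_merge(p_far, k0=4))
```

The overlap check runs once per merging loop, whereas the p value test runs once per candidate cell, so the number of Monte Carlo draws is the main cost to tune.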

Authors' contributions

NP and CB conceived the biological hypothesis, while STCW and XZ conceived the bioinformatics part of the study. ZY and XZ designed the framework of online phenotype discovery and performed tests. CB and colleagues carried out genome-scale RNAi screens, helped to set up the initial phenotype database and validated the final results. FL developed the image processing methods. All authors contributed to writing this paper, and have read and approved the final version of this manuscript.

Additional material


Additional file 1

Supplementary materials on performance validation and algorithm details. Methods of simulating real cells using polygons are noted. Documents and results for two more validation experiments are provided, and two published datasets using different organisms were involved in these experiments. More details of the four existing phenotypes used in the main text are presented, including the histogram of a specific feature and key parameters of the estimated GMM. Our improvement on the gap statistics method and one-class SVM novelty detection are also discussed in detail.
[http://www.biomedcentral.com/content/supplementary/1471-2105-9-264-S1.doc]

Additional file 2

Performance comparison on restoring biologically meaningful clusters from a published high-throughput screen dataset. These two histograms report the comparisons of the ability to restore biologically meaningful pheno-clusters between our method and the SVM-based method. The comparison was carried out on a published high-throughput screen dataset based on the Drosophila BG-2 cell line, and results using two different groups of existing phenotypes are presented separately.
[http://www.biomedcentral.com/content/supplementary/1471-2105-9-264-S2.tiff]

Additional file 3

Typical images and information for datasets of four cell cycle phases in HeLa cells. In this figure, typical images and some information from a published dataset of HeLa cells are summarized. This dataset consists of single-channel fluorescent images of HeLa nuclei in four cell cycle phases, and it was used to illustrate the prospect of applying our method to datasets from various organisms.
[http://www.biomedcentral.com/content/supplementary/1471-2105-9-264-S3.tiff]

Additional file 4

Performance comparison on cell cycle phase identification using the HeLa dataset. These three histograms report performance comparisons between our method and the SVM-based method. The comparisons were carried out on the HeLa dataset described in Additional file 3, and the results using three different groups of existing phenotypes are presented separately.
[http://www.biomedcentral.com/content/supplementary/1471-2105-9-264-S4.tiff]

Additional file 5

Information on four existing phenotypes for cases 1–4: histogram for major axis length and complete model parameters. This figure extends the information in Figure 7 of the main text. The histogram for major axis length helps to show the necessity of modelling each morphological feature using a GMM, and the parameters of the estimated models are also available.
[http://www.biomedcentral.com/content/supplementary/1471-2105-9-264-S5.tiff]


Acknowledgements

The authors would like to thank the research members of The Methodist Hospital Research Institute, Radiology Department, and colleagues in the Department of Genetics at Harvard Medical School for technical support and comments. CB is a Fellow of the Leukemia and Lymphoma Society. NP is an Investigator of the Howard Hughes Medical Institute. The research is funded by The Methodist Hospital Research Institute (Stephen T.C. Wong).

References

1. Perrimon N, Mathey-Prevot B: Applications of high-throughput RNAi screens to problems in cell and developmental biology. Genetics 2007, 175:7-16.

2. Friedman A, Perrimon N: A functional genomic RNAi screen for novel regulators of RTK/ERK signaling. Nature 2006, 444:230-234.

3. Zhou X, Liu KY, Bradley P, Perrimon N, Wong STC: Towards automated cellular image segmentation for RNAi genome-wide screening. Lecture Notes in Computer Science (MICCAI 2005) 3749:885-892.

4. Xiong G, Zhou X, Ji L, Bradley P, Perrimon N, Wong STC: Automated segmentation of Drosophila RNAi fluorescence cellular images using deformable models. IEEE Transactions on Circuit and Systems 2006, 53:2415-2424.

5. Li FH, Zhou X, Wong STC: An automated feedback system with the hybrid model of scoring and classification for solving over-segmentation problems in RNAi high content screening. Journal of Microscopy 2007, 226(2):121-132.

6. Yan P, Zhou X, Shah M, Wong STC: Automatic segmentation of RNAi fluorescent cellular images with interaction model. IEEE Transactions on Information Technology in Biomedicine 2008, 12(1):109-117.

7. Bakal C, Aach J, Church G, Perrimon N: Quantitative morphological signatures define local signaling networks regulating cell morphology. Science 2007, 316:1753-1756.

8. Perlman Z, Slack M, Feng Y, Mitchison T, Wu L, Altschuler S: Multidimensional drug profiling by automated microscopy. Science 2004, 306:1194-1198.

9. Huang K, Velliste M, Murphy RF: Feature reduction for improved recognition of subcellular location patterns in fluorescence microscope images. Proceedings of SPIE 2003, 4692:307-318.

10. Yong D, Bender A, Hoyt J, McWhinnie E, Chirn G, Tao CY, Tallarico J, Labow M, Jenkins J, Mitchison T, Feng Y: Integrating high-content screening and ligand-target prediction to identify mechanism of action. Nature Chemical Biology 2008, 4(1):59-68.

11. Loo L, Wu L, Altschuler S: Image based multivariate profiling of drug responses from single cells. Nature Methods 2007, 4(5):445-453.

12. Tibshirani R, Walther G, Hastie T: Estimating the number of clusters in a dataset via the gap statistic. Journal of the Royal Statistical Society 2001, 32(2):411-423.

13. Sugar C, James G: Finding the number of clusters in a dataset: An information-theoretic approach. Journal of the American Statistical Association 2003, 98(463):750-763.

14. Yan M, Ye K: Determining the number of clusters using the weighted gap statistics. Biometrics 2007, 63:1031-1037.

15. Dudoit S, Fridlyand J: A prediction-based resampling method to estimate the number of clusters in a dataset. Genome Biology 2002, 3(7):research0036.1-0036.21.

16. Tibshirani R, Walther G: Cluster validation by prediction strength. Journal of Computational & Graphical Statistics 2005, 14(3):511-528.

17. Guo P, Chen P, Lyu M: Cluster number selection for a small set of samples using the Bayesian Ying-Yang model. IEEE Transactions on Neural Networks 2002, 13(3):757-763.

18. Gangnon R, Clayton M: Cluster detection using Bayes factors from over-parameterized cluster models. Environmental and Ecological Statistics 2007, 14:69-82.

19. Bickel D: Robust cluster analysis of microarray gene expression data with the number of clusters determined biologically. Bioinformatics 2003, 19(7):818-824.

20. Schölkopf B, Platt J, Shawe-Taylor J, Smola A, Williamson R: Estimating the support of a high dimensional distribution. Neural Computation 2001, 13:1443-1471.

21. Zhou X, Wang X: Optimisation of Gaussian mixture model for satellite image classification. IEE Proceedings - Vision, Image and Signal Process 2006, 153(3):349-356.

22. Drosophila RNAi Screening Center (DRSC) at Harvard Medical School [http://www.flyrnai.org]

23. Wang J, Zhou X, Bradley PL, Perrimon N, Wong STC: Cellular phenotype recognition for high-content RNAi genome-wide screening. Journal of Molecular Screening 2008, 13(1):29-39.

24. Li FH, Zhou X, Zhu J, Ma J, Huang X, Wong STC: High content image analysis for human H4 neuroglioma cells exposed to CuO nanoparticles. BMC Biotechnology 2007, 7:66. (9 October 2007)

25. Manjunath BS, Ma WY: Texture features for browsing and retrieval of image data. IEEE Transactions on Pattern Analysis and Machine Intelligence 1996, 18:837-842.

26. Cohen A, Daubechies I, Feauveau JC: Bi-orthogonal bases of compactly supported wavelets. Communications on Pure and Applied Mathematics 1992, 45:485-560.

27. Zernike F: Beugungstheorie des Schneidenverfahrens und seiner verbesserten Form, der Phasenkontrastmethode. Physica 1934, 1:689-704.

28. Haralick RM, Shanmugam K, Dinstein I: Textural features for image classification. IEEE Transactions on Systems, Man and Cybernetics 1973, 6:610-620.

29. Mitra P, Murthy CA, Pal S: Unsupervised feature selection using feature similarity. IEEE Transactions on Pattern Analysis and Machine Intelligence 2002, 24(3):301-312.

30. Koch W: Coordinating ERK/MAPK signalling through scaffolds and inhibitors. Nature Reviews Molecular Cell Biology 2005, 6(11):827-838.

31. Kaufman L, Rousseeuw P: Finding groups in data: an introduction to cluster analysis. Wiley, New York; 1990.

32. Thalamuth A, Mukhopadhyay I, Zheng X, Tseng G: Evaluation and comparison of gene clustering methods in microarray analysis. Bioinformatics 2006, 22(19):2405-2412.

Additional file 6

Improving the strategy of taking reference datasets for gap statistics: motivation and innovation. This figure illustrates why we have to modify the strategy of taking reference datasets in the context of online phenotype discovery and how we work it out. The limitations of the existing method, as well as the idea of our improvement, are illustrated.
[http://www.biomedcentral.com/content/supplementary/1471-2105-9-264-S6.tiff]
