


  • 396 Int. J. Computational Biology and Drug Design, Vol. 1, No. 4, 2008

    Copyright © 2008 Inderscience Enterprises Ltd.

    A divide-and-conquer strategy to solve the out-of-memory problem of processing thousands of Affymetrix microarrays

    Chia-Ju Lee Computational Biology and Bioinformatics Program, Northwestern University, 2145 Sheridan Rd., Evanston, IL 60208, USA E-mail: [email protected]

    Dong Fu and Pan Du Center for Biomedical Informatics, and Robert H. Lurie Comprehensive Cancer Center, Feinberg School of Medicine, Northwestern University, 676 North St. Clair, Suite 1200, Chicago, IL 60611, USA E-mail: [email protected] E-mail: [email protected]

    Hongmei Jiang Department of Statistics, Northwestern University, 2006 Sheridan Rd., Evanston, IL 60208, USA E-mail: [email protected]

    Simon M. Lin* and Warren Kibbe Center for Biomedical Informatics, and Robert H. Lurie Comprehensive Cancer Center, Feinberg School of Medicine, Northwestern University, 676 North St. Clair, Suite 1200, Chicago, IL 60611, USA E-mail: [email protected] E-mail: [email protected] *Corresponding author

    Abstract: The out-of-memory problem is frequently encountered when processing thousands of CEL files using Bioconductor. We propose a divide-and-conquer strategy combined with randomised resampling to solve this problem. The CAMDA 2007 META-analysis data set, which contains 5896 CEL files, was used to test the approach on a typical commodity computer cluster by running established pre-processing algorithms for Affymetrix arrays in the Bioconductor package. The results were validated against a golden standard obtained by using a supercomputer. In addition to the performance


    improvement, the general divide-and-conquer strategy can be applied to any other normalisation algorithms without modifying the underlying implementation.

    Keywords: divide-and-conquer; resampling; out-of-memory; microarray; Affymetrix arrays; R/Bioconductor; supercomputer; computer cluster.

    Reference to this paper should be made as follows. Lee, C-J., Fu, D., Du, P., Jiang, H., Lin, S.M. and Kibbe, W. (2008) ‘A divide-and-conquer strategy to solve the out-of-memory problem of processing thousands of Affymetrix microarrays’, Int. J. Computational Biology and Drug Design, Vol. 1, No. 4, pp.396–405.

    Biographical notes: Chia-Ju Lee is currently an MS candidate in Computational Biology and Bioinformatics at Northwestern University, USA.

    Dong Fu is a senior system administrator at the Biomedical Informatics Center of Northwestern University (NUBIC). He received an MS Degree in Computer Science from Loyola University Chicago and a BS Degree in Computer Studies from Northwestern University while working as a system administrator for various departments at Northwestern University. Prior to that, he did undergraduate work at Wabash College and finished with BA Degrees in Biology, Chemistry and Economics. He has keen interests in the computational side of biological science and healthcare in general. Recently, he has also been involved in various cancer Biomedical Informatics Grid (caBIG) activities.

    Pan Du is a Research Assistant Professor at the Biomedical Informatics Center of Northwestern University. He completed his PhD Degree in Bioinformatics and Computational Biology at Iowa State University, and his MS and BS Degrees in Electrical Engineering at the National University of Defense Technology, China. His research interests include microarray and proteomics data analysis, next-generation sequencing and systems biology.

    Hongmei Jiang received her PhD in Statistics from Purdue University. She is currently an Assistant Professor in the Department of Statistics at Northwestern University. Her research interests include multiple comparisons and multiple testing procedures, microarray data analysis, statistical genetics and genomics, and experimental design.

    Simon M. Lin currently is the Director of Bioinformatics Consulting for the Robert H. Lurie Comprehensive Cancer Center at Northwestern University. He completed his MD Degree in Medical Informatics and MS Degree in Computational Chemistry. He was the author of Microarray Data Analysis (Springer, volume I to IV) and a winner of the Apple Cluster Computing Award for Bioinformatics in 2004. His research is on the topic of translational medicine and bioinformatics, with an emphasis on genetics and information technology. In particular, he has been working on high throughput biology, which includes next generation sequencing, microarrays, and proteomics.

    Warren Kibbe is a Research Associate Professor and the Director of Bioinformatics for the Robert H. Lurie Comprehensive Cancer Center and the Center for Genetic Medicine at Northwestern University, and is the Associate Director of the Northwestern University Biomedical Informatics Center. He received his PhD in Chemistry from Caltech, and was a visiting scientist at the Max Planck Institute in Göttingen, Germany before joining the faculty at Northwestern. He is a co-founder of the OBO Foundry Disease Ontology


    (http://diseaseontology.sourceforge.net) and has worked with BRIDG and CDISC to build semantically computable domain analysis models for the conduct of clinical trials.

    1 Introduction

    Extremely large amounts of microarray data have been produced and accumulated. As of October 2007, 173,486 arrays had been deposited into the NCBI GEO database (NCBI GEO, 2007) and 97,635 arrays into ArrayExpress (EMBL-EBI ArrayExpress, 2007). The databases of these two largest repositories for microarray data have grown exponentially each year (Barrett et al., 2007; Parkinson et al., 2007). It is anticipated that more information can be discovered from large-scale experiments, or large compilations of experiments, than from small experiments with just a handful of arrays. Moreover, owing to the decreased cost and increased credibility of microarray technology, it is not unrealistic to expect a large-scale experiment using thousands of arrays in the near future.

    Processing a large volume of microarray data requires tremendous computing resources, far beyond the capacity of most research labs. A frequently encountered issue is the out-of-memory problem when processing thousands of CEL files generated by the Affymetrix platform using Bioconductor. R/Bioconductor is one of the most frequently used computational packages for microarray data analysis (Gentleman et al., 2004). It includes a variety of open-source libraries to address background correction, normalisation, expression index summarisation, and quality assessment. However, R, by design, is not efficient at handling large data sets. Moreover, most Bioconductor libraries were developed with test data sets of only dozens of microarrays.

    One of the most frequently encountered problems of using R/Bioconductor to analyse a large data set is the error message: “…Error: cannot allocate vector of size xxx,xxx Kb….” Consequently, similar questions have been repeatedly asked and discussed at the R and Bioconductor Mailing List (BML, 2007). For instance, the vector allocation problem has been discussed 175 times at the BML between January 2007 and September 2007. Current attempts to solve this problem include:

    • To increase the memory allocation that can be used by R and to increase the virtual memory. However, this solution is limited by the 4 GB theoretical memory addressing capability of 32-bit computing. It is further constrained by the fact that certain operating systems (such as 32-bit Windows) impose a 3 GB memory limit on individual applications.

    • To increase memory usage efficiency by rewriting the library in C. For instance, the most popular method, RMA, was re-implemented in C as ‘just.rma’. However, it is still limited by the maximum memory addressing capability of 32-bit computing. In addition, rewriting every emerging method is not a scalable solution.


    • To redesign the memory-usage architecture of the affy library. This effort is represented by the new BufferedMatrix library written by Ben Bolstad (http://bmbolstad.com/software/) to swap the memory with the disk storage. Individual libraries have to be written to take advantage of the BufferedMatrix architecture.

    • To use aroma.affymetrix, an R package developed by Henrik Bengtsson for analysing large Affymetrix data sets (http://groups.google.com/group/aroma-affymetrix/). It can process an essentially unlimited number of Affymetrix arrays, using on-file and in-memory caching to avoid the out-of-memory problem.

    • Lastly, users can purchase or use native 64-bit hardware, install native 64-bit operating system and then run the 64-bit version of the R to get around the memory allocation issue temporarily.

    Thus, we were interested in finding the computational limit of R/Bioconductor and aimed to find a general solution to the out-of-memory problem that does not require rewriting individual R/Bioconductor libraries. We proposed a divide-and-conquer strategy combined with randomised resampling to solve this problem. The solution was inspired by resampling methods in computational statistics, which rely on one’s own resources to solve the problem.

    2 Results and discussion

    2.1 A supercomputer with 1 TB of memory failed to run the affy library with 5896 arrays

    To illustrate the computational challenge, the E-TABM-185 data set was tested on a supercomputer at the University of Illinois’ National Center for Supercomputing Applications (NCSA), one of the five original centres in the National Science Foundation’s Supercomputer Centers Program. The SGI Altix computer at NCSA has 1 TB (= 1000 GB) of physical memory with modern 64-bit CPUs (Itanium 2) and a 64-bit OS (SuSE Linux). The computational resources for normal users are a maximum of 24 processes per job and a maximum of 256 GB of memory per process.

    A 64-bit version of R was recompiled on this machine. However, the out-of-memory problem was still encountered when loading the entire set of 5896 arrays in one single batch. Debugging of the process suggested that the limitation was caused by the design of the affy library in C, a result of a 32-bit programming model. To find the computational limit, we kept decreasing the number of CEL files until the RMA function could run. We found that a maximum of 4235 CEL files could be loaded on the supercomputer for the RMA function. The peak memory usage was about 64 GB, and the running time was around 6 h.

    This kind of supercomputer is not universally accessible to most small labs. The most common computational resource is a 32-bit or 64-bit Unix cluster with 64 to 128 CPUs and up to 16 GB of memory. Although the theoretical addressing capability of a 64-bit CPU is 18 exabytes, most commodity 64-bit computers have a hardware design limit of 16 to 32 GB.


    2.2 The divide-and-conquer approach to large data set

    Divide-and-conquer is a classical solution to many large computational problems. In many bioinformatics applications, the use of parallel computing technologies to divide and conquer tasks is a common approach where large sequential data sets are analysed (Trelles, 2001). A number of packages to enable parallel computing in R have been developed, with different levels of skill and additional resources required to use them (Vera et al., 2008). In the case of microarray data analysis, a simple divide-and-conquer will not generate an optimal answer, because the final result is highly dependent on the initial order of the splitting. To solve this problem, a Monte Carlo method is used to resample the initial condition many times. Note that a calculation using the full combination of initial conditions is usually infeasible. For instance, taking 40 arrays at a time out of 6000 arrays generates about 1.4 × 10^103 possible combinations.
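    The size of this combinatorial space can be checked directly. A quick sanity check (in Python rather than the authors’ R, purely illustrative):

```python
import math

# Number of ways to choose one subset of 40 arrays out of 6000:
n_combinations = math.comb(6000, 40)

# The exact count has 104 digits, i.e. it is on the order of
# 1.4 * 10^103, so enumerating every initial condition is infeasible.
print(f"{n_combinations:.1e}")
```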

    The RMA function in the affy library was used to illustrate and test the divide-and-conquer approach. RMA is a procedure developed by Irizarry et al. (2003) for analysing Affymetrix data. It includes background correction (mixture model), normalisation (quantile) and summarisation (median polish), all in one function.

    We designed a library in R/Bioconductor called divide-and-conquer RMA (dcRMA). The workflow of dcRMA is shown in Figure 1. The algorithm is as follows:

    • Before loading CEL files, randomly shuffle the order of the CEL files

    • Sequentially divide the shuffled CEL files into small subsets of S CEL files each

    • Repeat steps 1 and 2 B times

    • Dispatch the small subsets to N CPUs to run the RMA function

    • Average the gene expression values for each gene of each CEL file across the different subsets.
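    The steps above can be sketched as follows (a minimal Python sketch, not the authors’ R library; `rma` here is a hypothetical stand-in that returns one value per file rather than the real per-gene RMA summaries, and dispatch to N CPUs is omitted for brevity):

```python
import random
from collections import defaultdict

def dc_rma(cel_files, S, B, rma=lambda subset: {f: 1.0 for f in subset}):
    """Sketch of the dcRMA workflow: shuffle, split into subsets of S
    files, repeat B times, run `rma` on each subset, average per file."""
    sums, counts = defaultdict(float), defaultdict(int)
    for _ in range(B):                          # steps 1-3: B rounds
        files = list(cel_files)
        random.shuffle(files)                   # step 1: shuffle order
        files += files[:(-len(files)) % S]      # recycle to fill subsets
        subsets = [files[i:i + S] for i in range(0, len(files), S)]
        for subset in subsets:                  # step 4: RMA per subset
            for f, value in rma(subset).items():
                sums[f] += value
                counts[f] += 1
    # step 5: average each file's expression values across subsets
    return {f: sums[f] / counts[f] for f in cel_files}
```

    In the real dcRMA, step 4 is farmed out to N CPUs and the RMA call returns a full gene-by-array matrix; the averaging logic is unchanged.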

    Figure 1 Workflow of dcRMA

    2.3 Running dcRMA on data set of E-TABM-185

    We used dcRMA to run the 4235 CEL files of the E-TABM-185 data set with S = 30 CEL files, B = 50 resampling times and N = 6 available CPUs. The running time was around 20 h and the memory usage was about 800 MB. Compared with the implementation of RMA on 4235 CEL files as a single batch on a supercomputer, for which the run time was around 6 h and the peak memory usage was about 64 GB, much less memory was


    required by dcRMA. The result implies that, by using dcRMA, we can solve the out-of-memory problem for large data sets with an increased yet acceptable running time.

    2.4 Validation of dcRMA results against the golden standard

    The data set of 4235 CEL files was used as a single batch to run RMA on the supercomputer, and the resulting gene expression values were used as the golden standard against which the dcRMA result was validated.

    We used the correlation coefficient to verify the similarity between the dcRMA result and the golden standard. For each CEL file, a correlation coefficient was calculated between the summarised gene expression values resulting from dcRMA and those of the golden standard. To investigate the effect of the number of resampling times, this calculation was repeated for each dcRMA result under different resampling times (B = 1 to 50). Thus, 4235 correlation coefficients were obtained for each comparison between a dcRMA result and the golden standard. Boxplots of the distribution of correlation coefficients for the different resampling times are presented in Figure 2. We can see a highly correlated relationship between the summarised gene expression values from dcRMA and the golden standard. The boxplots for each B = 1 to 50 in Figure 2 also show that the correlation increased as the number of resampling times increased. In addition, a pattern of convergence can be observed in the zoomed boxplots of the correlation coefficients in Figure 3. The results of dcRMA converged at B = 14 resampling times.
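    The per-array validation can be sketched as follows (illustrative Python, assuming each array’s summarised expression values are held as a plain numeric vector keyed by array name):

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def per_array_correlations(dcrma, golden):
    """One correlation per array between the dcRMA expression matrix and
    the golden standard; both map an array name to its gene-value vector."""
    return {name: pearson(dcrma[name], golden[name]) for name in golden}
```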

    Figure 2 The boxplots of the distribution of correlation coefficients between the summarised gene expression values resulting from dcRMA and the golden standard under different resampling times (B = 1 to 50)


    Figure 3 The zoomed boxplots of correlation coefficient in Figure 2. A pattern of convergence can be observed when B = 14

    Because most machine learning methods, whether unsupervised clustering or supervised models, rely on distance-based metrics, we investigated whether the small discrepancy between dcRMA and the golden standard makes any practical difference. The inter-sample distance matrices were calculated for the golden standard and for each dcRMA run of B = 1 to 50. The distance between the distance matrix for each B and that of the golden standard was then calculated for comparison. The result is shown in Figure 4. We can see that the distance of the inter-sample distance matrix between each B and the golden standard decreased as B increased. As with the correlation coefficient, a pattern of convergence can be observed at B = 30.
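    This comparison can be sketched as follows (illustrative Python; the paper does not name the exact metrics, so Euclidean inter-sample distances and a Frobenius-norm difference between distance matrices are assumed here):

```python
import math

def distance_matrix(samples):
    """Pairwise Euclidean distances between sample expression vectors."""
    n = len(samples)
    return [[math.dist(samples[i], samples[j]) for j in range(n)]
            for i in range(n)]

def matrix_distance(d1, d2):
    """Frobenius-norm difference between two distance matrices
    (one plausible 'distance of distance matrices')."""
    return math.sqrt(sum((a - b) ** 2
                         for r1, r2 in zip(d1, d2)
                         for a, b in zip(r1, r2)))
```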

    Figure 4 The distances of inter-sample distance matrices between the golden standard and each dcRMA run of B = 1 to 50

    2.5 Comparison with aroma.affymetrix

    We tried to use aroma.affymetrix to preprocess the whole E-TABM-185 data set. However, due to a file format problem, only 1542 CEL files could be preprocessed successfully. The running time was around 24 h and the peak memory usage was about


    600 MB. Compared with the implementation of RMA on 4235 CEL files as a single batch on a supercomputer, for which the run time was around 6 h and the peak memory usage was about 64 GB, much less memory was required by aroma.affymetrix. This showed that the out-of-memory problem can be solved by using aroma.affymetrix. However, the run time for the whole set of 5896 CEL files was expected to be seven days, which is much longer than our dcRMA implementation. This is because dcRMA takes full advantage of the multi-CPU (multi-core) architecture, while the traditional aroma.affymetrix and affy packages are single-threaded.

    3 Concluding remarks

    Out-of-memory is one of the most frequently encountered problems when using R/Bioconductor on large data sets. In this study, the computational limit of R/Bioconductor for processing thousands of CEL files generated by the Affymetrix platform was found: a maximum of 4235 CEL files could be loaded on the supercomputer for the RMA function. This study also illustrates empirically that dcRMA, a divide-and-conquer approach combined with randomised resampling, is a simple and efficient algorithm for solving the out-of-memory problem of processing thousands of Affymetrix microarrays. It works by randomly breaking a large data set down into many small subsets, each small enough to be processed directly in the memory of a typical commodity computer. The solutions to the subsets are then combined to give a solution for the original bulk of data. The divide-and-conquer strategy is proposed as a general solution for running any established Affymetrix pre-processing algorithm in the Bioconductor package on a typical commodity computer cluster (32-bit and 1 GB of memory for each CPU).

    Flexibility is another feature of dcRMA. Different values of the parameters B, S and N can be set to fit the available computing capacity. The choice of S and N is usually based on the maximum number of CEL files that can be processed given the memory limitation and on the number of available CPUs, respectively. As for B, we suggest that 30 to 50 is enough to ensure that convergence is achieved. This is based on the evidence that, with the 4235 CEL files from the E-TABM-185 data set, running dcRMA with 50 resampling times and a divided sample size of 30 generated nearly identical summarised gene expression values compared with the golden standard RMA. Moreover, convergence could be observed around 14 resampling times if the correlation coefficient was used for comparison and 30 if the distance of distance matrices was used. This rule for choosing parameters may serve as a reference for users who apply dcRMA to data sets of different sizes in practical settings. After the preprocessing steps, high-level analyses such as identifying differentially expressed genes and co-expression gene clusters can be conducted accordingly.

    4 Materials and methods

    4.1 Data set

    The Critical Assessment of Microarray Data Analysis (CAMDA) 2007 META-analysis data set E-TABM-185 (CAMDA, 2007) was used to test the computational limit on


    a supercomputer and to test the divide-and-conquer strategy. The E-TABM-185 data set is offered as a challenge for large-scale data analysis. The data set contains 5896 CEL files of diseased and normal human tissue samples and cell lines collected from ArrayExpress and NCBI GEO (Table 1). All the samples were hybridised to the Affymetrix GeneChip Human Genome HG-U133A.

    Table 1 Composition of data set E-TABM-185

    Type of sample                        No. of arrays

    Cell lines                                     1142
    Normal tissues and cell types                  1278
    Diseases and syndromes                         3476
      Leukemia                                      748
      Breast cancer                                 807
      Other cancers                                 954
      Non-cancer diseases and syndromes             967
    Total                                          5896

    4.2 The divide-and-conquer algorithm

    dcRMA: dividing steps

    A resampling method is used to generate small subsets of CEL files. Variable A is the total number of CEL files in the large data set, variable B is the number of resampling times, and variable S is the number of CEL files in a subset. The number of resampling times should be reasonably large to capture as many CEL file combinations as possible. The choice of S is usually based on the maximum number of CEL files that can be processed given the current memory limitation. The dividing steps are performed iteratively: first, randomly shuffle the order of the CEL files; then, sequentially divide the CEL files into small subsets of S CEL files. In case the total number of CEL files in the data set (A) cannot be divided exactly by S, files are recycled from the beginning of the shuffled list to ensure that each subset has the same number of CEL files (S).
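    One round of the dividing step described above can be sketched as (illustrative Python, not the authors’ R code):

```python
import random

def divide(cel_files, S, rng=random):
    """One dividing round: shuffle the file order, then split it into
    consecutive subsets of exactly S files, recycling files from the
    start of the shuffled list when A is not a multiple of S."""
    files = list(cel_files)
    rng.shuffle(files)
    deficit = (-len(files)) % S         # files short of a full last subset
    files += files[:deficit]            # cycle from the beginning
    return [files[i:i + S] for i in range(0, len(files), S)]
```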

    dcRMA: generating a job queue

    Each subset of CEL files is defined as a Job. There will be a total of B × (A/S, rounded up to the next integer) Jobs. All of these Jobs are saved in the Job Queue and wait to be dispatched to an available CPU for the RMA analysis. The allocation of each Job to a different CPU was accomplished by the fork package, which handles distributed processes. Variable N is the number of available CPUs. Once a Job is assigned to a CPU, it is deleted from the Job Queue.
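    Building the Job Queue can be sketched as follows (illustrative Python; the real implementation is in R with the fork package, and Jobs are popped off the queue as CPUs become free):

```python
import random
from collections import deque

def build_job_queue(cel_files, S, B):
    """Queue of B * ceil(A/S) Jobs, each a subset of S CEL files waiting
    to be dispatched to a free CPU for RMA. The dividing step (shuffle,
    split, recycle) is inlined here for self-containment."""
    queue = deque()
    for _ in range(B):
        files = list(cel_files)
        random.shuffle(files)
        files += files[:(-len(files)) % S]   # recycle so subsets are full
        queue.extend(files[i:i + S] for i in range(0, len(files), S))
    return queue
```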

    dcRMA: summarisation of expression measures of each CEL file

    Since the number of resampling times is reasonably large, each CEL file is likely to be analysed many times in different Jobs. Thus, the normalised values of the S CEL files in one Job are saved separately. A matrix is then created for each CEL file to merge and average the


    normalised gene expression values from the different Jobs. This summarised value is taken as the final output, representing the gene expression measures of each CEL file.
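    The summarisation step can be sketched as (illustrative Python; `job_results` is assumed to hold each Job’s output as CEL file → gene → normalised value):

```python
from collections import defaultdict

def summarise(job_results):
    """Average, per CEL file and per gene, the normalised expression
    values produced by the different Jobs in which a file appeared."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(lambda: defaultdict(int))
    for result in job_results:
        for cel, genes in result.items():
            for gene, value in genes.items():
                sums[cel][gene] += value
                counts[cel][gene] += 1
    return {cel: {g: sums[cel][g] / counts[cel][g] for g in sums[cel]}
            for cel in sums}
```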

    4.3 Software to implement dcRMA

    R code implementing dcRMA is attached as supplementary material.

    Acknowledgement

    We thank Henrik Bengtsson at Lund University, Sweden for consultation on aroma.affymetrix.

    References

    Barrett, T., Troup, D.B., Wilhite, S.E., Ledoux, P., Rudnev, D., Evangelista, C., Kim, I.F., Soboleva, A., Tomashevsky, M. and Edgar, R. (2007) ‘NCBI GEO: mining tens of millions of expression profiles – database and tools update’, Nucleic Acids Research, Vol. 35, pp.D760–D765.

    BML (2007) Bioconductor Mailing List, Obtained through the Internet, http://www. bioconductor.org/docs/mailList.html [accessed 1/1/2007-1/9/2007].

    CAMDA (2007) Critical Assessment of Microarray Data Analysis, Obtained through the Internet, http://camda.bioinfo.cipf.es/camda07/call_for_papers/index.html [accessed 13/12/2007].

    EMBL-EBI ArrayExpress (2007) European Molecular Biology Laboratory-European Bioinformatics Institute, ArrayExpress, Obtained through the Internet, http://www.ebi.ac.uk/microarray-as/aer/?#ae-main [accessed 10/1/2007].

    Gentleman, R.C., Carey, V.J., Bates, D.M., Bolstad, B., Dettling, M., Dudoit, S., Ellis, B., Gautier, L., Ge, Y., Gentry, J., Hornik, K., Hothorn, T., Huber, W., Iacus, S., Irizarry, R., Leisch, F., Li, C., Maechler, M., Rossini, A.J., Sawitzki, G., Smith, C., Smyth, G., Tierney, L., Yang, J. and Zhang, J. (2004) ‘Bioconductor: open software development for computational biology and bioinformatics’, Genome Biology, Vol. 5, No. 10, Article R80.

    Irizarry, R.A., Bolstad, B.M., Collin, F., Cope, L.M., Hobbs, B. and Speed, T.P. (2003) ‘Summaries of Affymetrix GeneChip probe level data’, Nucleic Acids Research, Vol. 31, No. 4, p.e15.

    NCBI GEO (2007) National Center for Biotechnology Information, Gene Expression Omnibus, Obtained through the Internet, http://www.ncbi.nlm.nih.gov/geo/ [accessed 10/1/2007].

    Parkinson, H., Kapushesky, M., Shojatalab, M., Abeygunawardena, N., Coulson, R., Farne, A., Holloway, E., Kolesnykov, N., Lilja, P., Lukk, M., Mani, R., Rayner, T., Sharma, A., Sarkans, U. and Brazma, A. (2007) ‘ArrayExpress – a public database of microarray experiments and gene expression profiles’, Nucleic Acids Research, Vol. 35, pp.D747–D750.

    Trelles, O. (2001) ‘On the parallelisation of bioinformatics applications’, Briefings in Bioinformatics, Vol. 2, No. 2, pp.181–194.

    Vera, G., Jansen, R.C. and Suppi, R.L. (2008) ‘R/parallel – speeding up bioinformatics analysis with R’, BMC Bioinformatics, doi:10.1186/1471-2105-9-390.