Paper Title (use style: paper title)€¦  · Web viewNowadays, large amounts of raw...

9
A meta-analysis portal for human breast cancer transcriptomics data: BreastCancerVis Petronela Buiga Faculty of Biology, Medicine and Health University of Manchester Manchester, UK [email protected] ester.ac.uk Jamie Soul Faculty of Biology, Medicine and Health University of Manchester Manchester, UK [email protected] Jean-Marc Schwartz Faculty of Biology, Medicine and Health University of Manchester Manchester, UK jean- [email protected] XXX-X-XXXX-XXXX-X/XX/$XX.00 ©20XX IEEE

Transcript of Paper Title (use style: paper title)€¦  · Web viewNowadays, large amounts of raw...

Page 1: Paper Title (use style: paper title)€¦  · Web viewNowadays, large amounts of raw transcriptomics data are deposited in specialized public repositories, such as ArrayExpress and

A meta-analysis portal for human breast cancer transcriptomics data: BreastCancerVis

Petronela BuigaFaculty of Biology, Medicine and Health

University of ManchesterManchester, UK

[email protected]

Jamie SoulFaculty of Biology, Medicine and Health

University of ManchesterManchester, UK

[email protected]

Jean-Marc SchwartzFaculty of Biology, Medicine and Health

University of ManchesterManchester, UK

[email protected]

XXX-X-XXXX-XXXX-X/XX/$XX.00 ©20XX IEEE

Page 2: Paper Title (use style: paper title)€¦  · Web viewNowadays, large amounts of raw transcriptomics data are deposited in specialized public repositories, such as ArrayExpress and

Abstract— Breast cancer is a major disease posing many therapeutic and societal challenges. The major problem remains the diversity of the disease. Breast cancers are

divided in several distinct subtypes, but the specific mechanisms involved in each subtype are not understood. We apply advanced network medicine techniques to the

analysis of transcriptomics data of several breast cancer cell lines. We show that different subtypes are characterized by highly heterogeneous pathway activity. Using the overlap between differentially expressed genes and drug-induced

expression changes, we present candidate drugs for repurposing. Furthermore, we find different active

subnetworks across the subtypes, regulated by distinct transcription factors. All these results are made available via

a user-friendly portal at http://phenome.manchester.ac.uk/breastcancer/ .

Keywords—Breast cancer, transcriptomics, network medicine, pathways, differential expression, web portal.

I. INTRODUCTION

Breast cancer is one of the most common types of cancer among women and is therefore a leading cause of cancer-related death. Approximately one in eight women in the Western world develops breast cancer throughout her life [1]. Breast cancer is a heterogeneous group of various disease subtypes, each with its own clinical and biological features [2]. Based on gene expression profiling, studies have presented widely applied molecular classification of breast cancers consisting of three main subtypes: luminal, basal-like and HER2-positive [3]. To this classification can also be added rare subtypes of breast cancer such as claudin-low [4].

The transcriptome in cancer contains RNA-based alterations including changes in messenger RNAs. Based on these changes, gene transcripts can be used to identify different breast cancer subtypes. Gene expression profiling enables the identification of gene signatures to predict prognosis and guide the use of adjuvant therapy [5]. Gene expression profiling or transcriptomics is beneficial for clinical decision-making in breast cancer treatment [6]. In clinical practice, data from genes, transcripts and proteins are used for cancer monitoring starting from diagnosis to development of novel therapies [7]. Moreover, transcriptomics-based analyses produced an improved prediction of breast cancer survival when compared to standard systems based on histological, physical and clinical criteria such as patient age, tumour size and grade, or axillary lymph nodes affected [5, 8].

Daemen et al. (2013) examined approaches for predictive marker development by evaluating the responses of 70 breast cancer cell lines to 90 compounds. Two independent machine learning methods were used to identify pretreatment molecular features associated with responses to drugs within the cell line panel. The researchers identified correlations between the molecular features and drug responses by applying machine learning-based methods. They also developed a software package to predict compound efficacy in individual tumours based on their omics features [9].

Nowadays, large amounts of raw transcriptomics data are deposited in specialized public repositories, such as ArrayExpress and GEO, which offer the option to reanalyse the transcriptomics data and meta-data [10, 11]. The analysis of existing datasets generally concludes in a list of differentially expressed genes, which are then functionally annotated [12, 13].

The field of network medicine offers new opportunities to explore the molecular relationships between cellular components involved in diseases [14, 15]. Using these approaches, new disease-related genes, drug targets and biomarkers can be identified [16]. Here, we aim to reanalyse some of the existing breast cancer transcriptomics data using state-of-the-art network-based techniques, in order to better characterize the pathways and cellular processes that distinguish the different subtypes. We developed a web portal to permit exploration and comparisons of publicly available breast cancer transcriptomics data in order to distinguish the different subtypes and reveal their underlying molecular profiles. The BreastCancerVis portal is an easy to use tool for bench scientists which does not require the download of raw data. It enables these data to be examined and the similarities or differences between various breast cancer subtypes to be investigated.

II. METHODS

RNA-seq fastq files and meta-data were downloaded from SRA [17]. Using the Ensembl transcriptome reference (release 79), pseudo-alignment and quantification of reads was performed with Kallisto [18]. The RNA-seq data was processed according to the following steps: MultiQC to generate summary reports of the FastQC read statistics and Kallisto mapping logs for quality control [19]; Tximport to summarise the mapped transcript level counts to gene level counts [20]; sva to perform batch effect correction with automatic identification of the number of surrogate variables [21]. Outliers were identified using Principal Component Analysis (PCA) plotting and removed from subsequent analysis. Fold changes and p-values of all genes and in all comparisons were calculated using DESeq2 [22] with Benjamini-Hochberg correction for multiple testing. The surrogate variables were incorporated as co-variants in DESeq2 to correct for the identified batch effects.

These outcomes were further analysed in a downstream pipeline consisting of: differentially expressed gene identification with an absolute fold-change of 1.5 and an adjusted p-value threshold of 0.05; Gene Ontology biological process (GOBP) enrichment using GoSeq [23] with an adjusted p-value threshold of 0.05; redundancy reduction using the Revigo algorithm with the Resnik semantic similarity threshold set to 0.4 [24]; pathway enrichment using a previously described reduced set of pathways [25].

The STRING database was used to reconstruct the human protein-protein interaction network, filtered using an edge confidence threshold of >400, then removing text-mining derived edges and keeping the largest connected component [26]. The database RcisTarget was used to identify putative transcription factor motifs [27]. Active

Page 3: Paper Title (use style: paper title)€¦  · Web viewNowadays, large amounts of raw transcriptomics data are deposited in specialized public repositories, such as ArrayExpress and

subnetworks were computed using the GIGA algorithm [28]. Amigo was used to investigate major functions of the subnetworks.

We used the LINCS L1000 perturbation database, accessed with the L1000CDS2 API, to find overlaps between the gene expression signatures and the observed differential expression to identify potential drugs that could reverse observed dysregulated functions in the cancer subtypes [29]. Networks were visualized using the yWorks yEd graph editor. All these analyses were made available on a web portal at http://phenome.manchester.ac.uk/breastcancer/ . The website was implemented using the R Shiny library [30].

III. RESULTS

Our main aim was to analyse data from different subtypes of breast cancer in order to identify the differences that distinguish these subtypes. We used an existing breast cancer transcriptomic dataset consisting of 56 samples [9] and processed it using an in-depth transcriptomic analysis pipeline. The raw data analysed through this pipeline generated several expression response profiles with quality control and PCA, differential expression and comprehensive downstream analysis comprising of pathways, GO terms, drug enrichment and active sub-networks (Figure 1). The dataset was human and RNA-Seq based.

Fig. 1: Schematic diagram of the BreastCancerVis analysis pipeline.

A. Differentially expressed genes (DEG)The initial step of the analysis determined the similarities

of gene expression profiles in the different breast cancer subtypes in order to assess relations between them. In the DEG analysis, we included various conditions indicating comparisons of non-malignant breast cancer with other subtypes such as: Luminal, Claudin-low, Basal and Unknown.

PCA was used to visualise overall relations between gene expression profiles. Figure 2A shows the separation of the samples by the top two principal components explaining the variation after any batch effect correction. It shows a good discrimination between the three main cancer subtypes (Basal, Claudin-low and Luminal).

We continued by comparing DEGs in different subtypes of breast cancer in order to find whether they are distinct or common to several subtypes (Figure 2B). We observed that the breast cancer subtypes have a different genomic signature, even though they belong to the same cancer type. The smallest overlap is observed between Claudin-low and Unknown, which only have 25 DEGs in common. A lot of DEGs are specific to one subtype; the highest number of specific DEGs is observed in the Luminal subtype (3290).

These results emphasize the importance of distinguishing between the various subtypes and not analysing breast cancer as a single, homogeneous disease.

Fig. 2: A. Principal Component Analysis of 56 breast cancer gene expression samples showing different subtypes of the disease. B. Venn diagram of differentially expressed genes in the breast cancer subtypes. NMU (Non-malignant vs. Unknown); BNM (Basal vs. Non-malignant); LNM (Luminal vs. Non-malignant); CLNM (Claudin-low vs. Non-malignant).

B. Gene Ontology (GO)The gene ontology is a standardized vocabulary used to

characterize the functions of genes [31]. We used the biological process aspect, which describes the biological functions that the genes contribute to. We listed the DEGs in all the conditions to find the set of GO terms that were enriched. The most enriched GO terms across all the conditions were: “cornification” (6E-23), “skin development” (3E-16), “epithelial cell differentiation“ (3E-13), “extracellular matrix organization” (8E-13), “epidermis development” (2E-10), “regulation of cell migration” (3E-8) and “blood vessel development” (7E-8). These functions are all related to tumour development, however this mode of analysis does not clearly discriminate between functions enriched in different subtypes.

In order to better visualize the commonalities and the differences between the cancer subtypes we represent these results in the form of a bipartite network, where one subset of nodes are conditions and the others are GO terms; each GO term is connected to a condition if the p value of the GO term in the condition is below 0.05 (Figure 3). This visualisation shows that there are GO terms common to more than one condition: “regulation of cell adhesion”, “peptide cross-linking” and “extracellular matrix organization” are present in five conditions; “hemidesmosome assembly”, “tube development” and “regulation of cell migration” in conditions; “response to wounding”, “epithelial cell proliferation”, “ossification”, “regulation of receptor activity”, “epithelial cell differentiation”, “intermediate filament-based process”, “response to estrogen”, “gland morphogenesis” and “blood vessel development” in three conditions.

Conversely, there are GO terms associated with a single subtype of breast cancer, for example: “inflammatory response” in Claudin-low vs Basal, “cell fate commitment” in Basal vs Non-malignant, “cell adhesion molecule production” in Luminal vs Non-malignant. The condition with the highest number of unique interactions is Claudin-low vs Basal (20 unique GO terms). No enriched GO term was associated with Unknown vs Non-malignant and Basal vs Unknown conditions.

Page 4: Paper Title (use style: paper title)€¦  · Web viewNowadays, large amounts of raw transcriptomics data are deposited in specialized public repositories, such as ArrayExpress and

Fig. 3: Bipartite network of conditions and enriched GO terms. A link between a condition and a GO term means that this GO term is enriched with a p value below 0.05. Blue nodes represent the GO terms and red nodes represent the conditions.

C. PathwaysPathways are grouping of proteins contributing to defined

cellular processes. Pathway analysis is widely used to give a description of activated cellular functions in certain conditions [32]. For a more precise characterisation we used a reduced set of pathways presented in [25], since existing pathway databases contain redundancies.

We listed the pathways that contained a significant number of DEGs in each condition. We found differentially expressed pathways (DEP) in 7 out of 10 conditions; no DEP was found in Basal vs Non-malignant, Basal vs Unknown and Non-malignant vs Unknown.

The most enriched DEPs were: “beta1 integrin cell surface interactions” (6E-6), “collagen biosynthesis and modifying enzymes” (5E-5), “hematopoietic stem cell differentiation” (1E-4), “nuclear receptors meta pathway” (2E-4), “cytokine receptor interaction” (5E-4) and “vitamin D receptor pathway” (8E-4).

To better visualize the commonalities and the differences between the cancer subtypes we again represented these results in the form of a bipartite network, where one subset of nodes are conditions and the others are pathways (Figure 4). There are few pathways shared between several conditions, for example “beta1 integrin cell surface interactions”, “type i hemidesmosome assembly” and “gpcr signaling g alpha s epac and erk”. Conversely, more pathways tend to be unique to one condition, for example in Luminal vs Unknown: “ncam1 interactions”, “tnf signaling pathway”, “jak stat core”, “senescence associated secretory phenotype sasp”, “peptide ligand binding receptors” and “validated transcriptional targets of ap1 family members fra1 and fra2”. These results show that pathways are more efficient at discriminating between the different cancer subtypes than GO terms.

Fig. 4: Bipartite network of conditions and pathways. A link between a condition and a pathway means that this pathway is enriched with a p value below 0.05. Yellow nodes represent the pathways and red nodes represent the conditions.

D. Drug enrichmentWe compared the gene expression responses of drugs

with the set of DEGs in order to discover new candidates for drug repurposing [30]. We used the L1000CDS2 database to search for significant overlaps between drug responsive genes and DEGs. We focused on the reverse drug response, which means that suitable drug candidates should be able to decrease the expression of overexpressed genes and vice-versa. Some drugs are among the top candidates in several conditions, for example: Phorbol-12-myristate-13-acetate (PMA) appears in 7 out of 10 conditions; Mitoxantrone, Ingenol 3, 20-dibenzoate, Mitomycin C, CPT 11 and Kinetin riboside appear in 4 out of 10 conditions. Some drugs appear more specifically suited to a particular breast cancer subtype, for example: Prostratin and Ingenol 3, 20-dibenzoate appear in 3 out of 4 comparisons of Luminal vs other subtypes; Mitomycin C appears in 3 out of 4 comparisons of Claudin-low vs other subtypes.

E. Active subnetworksBiological processes are the result of interactions

between multiple proteins; these interactions can be represented using graphs where the edges represent different types of relations, such as physical protein-protein interactions, co-expression and common functional annotations [28]. Therefore, it is relevant to combine DEGs with an interaction network to identify “active” subnetworks that show a significant number of connected DEGs. We used the STRING database and applied the GIGA algorithm to find significantly enriched subnetworks for each condition. These subnetworks point towards processes that are unique to each condition. The number of active subnetworks was relatively consistent among all conditions, ranging from 23 to 30. The maximum size of active subnetworks ranged from 13 to 19.

Page 5: Paper Title (use style: paper title)€¦  · Web viewNowadays, large amounts of raw transcriptomics data are deposited in specialized public repositories, such as ArrayExpress and

F. Transcription factorsTFs regulate the expression of genes, therefore they give

important clues on the control mechanism of dysregulated processes in various diseases. We used the RcisTarget database to find putative binding targets among the DEGs and we looked for the overlap between the targets of the TFs and the DEGs in each condition [27]. The highest number of enriched motifs was found in the Claudin-low vs Unknown condition (120).

G. Integration of active subnetworks with transcription factorsSubnetworks represent dysregulated processes and TFs

are potential controllers of these processes. Examples of integration results are shown in Figure 5. For the Luminal vs Non-malignant condition, Figure 5A shows a subnetwork of genes mainly involved in the G-protein coupled receptor signalling pathway, which are regulated by RAD21, PURA, DDX53, ZFP3 and SMAD3. The same TFs regulate the subnetwork shown in Figure 5B, which is involved in establishment of localization and cellular component organization. For the Claudin-low vs Non-malignant condition, Figure 5C shows a subnetwork mainly involved in extracellular matrix organisation and tissue development, which are regulated by TP53, PIK3C3, ZNF480 and VDR. The same TFs regulate the subnetwork shown in Figure 5D, which is involved in the immune response and signal transduction.

Fig. 5: Integration of active subnetworks and transcription factors for two conditions: A-B. Luminal vs Non-malignant; C-D. Claudin-low vs Non-malignant. Green nodes represent differentially expressed genes and other colours represent transcription factors.

IV. DISCUSSION

We based our analysis on a breast cancer transcriptomics dataset published in [9], which investigated the association between expression, copy number and methylation data. These were measured by correlations at the cell line and gene levels. Machine learning was used to train a predictor of the response of different breast cancer subtypes to different drugs based on multiple pretreatment data. However,

correlations are not necessarily an indication of a causal relationship. Network approaches enable us to move towards a more mechanistic understanding of complex diseases [14]. In particular, by integrating omics dataset with graph-based representations, it is possible to reveal new associations between genes, which are not only based on similarity of expression profiles but also on physically validated interactions.

Although breast cancer is a widely investigated disease, it is still challenging to treat due to its heterogeneity. Better understanding of the dysregulated genes, pathways and functions in different subtypes of breast cancer is necessary to lead to more personalized therapies [33]. We compared five different breast cancer subtypes and showed that this brings about original knowledge about differences between the subtypes. Our pathway analysis revealed several pathways associated to cancer such as: “direct p53 effectors”, “apoptotic cleavage of cellular proteins”, “inflammatory response” and “gpcr signalling”. In particular, we showed that there is a higher diversity among the pathways associated to different subtypes than previously thought. Few DEPs were shared across several subtypes and the majority of DEPs were specific to a single subtype, enabling better discrimination than when DEGs are considered individually. This observation points towards more specific mechanisms for the different subtypes.

Mechanisms can be further understood by looking at active subnetworks and the associated TFs. We found multiple active subnetworks in all conditions consisting of up to 30 DEGs. In some cases, we were able to identify TFs with strong regulatory associations to active subnetworks. For example, TP53 is a well-known tumour suppressor which is involved in the regulation of multiple biological processes associated to cancer [34]; SMAD3 was shown to be involved in both endometrial and breast cancer [35]; DDX53 is known to act as an oncogene and to regulate responses to several anti-cancer drugs [36]. In addition, the overlap between active subnetworks and known drug responses provides valuable new information about potential therapies.

Given the wealth of available omics data on breast cancer, this study should be extended to multiple datasets in the future in order to get a more comprehensive coverage of breast cancer variety. Moreover, experimental conditions and instrumentation used to collect transcriptomics data are highly variable, making it difficult to compare different studies. Therefore, strict normalization techniques are required to align different datasets and make them comparable, which often requires re-running the whole analysis using the raw data. This can only be achieved with the development of suitable bioinformatics tools to enable fast and semi-automated processing of large amounts of data.

Another issue that limits the re-use of such datasets is that they are often presented in formats that are challenging for use by non-bioinformaticians. For this reason, it is important to provide user-friendly interfaces to explore differential expression patterns and network analysis results in multiple conditions, so that they will be easily accessible

Page 6: Paper Title (use style: paper title)€¦  · Web viewNowadays, large amounts of raw transcriptomics data are deposited in specialized public repositories, such as ArrayExpress and

to clinicians and biologists. BreastCancerVis is a unique, user-friendly web portal developed to fulfil these needs; it reduces the time needed to explore and visualize complex analyses and also offers new types of network-based analyses not found in conventional tools. In the future, we will enrich this portal by incorporating multiple breast cancer transcriptomics datasets to offer comprehensive coverage of breast cancer diversity.

ACKNOWLEDGMENTS

JS and JMS were supported by the European Community’s Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 602300 (SYBIL). PB was supported by a studentship funded by Weizmann UK and by the Faculty of Biology, Medicine and Health, University of Manchester.

REFERENCES

[1] Spanhol F, Oliveira L, Petitjean C, Heutte L: A Dataset for Breast Cancer Histopathological Image Classification. IEEE Transactions on Biomedical Engineering 2015:1-1.

[2] Cancer Genome Atlas N: Comprehensive molecular portraits of human breast tumours. Nature 2012, 490(7418):61-70.

[3] Weigelt B, Horlings HM, Kreike B, Hayes MM, Hauptmann M, Wessels LFA, de Jong D, Van de Vijver MJ, Veer LJVt, Peterse JL: Refinement of breast cancer classification by molecular characterization of histological special types. J Pathol 2008, 216(2):141-150.

[4] Prat A, Parker JS, Karginova O, Fan C, Livasy C, Herschkowitz JI, He X, Perou CM: Phenotypic and molecular characterization of the claudin-low intrinsic subtype of breast cancer. Breast Cancer Res 2010, 12(5):R68.

[5] Kwa M, Makris A, Esteva FJ: Clinical utility of gene-expression signatures in early stage breast cancer. Nat Rev Clin Oncol 2017, 14(10):595-610.

[6] Rhodes DR, Chinnaiyan AM: Integrative analysis of the cancer transcriptome. Nat Genet 2005, 37 Suppl:S31-37.

[7] Jameson J. L., Longo DL: Precision Medicine - Personalized, Problematic, and Promising. N Engl J Med 2015.

[8] van de Vijver MJ, He YD, van’t Veer L, Dai H, Hart A, Voskuil D, Schreiber G, Petrse J, Glas A, Delahaye L et al: A gene-expression signature as a predictor of survival in breast cancer. N Engl J Med 2002, 347(25).

[9] Daemen A, Griffith OL, Heiser LM, Wang NJ, Enache OM, Sanborn Z, Pepin F, Durinck S, Korkola JE, Griffith M et al: Modeling precision treatment of breast cancer. Genome Biol 2013, 14(10).

[10] Kolesnikov N, Hastings E, Keays M, Melnichuk O, Tang YA, Williams E, Dylag M, Kurbatova N, Brandizi M, Burdett T et al: ArrayExpress update -simplifying data submissions. Nucleic Acids Res 2015, 43(Database issue):D1113-1116.

[11] Edgar R, Domrachev M, Lash AE: Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res 2002, 30(1):207-210.

[12] Blumenberg M: Profiling and metaanalysis of epidermal keratinocytes responses to epidermal growth factor. BMC Genomics 2013, 14:85.

[13] Newton R, Wernisch L: Investigating inter-chromosomal regulatory relationships through a comprehensive meta-analysis of matched copy number and transcriptomics data sets. BMC Genomics 2015, 16:967.

[14] Barabasi AL, Gulbahce N, Loscalzo J: Network medicine: a network-based approach to human disease. Nat Rev Genet 2011, 12(1):56-68.

[15] Csermely P, Korcsmaros T, Kiss HJ, London G, Nussinov R: Structure and dynamics of molecular networks: a novel paradigm of drug discovery: a comprehensive review. Pharmacol Ther 2013, 138(3):333-408.

[16] Soul J, Hardingham TE, Boot-Handford RP, Schwartz J-M: PhenomeExpress: A refined network analysis of expression datasets by inclusion of known disease phenotypes. Sci Rep 2015, 5:8117.

[17] Leinonen R, Sugawara H, Shumway M, International Nucleotide Sequence Database C: The sequence read archive. Nucleic Acids Res 2011, 39(Database issue):D19-21.

[18] Bray NL, Pimentel H, Melsted PA-O, Pachter L: Near-optimal probabilistic RNA-seq quantification. Nat Biotechnol, 34, 525–527, 34:525-527.

[19] Ewels P, Magnusson M, Lundin S, Kaller M: MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics, 32:3047–3048.

[20] Soneson C, Love MI, Robinson MD: Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences. F1000Research, 4.

[21] Leek JT, Johnson WE, Parker HS, Jaffe AE, Storey JD: The sva package for removing batch effects and other unwanted variation in high-throughput experiments. Bioinformatics 2012, 28(6):882-883.

[22] Love MI, Huber W, Anders S: Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol 2014, 15(12):550.

[23] Young MD, Wakefield MJ, Smyth GK, Oshlack A: Gene ontology analysis for RNA-seq: accounting for selection bias. Genome Biol 2010, 11(2):R14.

[24] Supek F, Bosnjak M., Skunca N, Smuc T: REVIGO summarizes and visualizes long lists of gene ontology terms. PLoS One, 6.

[25] Stoney R, Robertson DL, Nenadic G, Schwartz JM: Mapping biological process relationships and disease perturbations within a pathway network. NPJ Syst Biol Appl 2018, 4:22.

[26] Mering Cv, Huynen M, Jaeggi D, Schmidt S, Bork P, Snel B: STRING: a database of predicted functional associations between proteins. Nucleic Acids Research 2003, 31(1):258-261.

[27] Aibar S, Gonzalez-Blas CB, Moerman T, Huynh-Thu VA, Imrichova H, Hulselmans G, Rambow F, Marine JC, Geurts P, Aerts J et al: SCENIC: single-cell regulatory network inference and clustering. Nat Methods 2017, 14(11):1083-1086.

[28] Breitling R, Amtmann A, Herzyk P: Graph-based iterative Group Analysis enhances microarray interpretation. BMC Bioinformatics 2004, 5:100.

[29] Duan Q, Reid SP, Clark NR, Wang Z, Fernandez NF, Rouillard AD, Readhead B, Tritsch SR, Hodos R, Hafner M et al: L1000CDS(2): LINCS L1000 characteristic direction signatures search engine. NPJ Syst Biol Appl 2016, 2.

[30] Team RC: R: A language and environment for statistical computing. R Foundation for Statistical Computing. R Foundation for Statistical Computing 2013.

[31] Consortium TGO: Expansion of the Gene Ontology knowledgebase and resources. Nucleic Acids Res 2017(1362-4962 (Electronic)).

[32] Schwartz J, Gaugain C, Nacher JC, Daruvar A, Kanehisa M: Observing metabolic functions at the genome scale. Genome Biology 2007, 8(6).

[33] Buckley N, Boyle D, McArt D, Irwin D, Harkin DP, Lioe T, McQuaid S, James JA, Maxwell P, Hamilton P et al: Molecular classi cation of non-invasive breast lesions for personalised therapy and chemoprevention. Oncotarget 2015, 6(41):43244-43254.

[34] Tian K, Rajendran R, Doddananjaiah M, Krstic-Demonacos M, Schwartz J-M: Dynamics of DNA Damage Induced Pathways to Cancer. PLoS ONE 2013, 8(9):e72303.

[35] Baxter E, Windloch K, Kelly G, Lee JS, Gannon F, Brennan DJ: Molecular basis of distinct oestrogen responses in endometrial and breast cancer. Endocr Relat Cancer 2018, 25(11).

[36] Kim Y, Yeon M, Jeoung D: DDX53 Regulates Cancer Stem Cell-Like Properties by Binding to SOX-2. Mol Cells 2017, 40(5):322-330.