Integrative Network Analysis of Differentially Methylated and Expressed Genes for Biomarker Identification in Leukemia
Robersy Sanchez* and Sally A. Mackenzie*
Departments of Biology and Plant Science, The Pennsylvania State University, University Park, PA 16802.
Running Title: Network Analysis in Leukemia
Corresponding Authors:
Robersy Sanchez
361 Frear North Bldg
Pennsylvania State University
University Park, PA 16802
Email: [email protected]

Sally Mackenzie
362 Frear North Bldg
Pennsylvania State University
University Park, PA 16802
Email: [email protected]

The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license. bioRxiv preprint doi: https://doi.org/10.1101/658948; this version posted June 3, 2019.

Abstract
Genome-wide DNA methylation and gene expression are commonly altered in pediatric acute lymphoblastic leukemia (PALL). Integrated analysis of cytosine methylation and expression datasets has the potential to provide deeper insights into the complex disease states and their causes than individual disconnected analyses. Studies of protein-protein interaction (PPI) networks of differentially methylated (DMGs) and expressed genes (DEGs) showed that gene expression and methylation consistently targeted the same gene pathways associated with cancer: Pathways in cancer, Ras signaling pathway, PI3K-Akt signaling pathway, and Rap1 signaling pathway, among others. Detected gene hubs and hub sub-networks are integrated by signature loci associated with cancer that include, for example, NOTCH1, RAC1, PIK3CD, BCL2, and EGFR. Statistical analysis disclosed a stochastic deterministic dependence between methylation and gene expression within the set of genes simultaneously identified as DEGs and DMGs, where larger values of gene expression changes are probabilistically associated with larger values of methylation changes. Concordance analysis of the overlap between enriched pathways in DEG and DMG datasets revealed statistically significant agreement between gene expression and methylation changes, reflecting a coordinated response of methylation and gene-expression regulatory systems. These results support the identification of reliable and stable biomarkers for cancer diagnosis and prognosis.
Introduction
Network-based modeling approaches have the potential to integrate and improve the perception of complex disease states and their root causes. To date, network analysis provides reliable and cost-effective approaches for early disease detection, prediction of co-occurring diseases and interactions, and drug design 1. Although integrated genomic analyses of methylation and gene expression in leukemia have been performed 2–5, an integration including network analysis of methylation and gene expression is still missing.
Our study investigates protein-protein interaction (PPI) networks, which are exclusively focused on protein-protein associations and resulting cell activities. A PPI network can be defined as a (un)directed graph/network holding vertices as proteins (or protein-encoding genes) and edges as the interactions/associations between them. Associations are meant to be specific and biologically meaningful, i.e., two proteins (genes) are connected by an edge if they jointly contribute to a shared function, which does not necessarily reflect a physical binding interaction.
Within the network, some proteins act as hubs that interact with numerous partners. Biologically, hubs are key elements on which functionality of the cellular process modeled by the network depends. Consequently, it is reasonable to assume that a biomarker suitable to define specific disease states would likely be a hub or a hub regulator within a relevant network. Frequently, more than one interacting network model is possible, with each model carrying a different uncertainty level for the biological process under study. Integration of more than one network model can help to reduce the implicit uncertainty associated with each model prediction 6.
Here, we address the hypothesis that disease-induced DNA methylation changes can serve as a source of reliable and stable biomarkers for cancer diagnosis and prognosis. Toward that aim, aberrant DNA methylation of key genes was reported in Acute Lymphoblastic Leukemia (ALL) 6. We report on a reproducible approach integrating network analysis of DMGs, DEGs and DEGs-DMGs estimated within datasets from patients with pediatric ALL (PALL). Such an integration may provide the basis for robust identification of reliable and stable biomarkers for cancer diagnosis and prognosis.
Results
General methylation features of the study
Differentially methylated positions (DMPs) were estimated for control (four normal CD19+ blood cell donors) and patient (ALL cells from three patients) groups relative to a reference group of four independent normal CD19+ blood cell donors. Inclusion of a reference group permitted the evaluation of natural variation in healthy individuals and reduction of noise in a signal detection step of our methylation analysis pipeline. The distribution of methylation changes at DMPs along the chromosome revealed a genome-wide methylation re-patterning dominated by hypermethylation in PALL patients (Supplementary Fig. S1). Hypomethylated sites are visible in the genome browser after zooming (tracks available in the Supplementary File S1). Consistent with natural methylation variability in the population of healthy individuals, DMPs were observed in the control group as well.
DMGs were estimated from group comparisons of the number of DMPs within gene-body regions between control (CD19+ blood cell donors) and ALL cells based on generalized linear regression. This analysis yielded a total of 4795 DMGs, including protein-coding regions (3338) and non-coding RNA genes (Supplementary Table S1). Of the 2360 (B-Cells) DEGs reported in the original study 7, 1774 were DMGs as well (75.2%, Supplementary Table S2). The gene-body methylation signal detected in PALL patients coincided with a significant number of genes from the list of all cancer consensus genes (723) from the COSMIC database 8: 254 DMGs and 126 DEGs, of which 112 were DEGs-DMGs.
Network analysis on a set of differentially methylated genes (DMGs)
When applying network analysis, not all DMGs and DEGs estimated from the experimental datasets map onto networks. Selecting the subset of the most relevant genes from the experimental dataset that do form networks facilitates further network analysis. The preliminary application of network-based enrichment analysis (NBEA 9) and network enrichment analysis test (NEAT 10) on the set of DMGs permitted the identification of 285 network-related DMGs (Supplementary Tables S1 and S1). Similar analysis permitted the identification of 326 network-related DEGs (Supplementary Table S2-B, from the 2360 B-Cells DEGs reported in Supplementary Table 3 of the original study 7). These subsets were used to build the corresponding protein-protein interaction (PPI) networks with the STRING app of Cytoscape 11,12. Alternatively, to bypass possible bias introduced by the heuristic used to subset the whole set of genes (NBEA 9 and NEAT 10), sub-clusters of hubs were retrieved by applying the MCODE Cytoscape app on the whole set of DMGs.
The PPI network built on the set of 285 DMGs is presented in Supplementary Fig. S2. The analysis with available tools in Cytoscape 11 led to the identification of the main hubs from the PPI network (Fig. 1A and C). Sizes of nodes and labels, as well as their colors, are used for rapid identification of network hubs. Network hubs were confirmed based on betweenness-centrality and node degree 13, such that the size of each node is proportional to its value of betweenness-centrality and the label font size is proportional to its node degree.
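The two hub criteria named above, node degree and betweenness centrality, can be illustrated with a minimal pure-Python sketch. It is not the Cytoscape implementation used in the study: the function name `degree_and_betweenness` is hypothetical, the graph is treated as undirected and unweighted, betweenness is computed with Brandes' algorithm without normalization, and the toy edges are invented for illustration only.

```python
from collections import deque


def degree_and_betweenness(edges):
    """Return (degree, betweenness) dicts for an undirected, unweighted graph."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    degree = {n: len(nbrs) for n, nbrs in adj.items()}
    betweenness = {n: 0.0 for n in adj}

    # Brandes' algorithm: one breadth-first search per source node.
    for s in adj:
        stack = []
        preds = {n: [] for n in adj}
        sigma = {n: 0 for n in adj}   # number of shortest paths from s
        dist = {n: -1 for n in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate pair dependencies from the farthest nodes back to s.
        delta = {n: 0.0 for n in adj}
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                betweenness[w] += delta[w]

    # Each unordered pair of endpoints was visited from both sides.
    for n in betweenness:
        betweenness[n] /= 2.0
    return degree, betweenness


# Toy network; these edges are illustrative and not taken from the study.
toy_edges = [("EGFR", "NOTCH1"), ("EGFR", "RAC1"), ("EGFR", "BCL2"),
             ("EGFR", "PIK3CD"), ("RAC1", "PIK3CD")]
degree, betweenness = degree_and_betweenness(toy_edges)
hubs = sorted(degree, key=lambda g: (betweenness[g], degree[g]), reverse=True)
```

In this toy graph the node that lies on every shortest path between the peripheral nodes ranks first, which is the property that the node-size and label-size encoding in the figures is meant to make visible.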
The main hub subnetworks in Fig. 1A and 1C were identified with the application of K-means clustering on the main networks shown in Supplementary Fig. S2 and S3, respectively, with network centralities measuring Degree, Betweenness-Centrality, Closeness-Centrality, Clustering-Coefficient, and Average-Shortest-Path. Network enrichment analysis of the subnetwork of hubs identified KEGG pathways involved in cancer development (Fig. 1B and D), supporting our findings with analysis of network centralities.
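As a rough illustration of clustering nodes by their centrality profiles, the sketch below runs Lloyd's K-means on a small table of per-node centralities. Several choices are assumptions that the text does not specify: the columns are z-scored before clustering, the initial centers are chosen deterministically by a farthest-point rule, only three centralities are used, and the numbers in `toy_rows` are invented. The function names are hypothetical.

```python
def zscore_columns(rows):
    """Standardize each column to mean 0 and unit (population) standard deviation."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [(sum((x - m) ** 2 for x in c) / len(c)) ** 0.5 or 1.0
           for c, m in zip(cols, means)]
    return [tuple((x - m) / s for x, m, s in zip(r, means, sds)) for r in rows]


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm with deterministic farthest-point initialization."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


# Hypothetical (degree, betweenness, closeness) rows for six nodes:
# two hub-like, two intermediate, two peripheral.
toy_rows = [(30, 0.20, 0.60), (28, 0.18, 0.58),
            (10, 0.05, 0.40), (9, 0.04, 0.41),
            (2, 0.00, 0.20), (1, 0.00, 0.21)]
labels = kmeans(zscore_columns(toy_rows), k=3)
```

Standardizing first keeps a wide-ranging measure such as degree from dominating the distance; whether the original analysis scaled the centralities is not stated here.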
Figure 1. PPI subnetworks of hubs derived from subsets of network-related DMGs. A, main subnetwork of hubs obtained with the application of K-means clustering on the set of 285 network-related DMGs identified with NBEA 9 and NEAT 10 tests. The size of each node is proportional to its value of betweenness centrality and the label font size is proportional to its node degree. Node colors from light-green to red map the discrete scale of logarithm base 2 of fold changes in DMP numbers for the corresponding gene: light-green: [1, 2), cyan: [2, 3), blue: [3, 4), and red: 5 or more. Edge color is based on co-expression index from white (0.042) to red (0.842). B, enrichment analysis with Cytoscape 11 on KEGG pathway sets on network in A. C, main subnetwork of hubs obtained with the application of MCODE Cytoscape app and K-means clustering. D, enrichment analysis with Cytoscape 11 on KEGG pathway sets on the network in C.
K-means clustering split the network of 285 DMGs (Supplementary Fig. S2) into three clusters: i) the main subnetwork of hubs (46 DMGs, shown in Fig. 1A, Supplementary Table S1), ii) a subnetwork with minor hubs (101 DMGs, Supplementary Fig. S4, Supplementary Table S1), and iii) a cluster integrated by two subnetworks (139 DMGs, Supplementary Fig. S4, Supplementary Table S1). Results with MCODE Cytoscape app and K-means were consistent with those obtained by subsetting the whole set of DMGs via NBEA and NEAT 9,10 (Supplementary Fig. S3 and Supplementary Tables S2), with a notable enrichment of KEGG pathways associated with cancer development (Supplementary Fig. S5).
The scatter plots of network centrality measures (Fig. 2) suggest that the main subnetwork of hubs includes the most relevant network nodes/genes (in red) carrying the highest network centrality measurements. We noted a transition from a non-linear behavior, in clusters iii (nodes in blue) and ii (nodes in green), to a linear trend observed in cluster i (red points, Fig. 2). These analyses suggest that the subnetwork of hubs shown in Fig. 1C also involves genes with methylation signals that have a role in PALL development 14.
Figure 2. Scatter plots of network centrality measures. A general non-linear trend is notable for genes/nodes from clusters iii to ii, while the linear trend in cluster i can be visualized. The highest values of network centralities (degree, betweenness, centroid, stress, and radiality) are found in cluster i, which corresponds to the main subnetwork of hubs presented in Fig. 1B (consistent with the lowest values of average-shortest-path-length). Networks from clusters i, ii, and iii are shown in Supplementary Fig. S4.
Results of network enrichment analysis of DMG and DEG PPI networks built with STRING (Cytoscape) are shown in Fig. 3 (Supplementary Tables S1 and S2). The analyses indicate that DMG and DEG datasets targeted many of the same pathways, with an overlap of 80% (Fig. 3C). Pathways linked to cancer development and apoptosis are notable, and KEGG pathways in cancer (hsa05200) showed pronounced enrichment, with more than 50 and 40 genes from the DMG and DEG datasets, respectively.
Figure 3. Network-based enrichment analysis of protein-protein interaction (PPI) networks independently derived from DMGs and DEGs estimated in patients with PALL. A, PPI enriched network of DEGs with 15 or more genes. B, PPI enriched network of DMGs with 20 or more genes. C, Venn diagram with the overlapping of all PPI enriched networks of DMGs and DEGs with 7 or more genes. The PPI enriched network analysis was performed in STRING app on Cytoscape 11,12, and the analysis is limited to KEGG human pathways.
In the case of patients with PALL, enrichment for PI3K-Akt signaling pathway, MAPK signaling pathway, JAK-STAT signaling pathway, Wnt signaling pathway, and Focal adhesion (all included in KEGG pathway in cancer) was statistically significant for both DMG and DEG subsets. The Venn diagram shown in Fig. 3C implies a high level of concordance between the enriched KEGG pathways identified in PPI networks from DEGs and from DMGs.
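The overlap summarized by a two-set Venn diagram can be sketched with plain set operations. The function name `overlap_summary` is hypothetical and the pathway lists below are invented for illustration; the Jaccard index is included as one possible overlap measure, since the text does not state how the reported overlap percentage was computed.

```python
def overlap_summary(deg_pathways, dmg_pathways):
    """Summarize the overlap between two collections of enriched pathway names."""
    deg, dmg = set(deg_pathways), set(dmg_pathways)
    shared = deg & dmg
    union = deg | dmg
    return {
        "shared": sorted(shared),
        "deg_only": sorted(deg - dmg),
        "dmg_only": sorted(dmg - deg),
        # Jaccard index: shared pathways over all distinct pathways.
        "jaccard": len(shared) / len(union) if union else 0.0,
    }


# Illustrative inputs only; not the enrichment results of the study.
toy_deg = ["Pathways in cancer", "PI3K-Akt signaling pathway",
           "MAPK signaling pathway", "Focal adhesion"]
toy_dmg = ["Pathways in cancer", "PI3K-Akt signaling pathway",
           "MAPK signaling pathway", "Wnt signaling pathway"]
summary = overlap_summary(toy_deg, toy_dmg)
```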
Figure 4. Graphical evaluation of the concordance between DEG and DMG enrichments on KEGG pathways. A, scatterplot of pathway ratings (see Eq. 1) from enriched pathways on the set of DMGs ($P_{DMG}$) and DEGs ($P_{DEG}$), respectively. The regression analysis shows the linear trend of the relationship $P_{DMG} > 0$ versus $P_{DEG} > 0$ (black dots). The identity dashed line (in blue) helps in gauging the degree of agreement between measurements 15. Dots in red highlight pathways for which $P_{DMG} = 0$ or $P_{DEG} = 0$. B, Bland-Altman plot of the agreement, on targeting gene pathways, between responses from gene expression and methylation regulatory systems. The agreement between measurements can also be tested by values of Lin's concordance correlation coefficient ($\rho_{cc}$) and the Kendall coefficient of concordance ($\rho_{KC}$).
Figure 4 supports a strong concordance between the enriched KEGG pathways identified in PPI networks from DEGs and from DMGs. Bootstrap Bayesian estimation of Lin's concordance correlation coefficient ($\rho_{cc}$) yielded a value of $\rho_{cc} = 0.71$ with a confidence interval (C.I.) $0.52 \le \rho_{cc} \le 0.84$, and a Kendall coefficient of concordance $\rho_{KC} = 0.81$ (permutation p-value < 0.001). The linear regression analysis presented in Fig. 4A indicates a statistically significant linear relationship between the pathway score ($P_{DMG}$) of enriched KEGG pathways in the DMG PPI network (see definition at equation (1)) and the pathway score ($P_{DEG}$) of enriched KEGG pathways in the DEG PPI network. The proximity of most of the regression points (pairs of pathway scores) to the identity line (dashed line in blue) suggests significant agreement between methylation and gene expression regulatory systems, also indicated by a regression slope of 0.9. This concordance between gene expression and methylation is graphically corroborated by the Bland-Altman plot 15, where almost all the points are located between the mean − 2σ and mean + 2σ horizontal lines (Fig. 4B).
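For readers unfamiliar with the two agreement measures, the following sketch computes a plain point estimate of Lin's concordance correlation coefficient and the Bland-Altman band (mean difference ± 2 standard deviations). It does not reproduce the bootstrap Bayesian estimation reported above; the function names and the toy scores are hypothetical. The coefficient follows Lin's definition, $\rho_{cc} = 2 s_{xy} / (s_x^2 + s_y^2 + (\bar{x} - \bar{y})^2)$, with moments divided by $n$.

```python
from statistics import mean, stdev


def lin_ccc(x, y):
    """Point estimate of Lin's concordance correlation coefficient."""
    n = len(x)
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x) / n
    syy = sum((b - my) ** 2 for b in y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    # Penalizes both scatter around the line and shifts away from the identity line.
    return 2 * sxy / (sxx + syy + (mx - my) ** 2)


def bland_altman_limits(x, y, k=2.0):
    """Return (lower, centre, upper) of the Bland-Altman agreement band."""
    diffs = [a - b for a, b in zip(x, y)]
    centre, spread = mean(diffs), stdev(diffs)
    return centre - k * spread, centre, centre + k * spread


# Hypothetical pathway scores: perfectly correlated but shifted by one unit.
dmg_scores = [1.0, 2.0, 3.0, 4.0]
deg_scores = [2.0, 3.0, 4.0, 5.0]
ccc = lin_ccc(dmg_scores, deg_scores)
limits = bland_altman_limits(dmg_scores, deg_scores)
```

The shifted toy data have a Pearson correlation of 1 but a concordance coefficient below 1, which is why concordance rather than correlation is the relevant measure of agreement with the identity line.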
DEG-DMG network analysis
NBEA and NEAT 9,10 were applied to identify network-related genes from the set of DEGs-DMGs (191 of 1774 genes). The PPI network of 191 DEGs-DMGs is shown in Supplementary Fig. S6 (Supplementary Table S2). Three clusters were detected by applying K-means clustering on the main PPI network of DEGs-DMGs; two of them integrated the subnetworks of hubs shown in Fig. 5B and D, while the third cluster gave rise to several subsets of subnetworks. Enrichments detected in the main PPI network of 191 DEGs-DMGs (Fig. 5A) and in the subnetworks (Fig. 5C and 5E, Supplementary Table S2) are consistent with previous results (Fig. 3): i) only focused on the set of DMGs (not all of them DEGs, Fig. 3A) and ii) only focused on the set of DEGs (not all of them DMGs, Fig. 3B).
Group means of methylation level differences at each gene-body DMP for genes NOTCH1, CD44, and BCL2L1 (hubs from the DMGs-DEGs sub-network from Fig. 5B) are shown in Fig. 6A. NOTCH1 and CD44 have been reported to be epigenetically regulated 16–19 and, in particular, NOTCH1 has been proposed as a drug target for the treatment of T-cell acute lymphoblastic leukemia 17. BCL2L1 is known to have roles in apoptosis and has been proposed as a drug target for cancer treatment 20. Genes involved in activation of the mitogen-activated protein kinase (MAPK) pathway are frequently altered in cancer and have been proposed as drug targets as well 21.
Figure 5. Network enrichment for the network-related DEG-DMGs. A, bar-plots of the enriched KEGG pathways in the PPI network of 191 DEG-DMGs (Supplementary Fig. S6). B and D, subnetworks integrated by gene hubs identified with K-means clustering of the network from panel. C and E, bar-plots of the enriched KEGG pathways on the networks from panels B and D, respectively. In the networks, nodes with the same color belong to the same cluster obtained with K-Medoid clustering. Gene hubs were identified based on betweenness centrality and node degree, such that the size of each node is proportional to its value of betweenness centrality and the label font size is proportional to its node degree. Edge color is based on co-expression index from white (0.042) to red (0.938). The PPI network and the enrichment analyses were performed in STRING app on Cytoscape 11,12.
Figure 6. DEG-DMGs reported in cancer-related gene (oncogene) lists. A, Group mean of methylation level differences at each cytosine identified in differentially methylated genes (DMGs). BCL2L1, CD44, MAP3K1 and NOTCH1 are linked to leukemia and other types of cancers. These genes were identified as “hubs” of PPI networks (Fig. 5B and D). An irregular distribution of the methylation signal, both hyper- and hypomethylated, can be seen. Traditional DMR-based approaches fail to detect these types of variation. Methylation level differences were computed for control and treatment individuals with respect to the normal CD19+ methylome from four independent blood donors used as reference. This approach provides an estimation of the natural variability of methylation changes existing in the control population. B, Overlap (≥500 bp) between the differentially methylated enhancer regions (DMERs) and DEGs-DMGs. Although only 51 enhancers (DMERs) are activators of reported DEGs, the DMERs overlap with 159 DEGs-DMGs regions, of which 23 are reported oncogenes (see Methods). A total of 379 DEGs-DMGs are reported oncogenes.
Three members of the MAPK pathway are found in the DMG-DEG sub-network shown in Fig. 5D, and the DMP distribution on the MAP3K1 gene body is shown in Fig. 6A. In total, 379 identified DEG-DMGs have been reported as cancer-related genes (Fig. 6B).
Differentially methylated enhancer regions (DMERs)
Our initial analysis was limited to the methylation signal carried on gene-body regions. As suggested in Fig. 6, gene-associated methylation signal can also be present on genomic regions upstream and downstream of genes, including transcription enhancer regions 22. Analysis of the methylation datasets identified 325 differentially methylated enhancer regions (DMERs). Although only 51 enhancers from the 325 identified DMERs are activators of reported DEGs (Supplementary Table S2), the list of DEG-DMG regions covered by DMERs (by at least 500 bp) reaches a total of 159 (Fig. 6B), of which 23 were identified as oncogenes.
The top 29 genes with highest density variation of DMP number within enhancer regions are shown in Figure 7. Many of these genes have been reported to be associated with cancer development and were found in the sets of DMGs or DEGs. One example is the enhancer region influencing gene EPIDERMAL GROWTH FACTOR-LIKE DOMAIN 7 (EGFL7) and the micro-RNA MIR-126, both associated with cancer 23,24. As shown in Figure 7B, MIR-126 resides within an intron of EGFL7 and their enhancer regions overlap.
Figure 7. DEGs with differentially methylated enhancer region. A, Top 29 genes with the highest density variation of DMP number (> 1.7 DMPs/kb) in the enhancer region. Bars in dark blue denote genes that have been reportedly associated with cancer development. B, Group mean of methylation level differences at each cytosine identified in differentially methylated enhancer regions corresponding to the genes: SMARCA4, EGFL7, MIR126, NUDT1, and CDK9.
MIR-126 modulates vascular integrity and angiogenesis, and it has been reported that MIR-126 delivered via exosomes from endothelial cells promotes anti-tumor responses 25. The hypomethylation pattern observed in the region spans a substantial part of gene AGPAT2, which was identified as a DMG and, although over-expressed in different types of cancer, was not reported as a DEG in the earlier PALL study 26. AGPAT2 promotes survival and etoposide resistance of cancer cells under hypoxia 27.
Association between methylation and gene expression
Results to date suggest the existence of an association, or at least statistical inter-dependence, between methylation and gene expression. To investigate this association, density variations of the methylation signal were quantitatively expressed by different measurements: density of methylation level difference ($\Delta p_{density}$), density of total variation difference ($\Delta TV_{density}$), and $\Delta HD_{density}$ (see method section for variable descriptions). Gene expression was shown as absolute value of the logarithm base 2 of fold change, $|\log_2 FC|$.
The association between methylation and gene expression for the current study of patients with PALL is shown in Supplementary Fig. S7. This association is corroborated not only by a highly significant Spearman's rank correlation rho (p-value < 0.001, Supplementary Fig. S7), but also by two-dimensional kernel density estimation (2D-KDE) and Farlie-Gumbel-Morgenstern (FGM) copula of joint probability distributions for each annotated pair of variables in the coordinate axes from the contour-plot plane (Supplementary Fig. S7).
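Spearman's rho is the Pearson correlation of the ranks of the two variables, which the sketch below makes explicit in pure Python. Ties receive average ranks. The function names and the toy values are hypothetical, and no p-value is computed.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson product-moment correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented values standing in for a methylation density and |log2 FC|.
toy_density = [0.1, 0.4, 0.2, 0.9]
toy_abs_log2fc = [1.0, 2.5, 1.2, 8.0]
rho = spearman_rho(toy_density, toy_abs_log2fc)
```

Because only ranks enter the calculation, any monotone increasing relationship yields rho = 1 regardless of its shape, which is why the copula analysis is needed to say anything about the form of the dependence.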
Results indicate that methylation and gene expression show positive dependence. Roughly speaking, a bivariate distribution is considered to have a specific positive dependence property if larger values of either random variable are probabilistically associated with larger values of the other random variable 28. According to Lai 29, the FGM copulas shown in Supplementary Fig. S7 indicate CDM and gene expression to be positively quadrant dependent and positively regression dependent. In other words, if $X$ is the density of methylation level difference, the regression $E(Y|X = x)$ is linear in $x$ 29. Thus, the regression of the conditional expected value of gene expression with respect to density variations of methylation signal $X$ is linear in $x$ (possible values of $X$). This linear trend is noticed with high joint probability in the outlined contour-plot red regions (Supplementary Fig. S7).
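The linearity statement can be checked directly in the simplest setting, the FGM copula with uniform margins. The notation below ($U$, $V$, $\theta$) is introduced here for illustration and is not taken from the manuscript; $U$ and $V$ denote the two variables after transformation to the unit interval.

```latex
\begin{align*}
C_\theta(u,v) &= uv\,\bigl[1 + \theta(1-u)(1-v)\bigr], \qquad -1 \le \theta \le 1,\\
c_\theta(u,v) &= \frac{\partial^2 C_\theta}{\partial u\,\partial v}
              = 1 + \theta(1-2u)(1-2v),\\
E(V \mid U = u) &= \int_0^1 v\,c_\theta(u,v)\,dv
              = \frac{1}{2} + \theta(1-2u)\int_0^1 v(1-2v)\,dv
              = \frac{1}{2} + \frac{\theta}{6}\,(2u-1).
\end{align*}
```

The conditional mean is therefore linear in $u$, and it increases with $u$ exactly when $\theta > 0$, which corresponds to the positive regression dependence described above.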
PC-score of DEG-DMGs
The identification of genes playing fundamental roles in cancer progression is limited by the availability of protein-protein interaction information in a database (STRING, in the current case). Consequently, results could be mostly populated with genes from network-associated diseases. To circumvent these possible limitations, principal component analysis (PCA) was applied to score genes according to their discriminatory power to discern the disease state from healthy. PCA was performed on the set of individuals, representing each in the 1775-dimensional space of DEGs-DMGs, where each gene was represented by the density of an information divergence on gene-body, which provides a normalized measurement of the intensity of the methylation signal. Two PC-scores were derived from two information divergences: 1) absolute difference of methylation levels and 2) Hellinger divergence.
The first principal component (PC1) was used to build the PC-scores for DMGs, since it carried 85% of the whole sample variance with an eigenvalue greater than 1 (Guttman-Kaiser criterion 30, see Methods). A list of the first 12 genes with top PPI-network PC-scores is presented in Table 1, indicating genes associated with cancer development and further confirming that, regardless of the approach followed, genes involved with cancer origin and progression are DEG-DMGs.
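A minimal sketch of the PC1 step is given below, with individuals as rows and genes as columns. How the PC-scores themselves are built is described in Methods and is not reproduced here; as a stand-in, the sketch simply returns the PC1 loadings (one per gene) and the fraction of variance carried by PC1. The function name, the use of power iteration on the covariance matrix, and the toy data are all assumptions made for illustration, and the approach is only practical for a small number of columns.

```python
def first_principal_component(rows, n_iter=500):
    """PC1 loadings and explained-variance fraction via power iteration."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    # Sample covariance matrix between columns (genes).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
           for i in range(p)]
    v = [1.0 / p ** 0.5] * p
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    # Rayleigh quotient gives the leading eigenvalue for the unit vector v.
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(p)) for i in range(p))
    total_variance = sum(cov[i][i] for i in range(p))
    return v, eigenvalue / total_variance


# Invented data: four control-like rows, then three patient-like rows; three genes.
toy_matrix = [(0.0, 0.0, 0.1), (0.1, 0.0, 0.0), (0.0, 0.1, 0.0), (0.1, 0.1, 0.1),
              (2.0, 0.5, 0.0), (2.1, 0.6, 0.1), (1.9, 0.4, 0.1)]
loadings, explained = first_principal_component(toy_matrix)
ranking = sorted(range(len(loadings)), key=lambda j: abs(loadings[j]), reverse=True)
```

In the toy matrix the first gene separates the two groups most strongly and so receives the largest absolute loading, which is the intuition behind ranking genes by their contribution to PC1.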
Discussion
Data from this study reflect non-random methylation repatterning that targets gene networks reportedly associated with cancer development and risk. The majority of DNA methylation changes fall within intergenic regions of the genome, and only 4795 (including non-coding) of the 57241 annotated human genes were identified as DMGs. This result suggests that in patients with PALL, the methylation machinery may selectively target specific genes. The methylation signal is observed not only within gene-body regions of DMGs, but also (and frequently with high intensity) in upstream and downstream domains.
Network analysis of DMGs identified several KEGG pathways and genes associated with cancer. Relevant genes were identified as network hubs and grouped into clusters of network hubs carrying the highest network centrality measurements (Fig. 1 and 5). Presumably, disruption of a network hub by altering the gene, or others that regulate the hub, could alter the entire gene network 14,31. Thus, identification of hubs offers candidate targets in the search for potential biomarkers. The strong linearity trends observed in pairwise regression between the centrality measurements (Fig. 2) in the main hub cluster (Fig. 1A) suggest that genes from the cluster are non-randomly targeted by the action of methylation regulatory machinery during PALL development 14.
Clusters of hubs integrating PPI subnetworks comprise the backbone of a network. The essentiality of gene hubs in preserving the integrity of the interacting network is quantitatively expressed in network centrality statistics. For sub-networks of hubs (Fig. 1 and 5), higher centrality values and linear relationships between the centrality statistics of the network hubs reflect a higher number of reported biologically meaningful associations between the hubs and the other genes on the sub-network and the main network (Fig. 2).
Strong correspondence was seen between the network enrichment analyses derived from the PPI
networks of DMGs and DEGs (Fig. 3), supporting the non-random nature of methylation signals
within protein-coding regions in signaling pathways linked to cancer development. Although not
all DEGs are detected as DMGs and vice versa, the extensive overlap of enriched KEGG pathways
(Fig. 3) suggests a coordinated response of the methylation and gene-expression machineries.
This concerted regulatory response was statistically supported by Lin's concordance correlation
coefficient and the Kendall coefficient of concordance.
An example of the coordinated regulatory response of methylation and gene expression is seen
in the case of the EGFR gene, identified as a hub in the DMG network (Fig. 1). EGFR is a
tyrosine kinase that regulates autophagy via the PI3K/AKT1/mTOR, RAS/MAPK1/3 (enriched
pathways shown in Fig. 3A and B, and in Fig. 5A and E), and STAT3 signaling pathways 32,33.
Although EGFR was not a reported DEG, its activators, EPIDERMAL GROWTH FACTOR
(EGF, Fig. 5B) and EGFL7, were identified as both DMGs and DEGs. EGFL7 is reported to be a
key factor in the regulation of the EGFR signaling pathway 34. Additionally, EGFL7 is a
secreted angiogenic factor that can result in pathologic angiogenesis and enhance tumor
migration and invasion via the NOTCH signaling pathway 23 (a pathway enriched in the
PPI-DMG network). The NOTCH pathway is a conserved intercellular signaling pathway that
regulates interactions between physically adjacent cells. In the set of patients with PALL,
NOTCH1 is reported as a DEG and DMG (Figs. 1A and 5B).
Another example of the gene network architecture of leukemia emerges by tracking the up-
and downstream interconnections of the genes PIK3CG (DEG-DMG) and PIK3CD (a DMG
network hub, Fig. 1) from the PI3K/AKT signaling pathway (enriched in the set of DEG-DMGs,
Fig. 5). Phosphatidylinositol-4,5-bisphosphate 3-kinase (PI3K) is a critical node in the B-cell
receptor (BCR, a DEG-DMG) signaling pathway, and its isoforms PIK3CD and PIK3CG are
involved in B-cell malignancy 35. Crosslinking CD19 with the BCR augments PI3K activation,
and the VAV proteins VAV1 (DMG), VAV2 (DEG-DMG), and VAV (DEG-DMG) also
contribute to PI3K activation downstream of BCR and related receptors 36. BCR and its
downstream signaling pathways, including Ras/Raf/MAPK, JAK/STAT3, and PI3K/AKT (all
enriched in PALL patients, Figs. 3 and 5), play important roles in the malignant transformation
of leukemia 37.
Our analysis also considered gene regulatory domains upstream and downstream of
gene-body regions and, in particular, enhancer regions. The set of genes targeted by DMERs
does not form a PPI network, but these genes are found in signaling pathways or act as
regulators of them. As in the previous analyses, enhancer methylation repatterning identifies
genes known to be involved in cancer development (Fig. 6B). For example, SMARCA4 (Fig. 7)
encodes an ATPase of the chromatin remodeling SWI/SNF complexes that is frequently found
upregulated in tumors 38 and represents a DEG-DMG in patients with PALL. The product of this
gene can bind BRCA1 (DEG-DMG) 39 and also regulates the expression of the tumorigenic
protein CD44 (DEG-DMG) 40.
PPI networks are only models used to identify highly interconnected players within the
underlying web architecture of genes involved in a specific biological process. Thus, results
from the application of more than one network model can complement one another, and
different network models do not necessarily overlap 100% in their sets of enriched pathways.
Deriving subsets of the DEG-DMG dataset by applying MCODE clustering increased confidence
in the previous results.
The integrative analyses of DMGs, DEGs, and the networks derived from them, as well as
DMERs (graphically summarized in Figs. 1 to 7), provided a consistent indication of a web of
interacting genes in cancer development and of an association between gene methylation
repatterning and gene expression changes. This association was supported by Spearman’s rank
correlation rho and the bivariate FGM copula (Supplementary Fig. S7), which implies a linear
dependence of the expected values of gene expression changes on methylation changes for the
set of DEG-DMGs.
Our analysis uncovered a stochastic-deterministic dependence relationship, in which larger
values of gene expression changes are probabilistically associated with larger values of
methylation changes (in the whole set of 1772 DEG-DMGs). Within the set of DEG-DMGs,
observed changes in gene expression were not statistically independent of the methylation
changes, showing an association with a significant linear trend (Supplementary Fig. S7). This
result may indicate that the relationship between gene methylation repatterning and altered gene
expression is also present at lower methylation density levels. Such a relationship can be
overlooked with over-stringent filtering of methylome data. Three analytical approaches assisted
in discovering this association: i) signal detection for DMP identification, ii) GLM-based group
comparison for DMG identification, and iii) copula modeling of stochastic dependence.
Our results demonstrate the potential of integrative network analysis of DMGs and DEGs
for the identification of biologically relevant methylation biomarkers. Numerous clusters of
interacting genes are detected in the sub-networks of hubs from PPI networks of DMGs and
DEGs, a few of which are described here. More detailed analysis of these data has allowed us to
propose three factors likely to be important for biomarker identification. A potential biomarker
must 1) be a DMG or a DEG-DMG with one or more well-defined differential methylation
pattern(s) on the gene-body, or upstream or downstream of the gene-body; 2) belong to one or
more gene pathways that are biologically relevant for leukemia and that, simultaneously, show
enrichment in the PPI networks of DMGs and DEGs; and 3) represent a hub or be biologically
connected to a relevant hub. Genes meeting these guidelines make up the subnetworks of hubs
shown in Figs. 1B and 4C-D, and the list of potential biomarkers can be extended using the
information provided in Supplementary Tables S1 and S2.
Intersection of the identified networks with available data from independent studies of
cancer further supports the potential of our approach for identifying robust disease biomarkers.
However, while the intersection of methylome and gene expression data with cancer-relevant
gene networks is compelling, we cannot eliminate the possibility that these outcomes are
influenced by the relative abundance of cancer-related networks within the various databases
currently available. To help circumvent this limitation, we proposed ranking the DEG-DMGs
based on their power to discriminate the disease state from the healthy state.
Potential biomarkers can be scored with the application of PCA (Table 1 and
Supplementary Table S2). In this study, the first PC was sufficient to build a PC-score of
DEG-DMGs based on gene-body methylation signal intensity. PC-scores identify cancer-related
genes not identified by the PPI network approach, although not all relevant genes were
identifiable, e.g., NOTCH1. Within a long gene like NOTCH1, the non-homogeneous distribution
of the gene-body methylation signal (Fig. 6A) can result in an apparently low-density
methylation signal globally, even when the signal is high locally. Nevertheless, the PC-score
provides an acceptable complement to the PPI network approach. Results obtained with the
approach proposed here support its application to the identification of reliable and stable
biomarkers for cancer diagnosis and prognosis. Lists of genes relevant as biomarker candidates
for leukemia (several of which have already been proposed as biomarkers by others) are
provided in the Supplementary Tables online.
Materials and Methods
Methylation and gene expression datasets
The datasets of genome-wide methylated and unmethylated read counts (for each cytosine site)
from normal CD19+ blood cell donors (NB) and from patients with pediatric acute lymphoblastic
leukemia (PALL) were downloaded from the Gene Expression Omnibus (GEO) database.
DMPs were estimated for controls (NB, GEO accessions GSM1978783 to GSM1978786) and for
patients (ALL cells, GEO accessions GSM1978759 to GSM1978761) relative to a
reference group of four independent normal CD19+ blood cell donors (GEO accessions
GSM1978787 to GSM1978790). The datasets of DEGs from the group of patients with PALL
were taken from the Supplementary information provided in the previous study 7.
A list of 2,579 cancer-related genes compiled by the Bushman Lab
(http://www.bushmanlab.org/links/genelists) was used to identify oncogenes among the DEG-DMGs.
Methylation analysis
Methylation analysis was performed using our in-house pipeline Methyl-IT version 0.3.1 (an R
package available at https://git.psu.edu/genomath/MethylIT). Estimation of differentially
methylated positions (DMPs) is consistent with the classical approach using Fisher’s exact test,
except for a further application of signal detection (see examples of methylation analysis with
MethylIT at https://github.com/genomaths/MethylIT). The need for the application of signal
detection in cancer research was pointed out decades ago 41. Here, signal detection was applied
according to standard practice in current implementations of clinical diagnostic
tests 42–44. That is, optimal cutoff values of the methylation signal were estimated on the
receiver operating characteristic (ROC) curves based on the Youden index 42 and applied to
identify DMPs. The decision of whether a DMP is detected by Fisher’s exact test (or any other
statistical test implemented in Methyl-IT) is based on the optimal cutoff value 43.
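The cutoff selection described above can be illustrated with a minimal sketch. It assumes that a larger methylation signal favors the patient class and that the cutoff maximizes the Youden index J = sensitivity + specificity - 1; the function name and the signal values are hypothetical, and this is not the Methyl-IT implementation.

```python
def youden_cutoff(control_signal, patient_signal):
    """Return (cutoff, J) maximizing J = sensitivity + specificity - 1.

    A site is called positive when its signal is >= cutoff (an assumption
    of this sketch). Ties in J keep the smallest cutoff.
    """
    candidates = sorted(set(control_signal) | set(patient_signal))
    best_cutoff, best_j = None, -1.0
    for c in candidates:
        sensitivity = sum(x >= c for x in patient_signal) / len(patient_signal)
        specificity = sum(x < c for x in control_signal) / len(control_signal)
        j = sensitivity + specificity - 1.0
        if j > best_j:
            best_cutoff, best_j = c, j
    return best_cutoff, best_j


# Hypothetical divergence-like signal values at cytosine sites.
control = [0.1, 0.2, 0.3, 0.4]
patient = [0.35, 0.6, 0.8, 0.9]
cutoff, j_stat = youden_cutoff(control, patient)
print(cutoff, j_stat)
```

Sites whose signal reaches the returned cutoff would then be the candidates passed on to the statistical test.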
Estimation of differentially methylated regions (DMRs). Regression analysis with
generalized linear models (GLMs) with logarithmic link, implemented in the MethylIT function
countTest, was applied to test the difference between groups in the DMP numbers/counts at
specified genomic regions, regardless of the direction of methylation change. Here, the concept
of DMR is generalized and is not limited to any specific genomic region found with a specific
clustering algorithm. It can be applied to any naturally or algorithmically defined genomic
region. For example, an exon region identified statistically as differentially methylated by
using a GLM is a DMR. In particular, a DMR spanning a whole gene-body region is called a
DMG. DMGs were estimated from group comparisons of the number of DMPs on gene-body
regions between control (CD19+ blood cell donors) and ALL cells based on generalized linear
regression.
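As an illustration of the group comparison of DMP counts, the sketch below works through the simplest case only: a Poisson GLM with log link and a single group indicator, for which the likelihood ratio test has a closed form. It is not the countTest implementation (which also covers quasi-Poisson and negative binomial models), and the function name and counts are hypothetical.

```python
import math


def poisson_group_lrt(control_counts, patient_counts):
    """Likelihood ratio test for a two-group Poisson GLM with log link.

    Returns (log2_fold_change, lrt_statistic, p_value). Assumes both group
    means are non-zero so that the fold change is defined.
    """
    groups = [control_counts, patient_counts]
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # With one group indicator, each fitted mean equals the group sample mean,
    # so the deviance difference (null model vs. group model) is closed-form.
    stat = 0.0
    for g in groups:
        total = sum(g)
        if total > 0:
            stat += 2.0 * total * math.log((total / len(g)) / grand_mean)
    stat = max(stat, 0.0)  # guard against tiny negative rounding errors
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    mean_control = sum(control_counts) / len(control_counts)
    mean_patient = sum(patient_counts) / len(patient_counts)
    log2_fc = math.log2(mean_patient / mean_control)
    return log2_fc, stat, p_value


# Hypothetical DMP counts on one gene body: four controls, three patients.
log2_fc, stat, p_value = poisson_group_lrt([2, 3, 1, 2], [20, 25, 22])
print(round(log2_fc, 2), round(stat, 1), p_value)
```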
The fitting algorithms provided by the functions glm (R package stats) and glm.nb (R
package MASS) were used for Poisson (PR) and Quasi-Poisson (QPR) regression and for Negative
Binomial (NBR) regression, respectively. These algorithms are implemented in
the Methyl-IT functions countTest and countTest2, which differ only in the way the
weights used in the GLM with NBR are estimated. The following countTest parameters were
used: minimum DMP count per individual (8 DMPs), test P-value from a likelihood ratio test
(test = “LRT”), P-value adjustment method (Benjamini & Hochberg 45), cutoff for the P-value
(α = 0.05), and log2 fold change for the group DMP number mean difference > 1.
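The filtering criteria listed above can be sketched as follows. The Benjamini-Hochberg adjustment is standard; how the minimum count is evaluated (here, the mean DMP count per individual) is an assumption of this sketch, and the gene names, counts, and P-values are hypothetical.

```python
import math


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):  # walk from the largest P-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def select_dmgs(genes, alpha=0.05, min_log2fc=1.0, min_count=8):
    """Keep genes passing the count, adjusted P-value and fold-change filters."""
    adjusted = bh_adjust([g["pvalue"] for g in genes])
    selected = []
    for gene, padj in zip(genes, adjusted):
        control, patient = gene["control"], gene["patient"]
        counts = control + patient
        mean_count = sum(counts) / len(counts)  # assumed reading of the filter
        log2fc = math.log2((sum(patient) / len(patient)) /
                           (sum(control) / len(control)))
        # abs() because the direction of the change is ignored.
        if mean_count >= min_count and padj < alpha and abs(log2fc) > min_log2fc:
            selected.append(gene["name"])
    return selected


# Hypothetical per-gene DMP counts and unadjusted LRT P-values.
genes = [
    {"name": "gene_A", "control": [4, 5, 3, 4], "patient": [30, 28, 32], "pvalue": 0.0002},
    {"name": "gene_B", "control": [10, 12, 11, 9], "patient": [12, 11, 13], "pvalue": 0.4},
    {"name": "gene_C", "control": [1, 1, 2, 1], "patient": [4, 5, 6], "pvalue": 0.001},
    {"name": "gene_D", "control": [10, 9, 11, 10], "patient": [25, 24, 26], "pvalue": 0.03},
]
selected = select_dmgs(genes)
print(selected)
```

In practice, low-count genes may be removed before testing, which changes the number of tests entering the adjustment; the sketch adjusts over all listed genes for simplicity.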
The methylation analysis of genomic regions to identify differentially methylated enhancer
regions (DMERs) was performed on a set of enhancers reported by Hnisz et al. 46. Usually, the
size of the genomic region covered by an enhancer varies depending on the tissue type. In the
current case, for each enhancer we analyzed the maximum region spanning all sizes reported for
different tissues.
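A minimal sketch of the region choice described above: for each enhancer, take the smallest interval that covers every tissue-specific interval. The enhancer names and coordinates are hypothetical, and all intervals of one enhancer are assumed to lie on the same chromosome.

```python
def spanning_region(intervals):
    """Smallest (start, end) interval covering every reported interval."""
    starts = [start for start, _ in intervals]
    ends = [end for _, end in intervals]
    return min(starts), max(ends)


# Hypothetical tissue-specific coordinates reported for two enhancers.
reported = {
    "enhancer_1": [(1000, 1500), (900, 1400), (1100, 1800)],
    "enhancer_2": [(5000, 5600)],
}
merged = {name: spanning_region(iv) for name, iv in reported.items()}
print(merged)
```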
Network analysis
Protein-protein interaction (PPI) networks were built with the STRING app of Cytoscape 11,12.
Network analyses were conducted in Cytoscape. When the number of genes exceeded 100 for
network analysis, biologically meaningful web connections were difficult to visualize.
Biologically relevant subsets of genes were obtained from the whole set of genes (DMGs, DEGs,
or DEG-DMGs) by using the R packages NBEA and NEAT 9,10. Alternatively, the Cytoscape app
MCODE was used for subsetting an entire network 47. PPI subnetworks from four network
modules identified with MCODE are shown. The MCODE parameters were: degree cutoff, 10; node
density cutoff, 0.01; node score cutoff, 0.2; K-score, 10; and max. depth, 100. The K-means
clustering algorithm was applied to each subnetwork to obtain subnetworks of hubs using the
following node attributes for clustering: betweenness-centrality, degree, closeness-centrality,
and clustering coefficient.
Network hubs were identified based on betweenness-centrality and node degree, where
the size of each node (in the PPI network) is proportional to its value of betweenness-centrality
and the label font size is proportional to its node degree. Network enrichment analysis in KEGG
pathways follows each graphic subnetwork.
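The four node attributes used above for clustering and hub identification can be computed on a small undirected graph as in the following sketch. It uses breadth-first search with Brandes-style accumulation for betweenness; the node labels and edges are hypothetical and do not come from STRING, and Cytoscape may apply different normalizations.

```python
from collections import deque


def _bfs(graph, source):
    """Breadth-first search recording distances and shortest-path counts."""
    dist = {source: 0}
    sigma = {v: 0 for v in graph}
    sigma[source] = 1
    preds = {v: [] for v in graph}
    order = []
    queue = deque([source])
    while queue:
        v = queue.popleft()
        order.append(v)
        for w in graph[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
            if dist[w] == dist[v] + 1:
                sigma[w] += sigma[v]
                preds[w].append(v)
    return dist, sigma, preds, order


def centralities(graph):
    """Degree, betweenness, closeness and clustering coefficient per node."""
    betweenness = {v: 0.0 for v in graph}
    closeness = {}
    for s in graph:
        dist, sigma, preds, order = _bfs(graph, s)
        total = sum(dist.values())
        closeness[s] = (len(dist) - 1) / total if total > 0 else 0.0
        delta = {v: 0.0 for v in graph}
        for w in reversed(order):  # accumulate pair dependencies
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                betweenness[w] += delta[w]
    stats = {}
    for v, nbrs in graph.items():
        k = len(nbrs)
        links = sum(1 for a in nbrs for b in nbrs if a < b and b in graph[a])
        clustering = 2.0 * links / (k * (k - 1)) if k > 1 else 0.0
        stats[v] = {
            "degree": k,
            "betweenness": betweenness[v] / 2.0,  # undirected: pairs counted twice
            "closeness": closeness[v],
            "clustering": clustering,
        }
    return stats


# Hypothetical toy network: node "A" plays the role of a hub.
graph = {"A": {"B", "C", "D"}, "B": {"A"}, "C": {"A", "E"}, "D": {"A"}, "E": {"C"}}
stats = centralities(graph)
hubs = sorted(graph, key=lambda v: stats[v]["betweenness"], reverse=True)
print(hubs[0], stats[hubs[0]])
```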
Concordance test for DEG and DMG enrichments on KEGG pathways
The concordance between DEG and DMG enrichments on KEGG pathways, derived from the
PPI networks via the STRING app in Cytoscape, was evaluated with Lin's
concordance correlation coefficient ($\rho_{cc}$) and the Kendall coefficient of concordance
($\rho_{KC}$). The R package agRee was used for the bootstrap Bayesian estimation of the
$\rho_{cc}$ point value and confidence interval 48, while the R package vegan was used to
compute $\rho_{KC}$ through a permutation test 49.
To perform the concordance test, a score was assigned to each enriched KEGG pathway
from DEGs and DMGs based on the number of genes in the pathway and on its
statistical significance, given by its FDR p-value. Only pathways with an FDR p-value less than
0.0004 were considered. A new variable, statistical significance ($sig$), was defined according
to the scale $sig = 1, 2, 3$ for p-values in the intervals $(10^{-5}, 10^{-4})$,
$(10^{-6}, 10^{-5})$, and $(0, 10^{-6})$, respectively. The value $sig = 0$ was assigned to
pathways not enriched in one of the groups, DEGs or DMGs. For example, Phosphatidylinositol
signaling system was not enriched in the set of PPI-DMGs and, consequently, $sig_{DMG} = 0$,
but it was enriched in the set of PPI-DEGs with $sig_{DEG} = 3$. Then, a new variable, named
pathway score, was defined according to the formula:
$$P = (\#\ \text{of genes in pathway}) \times sig \qquad (1)$$
We use the notation $P_k^i$ to indicate that the rating was performed for pathway $i$
identified on the gene set $k$ ($k$ = DMGs, DEGs). That is, the pathway score $P$ carries
information not only on how many genes are found on each pathway but also on the
statistical significance of the enrichment. The estimated values of $P_{DEG}^i$ and
$P_{DMG}^i$ for each enriched pathway $i$ (from the DEG and DMG sets, respectively) were used
in the concordance tests and in the Bland-Altman plot (Fig. 3E).
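The pathway score of equation (1) and a concordance measure between the DEG and DMG scores can be sketched as follows. The concordance function is the plain sample estimator of Lin's coefficient, not the Bayesian bootstrap estimate from agRee. Treating interval boundaries as half-open and returning 0 for p-values of 10^-4 or above are assumptions of this sketch, and the pathway names, gene counts, and p-values are hypothetical.

```python
def sig_level(fdr):
    """Map an FDR p-value to the significance scale sig = 0, 1, 2, 3.

    None means the pathway was not enriched in that gene set (sig = 0).
    Values of 1e-4 or above fall outside the intervals given in the text;
    returning 0 for them is an assumption of this sketch.
    """
    if fdr is None:
        return 0
    if fdr < 1e-6:
        return 3
    if fdr < 1e-5:
        return 2
    if fdr < 1e-4:
        return 1
    return 0


def pathway_score(n_genes, fdr):
    """Equation (1): number of genes in the pathway times sig."""
    return n_genes * sig_level(fdr)


def lin_ccc(x, y):
    """Sample estimate of Lin's concordance correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x) / n
    syy = sum((b - my) ** 2 for b in y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return 2.0 * sxy / (sxx + syy + (mx - my) ** 2)


# Hypothetical enrichment results: (number of genes, FDR p-value) or None.
enrichment = {
    "pathway_1": {"DEG": (40, 1e-8), "DMG": (35, 5e-7)},
    "pathway_2": {"DEG": (12, 3e-5), "DMG": (20, 2e-6)},
    "pathway_3": {"DEG": (25, 1e-7), "DMG": None},
}
scores = {}
for name, result in enrichment.items():
    deg = pathway_score(*result["DEG"]) if result["DEG"] else 0
    dmg = pathway_score(*result["DMG"]) if result["DMG"] else 0
    scores[name] = (deg, dmg)

ccc = lin_ccc([s[0] for s in scores.values()], [s[1] for s in scores.values()])
print(scores, round(ccc, 3))
```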
Stochastic association between methylation and gene expression
To investigate such an association, the methylation density of gene regions simultaneously
identified as DEGs and DMGs was expressed in terms of different magnitudes: 1) $p^{density}_i$,
density of methylation levels ($i$: control or patients); 2) $TV^{density}_i$, density of the
difference of methylation levels between each group (control or patients) and an independent
group of four healthy individuals (reference group); 3) $TVD^{density}_i$, $TV$ with Bayesian
correction; and 4) $HD^{density}_i$, density of Hellinger divergence, where $i$ denotes the
group mean, control or patient. The density in 1000 bp of a variable $X$ at a given gene region
was defined as the sum of the magnitude $X$ divided by the length of the region and multiplied
by 1000. The differences of methylation densities between control and patient groups were
estimated as the absolute difference
$|\Delta X^{density}| = |X^{density}_{control} - X^{density}_{patient}|$, where $X^{density}_i$
represents one of the mentioned variables. The Methyl-IT R package provides the functions
needed to obtain all the variables mentioned here (https://github.com/genomaths/MethylIT and
https://github.com/genomaths/MethylIT.utils).
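The density definition above amounts to a short calculation, sketched here with hypothetical per-site values and gene coordinates. Treating the region as 1-based and inclusive is an assumption of this sketch; it is not the Methyl-IT code.

```python
def density_per_kb(values_at_sites, region_start, region_end):
    """Sum of a per-site magnitude divided by region length, times 1000."""
    length = region_end - region_start + 1  # inclusive coordinates (assumed)
    return 1000.0 * sum(values_at_sites) / length


# Hypothetical per-site group means of a magnitude X on one 4000-bp gene body.
control_density = density_per_kb([0.5, 0.5, 1.0], 1001, 5000)
patient_density = density_per_kb([1.0, 1.0, 1.0, 1.0], 1001, 5000)
abs_delta = abs(control_density - patient_density)
print(control_density, patient_density, abs_delta)
```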
Spearman's rank correlation $\rho$ (rho) was estimated to evaluate the association between the
pairs of variables $|\Delta log_2FC|$ versus $|\Delta p^{density}|$, $|\Delta TV^{density}|$,
$|\Delta TVD^{density}|$, and $|\Delta HD^{density}|$.
Since correlation analysis only measures the degree of dependence (mainly linear) but does not
reveal the structure of the dependence, we further investigated the structural dependence
between these variables with the Farlie-Gumbel-Morgenstern (FGM) copula. FGM
copula model estimation was performed with the R package copula 50.
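To make the link between Spearman's rho and the FGM copula concrete, the sketch below uses the standard FGM form C(u, v) = uv[1 + θ(1 - u)(1 - v)] with θ in [-1, 1], for which Spearman's rho equals θ/3 and the conditional mean E[V | U = u] = 1/2 + θ(2u - 1)/6 is linear in u. The moment-based estimate of θ is a simplification and not the fitting procedure of the R package copula; the data are hypothetical.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with ties given their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def fgm_cdf(u, v, theta):
    """FGM copula C(u, v) = u v [1 + theta (1 - u)(1 - v)]."""
    return u * v * (1.0 + theta * (1.0 - u) * (1.0 - v))


def fgm_theta_from_rho(rho):
    """Moment estimate: rho = theta / 3, clipped to the valid range [-1, 1]."""
    return max(-1.0, min(1.0, 3.0 * rho))


def fgm_conditional_mean(u, theta):
    """E[V | U = u] under the FGM copula, linear in u."""
    return 0.5 + theta * (2.0 * u - 1.0) / 6.0


# Hypothetical absolute changes in methylation density and in expression.
delta_methylation = [0.1, 0.4, 0.2, 0.9, 0.5, 0.7]
delta_expression = [1.2, 1.1, 1.5, 2.4, 1.0, 2.0]
rho = spearman_rho(delta_methylation, delta_expression)
theta = fgm_theta_from_rho(rho)
print(round(rho, 3), round(theta, 3), fgm_conditional_mean(0.9, theta))
```

Because θ is bounded, the FGM family can only represent weak dependence (absolute Spearman's rho of at most 1/3), which is why the estimate is clipped.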
Principal component analysis (PCA)
PCA is a standard statistical procedure used to reduce data dimensionality, to represent the
set of DMGs by new orthogonal (uncorrelated) variables, the principal components (PCs) 51, and
to identify the variables with the main contribution to the PCs carrying most of the sample
variance. Herein, a PC-based score (PC-score) was built to rank the DEG-DMGs by their
power to discriminate between the disease state and the healthy state. Each individual was
represented as a vector in the 1775-dimensional space of DEG-DMGs. Two PC-scores were
estimated: the first based on the density of Hellinger divergence on the gene-body and the
second based on the density of the absolute value of the difference of methylation levels. The
density of a magnitude x is defined as the sum of x at each DMP divided by the gene width (in
base-pairs). The first principal component (PC1) was used to build a PC-based score for the
DEG-DMG set, since it had an eigenvalue (variance) greater than 1 and carried more than 85% of
the whole sample variance (Guttman-Kaiser criterion 30). The PC-score was built using the
absolute values of the coefficients (loadings) in PC1 for each variable (gene). Since the sum of
the squared variable loadings over a principal component is equal to 1, the squared loadings
give the proportion of variance of one variable explained by the given principal component.
Thus, the greater the PC-score value, the greater the discriminatory power carried by the gene.
The density of HD on the gene-body was computed with the MethylIT function
getGRegionsStat and the principal components with the function pcaLDA, which conveniently
applies PCA by calling the function prcomp from the R package ‘stats’.
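The PC-score described above can be sketched in a few lines: center the data, form the covariance matrix of the genes, extract the first principal component by power iteration, and rank genes by the absolute value of their PC1 loadings. The sketch assumes centered but unscaled variables and does not reproduce pcaLDA or prcomp; the gene names and density values are hypothetical.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (loadings, eigenvalue, explained_fraction) of PC1.

    rows: one list per individual, one column per gene. Power iteration on
    the sample covariance matrix is used in place of a full decomposition.
    """
    n, p = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(p)]
           for a in range(p)]
    vec = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        nxt = [sum(cov[a][b] * vec[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(v * v for v in nxt))
        if norm == 0.0:
            break
        vec = [v / norm for v in nxt]
    eigenvalue = sum(vec[a] * cov[a][b] * vec[b]
                     for a in range(p) for b in range(p))
    explained = eigenvalue / sum(cov[a][a] for a in range(p))
    return vec, eigenvalue, explained


# Hypothetical densities: two controls followed by two patients, three genes.
genes = ["gene_1", "gene_2", "gene_3"]
density = [
    [0.0, 1.0, 5.0],
    [0.0, 1.2, 5.0],
    [10.0, 1.1, 5.0],
    [10.0, 0.9, 5.0],
]
loadings, eigenvalue, explained = first_principal_component(density)
pc_score = {g: abs(l) for g, l in zip(genes, loadings)}
ranking = sorted(genes, key=lambda g: pc_score[g], reverse=True)
print(ranking, round(explained, 4))
```

In this toy example only gene_1 separates the two groups, so it receives the largest PC-score.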
Acknowledgements
We wish to thank Dr. Xiaodong Yang and Thomas Maher for valuable discussions during
the development of this study. This study was supported by a grant from the Bill and Melinda
Gates Foundation (OPP1088661) to S.A.M.
Author contributions
R.S. designed experiments and conducted mathematical and statistical analyses. S.M. assessed
experiments and edited the manuscript.
Competing interests
The authors declare no competing interests.
Data availability
All the methylome datasets and software used in this work are publicly available. The MethylIT
R package used in the DMP and DMG estimations, as well as several examples of how to use
Methyl-IT, are available at GitHub: https://github.com/genomaths/MethylIT. The datasets
supporting the conclusions of this report are included within the Supplementary material.

References
1. Suresh, N. T. & Ashok, S. Comparative Strategy for the Statistical & Network based
Analysis of Biological Networks. Procedia Comput. Sci. 143, 165–180 (2018).
2. Hogan, L. E. et al. Integrated genomic analysis of relapsed childhood acute lymphoblastic
leukemia reveals therapeutic strategies. Blood 118, 5218–26 (2011).
3. Nordlund, J. et al. Genome-wide signatures of differential DNA methylation in pediatric
acute lymphoblastic leukemia. Genome Biol. 14, r105 (2013).
4. Chatterton, Z. et al. Epigenetic deregulation in pediatric acute lymphoblastic leukemia.
Epigenetics 9, 459–67 (2014).
5. Nordlund, J. & Syvänen, A. C. Epigenetics in pediatric acute lymphoblastic leukemia.
Semin. Cancer Biol. 51, 129–138 (2018).
6. Rahmani, M., Talebi, M., Hagh, M. F., Feizi, A. A. H. & Solali, S. Aberrant DNA
methylation of key genes and Acute Lymphoblastic Leukemia. Biomedicine and
Pharmacotherapy 97, 1493–1500 (2018).
7. Wahlberg, P. et al. DNA methylome analysis of acute lymphoblastic leukemia cells
reveals stochastic de novo DNA methylation in CpG islands. Epigenomics 8, 1367–1387
(2016).
8. Tate, J. G. et al. COSMIC: the Catalogue Of Somatic Mutations In Cancer. Nucleic Acids
Res. (2018). doi:10.1093/nar/gky1015
9. Geistlinger, L. EnrichmentBrowser: Seamless navigation through combined results of
set-based and network-based enrichment analysis. R package version 2.1.0. 1–15 (2015).
10. Signorelli, M. et al. NEAT: an efficient network enrichment analysis test. BMC
Bioinformatics 17, 352 (2016).
11. Shannon, P. et al. Cytoscape: A software Environment for integrated models of
biomolecular interaction networks. Genome Res. 13, 2498–2504 (2003).
12. Szklarczyk, D. et al. The STRING database in 2017: Quality-controlled protein-protein
association networks, made broadly accessible. Nucleic Acids Res. 45, D362–D368
(2017).
13. Jalili, M. et al. Evolution of centrality measurements for the detection of essential proteins
in biological networks. Frontiers in Physiology 7, 375 (2016).
14. Pavlopoulos, G. A. et al. Using graph theory to analyze biological networks. BioData
Mining 4, 10 (2011).
15. Martin Bland, J. & Altman, D. Statistical methods for assessing agreement between two
methods of clinical measurement. Lancet 327, 307–310 (1986).
16. Huang, Y.-C. C. et al. Epigenetic regulation of NOTCH1 and NOTCH3 by KMT2A
inhibits glioma proliferation. Oncotarget 5, 63110–63120 (2017).
17. Waibel, M. et al. Epigenetic targeting of Notch1-driven transcription using the HDACi
panobinostat is a potential therapy against T-cell acute lymphoblastic leukemia. Leukemia
32, 237–241 (2018).
18. Eberth, S. et al. Epigenetic regulation of CD44 in Hodgkin and non-Hodgkin lymphoma.
BMC Cancer 10, 517 (2010).
19. Müller, I., Wischnewski, F., Pantel, K. & Schwarzenbach, H. Promoter- and cell-specific
epigenetic regulation of CD44, Cyclin D2, GLIPR1 and PTEN by Methyl-CpG binding
proteins and histone modifications. BMC Cancer 10, 297 (2010).
20. Chu, L. H. & Chen, B. Sen. Construction of a cancer-perturbed protein-protein interaction
network for discovery of apoptosis drug targets. BMC Syst. Biol. 2, 56 (2008).
21. Xue, Z. et al. MAP3K1 and MAP2K4 mutations are associated with sensitivity to MEK
inhibitors in multiple cancer models. Cell Res. 28, 719–729 (2018).
22. Lou, S. K. et al. Whole-genome bisulfite sequencing of multiple individuals reveals
complementary roles of promoter and gene body methylation in transcriptional regulation.
Genome Biol. 15, (2014).
23. Wang, J. et al. EGFL7 participates in regulating biological behavior of growth
hormone–secreting pituitary adenomas via Notch2/DLL3 signaling pathway. Tumor Biol. 39,
1010428317706203 (2017).
24. Yang, C. et al. Increased expression of epidermal growth factor-like domain-containing
protein 7 is predictive of poor prognosis in patients with hepatocellular carcinoma. J.
Cancer Res. Ther. 14, 867–872 (2018).
25. Tomasetti, M. et al. MiR-126 in intestinal-type sinonasal adenocarcinomas: exosomal
transfer of MiR-126 promotes anti-tumour responses. BMC Cancer 18, 896 (2018).
26. Song, L. et al. Silencing LPAATβ inhibits tumor growth of cisplatin-resistant human
osteosarcoma in vivo and in vitro. Int. J. Oncol. 50, 535–544 (2017).
27. Triantafyllou, E.-A., Georgatsou, E., Mylonis, I., Simos, G. & Paraskeva, E. Expression of
AGPAT2, an enzyme involved in the glycerophospholipid/triacylglycerol biosynthesis
pathway, is directly regulated by HIF-1 and promotes survival and etoposide resistance of
cancer cells under hypoxia. Biochim. Biophys. Acta - Mol. Cell Biol. Lipids 1863,
1142–1152 (2018).
28. Kimeldorf, G. & Sampson, A. R. A framework for positive dependence. Ann. Inst. Stat.
Math. 41, 31–45 (1989).
29. Lai, C. D. Morgenstern’s bivariate distribution and its application to point processes. J.
Math. Anal. Appl. 65, 247–256 (1978).
30. Jackson, D. A. Stopping Rules in Principal Components Analysis: A Comparison of
Heuristical and Statistical Approaches. Ecology 74, 2204–2214 (1993).
31. Zotenko, E., Mestre, J., O’Leary, D. P. & Przytycka, T. M. Why do hubs in the yeast
protein interaction network tend to be essential: Reexamining the connection between the
network topology and essentiality. PLoS Comput. Biol. 4, (2008).
32. Li, H., You, L., Xie, J., Pan, H. & Han, W. The roles of subcellularly located EGFR in
autophagy. Cell. Signal. 35, 223–230 (2017).
33. Sooro, M. A., Zhang, N. & Zhang, P. Targeting EGFR-mediated autophagy as a potential
strategy for cancer therapy. Int. J. Cancer 143, 2116–2125 (2018).
34. Liu, Q. et al. Role of EGFL7/EGFR-signaling pathway in migration and invasion of
growth hormone-producing pituitary adenomas. Sci. China Life Sci. 61, 893–901 (2018).
35. Piddock, R. E. et al. PI3Kδ and PI3Kγ isoforms have distinct functions in regulating
pro-tumoural signalling in the multiple myeloma microenvironment. Blood Cancer J. 7,
e539 (2017).
36. Deane, J. A. & Fruman, D. A. Phosphoinositide 3-kinase: diverse roles in
immune cell activation. Annu. Rev. Immunol. 22, 563–598 (2004).
37. Burger, J. A. & Wiestner, A. Targeting B cell receptor signalling in cancer: preclinical and
clinical advances. Nat. Rev. Cancer 18, 148–167 (2018).
38. Guerrero-Martínez, J. A. & Reyes, J. C. High expression of SMARCA4 or SMARCA2 is
frequently associated with an opposite prognosis in cancer. Sci. Rep. 8, 2043 (2018).
39. Hill, D. A., De La Serna, I. L., Veal, T. M. & Imbalzano, A. N. BRCA1 interacts with
dominant negative SWI/SNF enzymes without affecting homologous recombination or
radiation-induced gene activation of p21 or Mdm2. J. Cell. Biochem. 91, 987–998 (2004).
40. Strobeck, M. W. et al. The BRG-1 subunit of the SWI/SNF complex regulates CD44
expression. J. Biol. Chem. 276, 9273–9278 (2001).
41. Youden, W. J. Index for rating diagnostic tests. Cancer 3, 32–35 (1950).
42. Carter, J. V., Pan, J., Rai, S. N. & Galandiuk, S. ROC-ing along: Evaluation and
interpretation of receiver operating characteristic curves. Surgery 159, 1638–1645 (2016).
43. López-Ratón, M., Rodríguez-Álvarez, M. X., Cadarso-Suárez, C. & Gude-Sampedro, F.
OptimalCutpoints: an R package for selecting optimal cutpoints in diagnostic tests.
J. Stat. Softw. 61, 1–36 (2014).
44. Hippenstiel, R. D. Detection theory: applications and digital signal processing. (CRC
Press, 2001).
45. Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and powerful
approach to multiple testing. J. R. Stat. Soc. Ser. B 57, 289–300 (1995).
46. Hnisz, D. et al. Super-enhancers in the control of cell identity and disease. Cell 155,
934–947 (2013).
47. Bader, G. D. & Hogue, C. W. V. An automated method for finding molecular complexes
in large protein interaction networks. BMC Bioinformatics 4, 2 (2003).
48. Feng, D., Baumgartner, R. & Svetnik, V. A Bayesian framework for estimating the
concordance correlation coefficient using skew-elliptical distributions. Int. J. Biostat. 14,
(2018).
49. Oksanen, J. et al. vegan: Community Ecology Package. (2018).
50. Yan, J. Enjoy the Joy of Copulas: With a Package copula. J. Stat. Softw. 21, 1–21
(2007).
51. Stevens, J. P. Applied Multivariate Statistics for the Social Sciences. (Routledge
Academic, 2009).

Tables
Table 1. First 12 genes with the top PC-score based on density of methylation level differences
and density of Hellinger divergences*.

Density of meth. level differences               Density of Hellinger divergence
Gene      PC-score   Signal density variation†   Gene      PC-score   Signal density variation
COX8C     53.23      23.30                       COX8C     55.10      23.30
MSC       27.02      10.50                       MSC       22.14      10.50
MPEG1     16.11      8.87                        MPEG1     17.36      8.87
P2RY1     15.47      5.80                        BLACE     12.97      6.37
CLEC11A   15.20      6.60                        CTGF      11.96      3.75
BLACE     13.20      6.37                        UHRF1     11.26      5.26
UHRF1     12.08      5.26                        P2RY1     11.02      5.80
EGFL7     11.95      5.64                        CMTM2     9.52       3.68
ID4       11.80      5.15                        CXCR5     9.34       4.63
CDK5R1    9.50       6.76                        ID4       9.31       5.15
CTGF      9.13       3.75                        DDIT4L    8.77       2.65

*The entire table and details are given in Supplementary Table S2.
†Signal density variation for each gene is given in the output of the MethylIT function countTest2.
This is the group mean difference of the normalized number of DMPs in 1 kb.
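The footnote's definition of signal density variation can be illustrated with a minimal Python sketch. It is not the MethylIT countTest2 implementation: the function names, the normalization (DMP count scaled to a 1 kb window by gene length), the direction of the difference (patient minus control), and the toy counts are all assumptions made for illustration.

```python
from statistics import mean


def dmp_density_per_kb(dmp_count, gene_length_bp):
    """Number of DMPs normalized to a 1 kb window (assumed normalization)."""
    return 1000.0 * dmp_count / gene_length_bp


def signal_density_variation(control_counts, patient_counts, gene_length_bp):
    """Group mean difference of normalized DMP counts for one gene.

    The sign convention (patient minus control) is an assumption.
    """
    control = mean(dmp_density_per_kb(c, gene_length_bp) for c in control_counts)
    patient = mean(dmp_density_per_kb(c, gene_length_bp) for c in patient_counts)
    return patient - control


# Hypothetical DMP counts for one 2 kb gene: two control and three patient samples.
variation = signal_density_variation([2, 4], [10, 14, 12], gene_length_bp=2000)
print(f"signal density variation: {variation:.2f} DMPs per kb")
```

With these toy counts the control group averages 1.5 DMPs per kb and the patient group 6.0, so the sketch reports a variation of 4.5.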
Supplementary Figures
Supplementary Figure S1. Distribution of methylation changes on chromosome and gene-body.
A, Distribution of methylation changes at DMP positions on selected chromosomes as viewed
within genome browser. B, boxplot of the means of methylation levels on chromosomes and at
genes. In all cases, patient (P) data are in blue and control (C) in green.

Supplementary Figure S2. PPI networks built on the subset of 285 network-related DMGs. The
size of each node is proportional to its value of betweenness centrality and the label font size is
proportional to its node degree. Node colors from light-green to red map the discrete scale of
logarithm base 2 of fold change in DMP number for the corresponding gene: light-green: [1, 2),
cyan: [2, 3), blue: [3, 4), and red: 5 or more. B, a subnetwork with minor hubs (101 DMGs). C, a
cluster (139 DMGs) integrated by two subnetworks.
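How the two node attributes in this caption could be derived from an edge list is sketched below in Python. The sketch computes betweenness centrality with Brandes' algorithm and node degree for an undirected, unweighted graph, then rescales both to drawing sizes. The toy star graph, the node names, and the size ranges are hypothetical, and the linear rescaling with a minimum size is an assumption; the style mapping used in Cytoscape for the figure may differ.

```python
from collections import deque


def betweenness_and_degree(edges):
    """Return (betweenness, degree) dicts for an undirected, unweighted graph."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    betweenness = dict.fromkeys(adj, 0.0)
    for source in adj:
        # Breadth-first search counting shortest paths from the source.
        dist = {source: 0}
        n_paths = dict.fromkeys(adj, 0)
        n_paths[source] = 1
        preds = {v: [] for v in adj}
        order = []
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    n_paths[w] += n_paths[v]
                    preds[w].append(v)
        # Accumulate pair dependencies from the farthest nodes inward.
        delta = dict.fromkeys(adj, 0.0)
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += n_paths[v] / n_paths[w] * (1.0 + delta[w])
            if w != source:
                betweenness[w] += delta[w]

    # Each unordered pair was visited from both endpoints, so halve the totals.
    betweenness = {v: b / 2.0 for v, b in betweenness.items()}
    degree = {v: len(neighbors) for v, neighbors in adj.items()}
    return betweenness, degree


def rescale(values, smallest, largest):
    """Linearly map values onto [smallest, largest] relative to the maximum."""
    top = max(values.values()) or 1.0
    return {k: smallest + (largest - smallest) * v / top for k, v in values.items()}


# Hypothetical star graph: one hub connected to three peripheral nodes.
toy_edges = [("hub", "a"), ("hub", "b"), ("hub", "c")]
bc, deg = betweenness_and_degree(toy_edges)
node_size = rescale(bc, smallest=10, largest=60)
font_size = rescale(deg, smallest=8, largest=24)
print(node_size["hub"], font_size["hub"])
```

In the toy graph every shortest path between peripheral nodes passes through the hub, so the hub receives the largest node and the largest label.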

Supplementary Figure S3. PPI subnetwork module derived with Cytoscape app MCODE from
the PPI network of 1775 DEG-DMGs. Node colors from yellow to red map the discrete scale of
logarithm base 2 of fold changes in gene expression for the corresponding gene: yellow: less than or
equal to -6, …, light-green: (-2, -1], …, cyan: (2, 3], … blue: (4, 5], …, red: 10 or more.

Supplementary Figure S4. Sub-networks derived with K-means clustering from the subset of
285 network-related DMGs.

Supplementary Figure S5. Network enrichment analysis on KEGG pathways for module
derived with Cytoscape app MCODE from the PPI network of 1775 DEG-DMGs.

Supplementary Figure S6. PPI networks on the set of 191 network-related DEG-DMGs. The
PPI network was built with Cytoscape 11,12 from a subset of 191 DEG-DMGs previously
obtained by applying network-based enrichment analysis 51. Nodes with the same color belong
to the same cluster obtained by K-means clustering.

Supplementary Figure S7. Association between methylation and gene expression. A,
Spearman's rank correlation rho between the absolute value of the logarithm base 2 of fold
change (log2FC) in gene expression at DEG-DMGs and the differences in methylation densities
(between control and patient groups). All correlations are statistically significant (p-value less
than 0.001). The variables analyzed are the absolute differences (∆) of: the density of
methylation levels, the density of the difference of methylation levels (TV), TV
with Bayesian correction, and the density of Hellinger divergence of methylation levels.
B, D, and F panels show two-dimensional kernel density estimates (2D-KDE) of the joint probability
distribution for each annotated pair of variables in the coordinate axes from the contour-plot
plane (see main text for variable description). C and E panels: Farlie-Gumbel-Morgenstern
(FGM) copula joint probability distribution built from the estimated marginal distributions
(XZ plane: Gamma probability distribution and YZ plane: generalized gamma distribution).
Together, panels A to F indicate that, in the current study of patients with PALL, methylation
and gene expression are not statistically independent, but are associated through a highly
statistically significant linear trend, located with high joint probability in the red regions
outlined in the contour plots.
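The two statistical tools named in this caption can be sketched in a few lines of Python. The block below computes Spearman's rank correlation with average ranks for ties and evaluates the FGM copula, C(u, v) = uv[1 + θ(1 − u)(1 − v)] with −1 ≤ θ ≤ 1, together with its density c(u, v) = 1 + θ(1 − 2u)(1 − 2v). It is an illustration rather than the analysis code behind the figure: the function names and data are hypothetical, and exponential marginals (a Gamma distribution with shape 1) stand in for the Gamma and generalized gamma marginals so that only the standard library is needed.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation of the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def fgm_copula_cdf(u, v, theta):
    """FGM copula C(u, v) for u, v in [0, 1] and theta in [-1, 1]."""
    if not -1.0 <= theta <= 1.0:
        raise ValueError("theta must lie in [-1, 1]")
    return u * v * (1.0 + theta * (1.0 - u) * (1.0 - v))


def fgm_copula_density(u, v, theta):
    """FGM copula density c(u, v)."""
    if not -1.0 <= theta <= 1.0:
        raise ValueError("theta must lie in [-1, 1]")
    return 1.0 + theta * (1.0 - 2.0 * u) * (1.0 - 2.0 * v)


def fgm_joint_density(x, y, theta, rate_x, rate_y):
    """Joint density f(x) g(y) c(F(x), G(y)) with exponential stand-in marginals."""
    fx, fy = rate_x * math.exp(-rate_x * x), rate_y * math.exp(-rate_y * y)
    big_fx, big_fy = 1.0 - math.exp(-rate_x * x), 1.0 - math.exp(-rate_y * y)
    return fx * fy * fgm_copula_density(big_fx, big_fy, theta)


# Hypothetical paired measurements with a monotone increasing relationship.
print(spearman_rho([0.5, 1.2, 2.0, 3.1], [0.1, 0.4, 0.9, 2.5]))
print(fgm_joint_density(1.0, 0.5, theta=0.8, rate_x=1.0, rate_y=2.0))
```

Setting θ = 0 reduces the joint density to the product of the marginals, that is, to independence. Spearman's rho under the FGM copula equals θ/3, so this family can only represent correlations between −1/3 and 1/3.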
Supplementary Tables
Supplementary Table S1: Excel files containing Tables S1-A to S1-G.

Supplementary Table S2: Excel files containing Tables S2-A to S2-I.

Supplementary File S1. Zip file containing the wig files with tracks for the group means of the
differences of methylation levels between each group and the reference group (four independent
normal CD19+ blood cell donors): control (four normal CD19+ blood cell donors) versus
reference, and patients (ALL cells from three patients) versus reference.