De novo detection and accurate inference of differentially methylated regions
Keegan Korthauer, PhD
Joint Statistical Meetings, Vancouver, CA, 29 July 2018
Whole Genome Bisulfite Sequencing (WGBS)
Bisulfite sequencing reads
Reference genome
Methylation Sequencing Data
Legend: Methylated CpG, Unmethylated CpG

                       CpG 1   CpG 2   CpG 3   CpG 4   CpG 5
Methylated Count (M)   0       3       3       1       0
Coverage (N)           1       3       4       4       2
Proportion (M/N)       0       1       0.75    0.25    0
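The proportions in the table can be reproduced with a short sketch. The function name and the choice to return None for uncovered CpGs are illustrative assumptions, not part of the talk.

```python
# Per-CpG counts copied from the table above.
methylated = [0, 3, 3, 1, 0]  # M: reads supporting methylation
coverage = [1, 3, 4, 4, 2]    # N: total reads covering the CpG


def methylation_proportions(m_counts, n_counts):
    """Return M/N per CpG; None where a CpG has no coverage (assumed convention)."""
    return [m / n if n > 0 else None for m, n in zip(m_counts, n_counts)]


proportions = methylation_proportions(methylated, coverage)
print(proportions)  # [0.0, 1.0, 0.75, 0.25, 0.0]
```

Note that CpG 1 and CpG 5 both have proportion 0 but rest on 1 and 2 reads respectively, which is why coverage has to be carried along with the proportion.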
Extract DNA from many cells
Differentially Methylated Regions (DMRs)
[Figure: methylation level (proportion) versus genomic location for two example regions, Chromosome 8: 31,442,644 – 39,442,643 and Chromosome 1: 235,431,162 – 243,431,161, showing EBV-induced hypomethylated blocks. Adapted from Hansen et al. 2014, Genome Research.]
Previous methods: Grouping significant CpGs
Examples:
− BSmooth (Hansen et al., 2012)
− DSS (Feng et al., 2014; Wu et al., 2015)
[Figure: results from testing individual CpGs. Methylation difference (centered at 0) versus genomic location, with each CpG marked Significant or Not Significant.]
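The grouping step can be sketched as follows. This is a generic illustration of merging neighboring significant CpGs into regions, not the BSmooth or DSS implementation; `max_gap` and `min_cpgs` are hypothetical parameters chosen for the example.

```python
def group_significant_cpgs(positions, significant, max_gap=300, min_cpgs=2):
    """Merge runs of significant CpGs into (start, end, n_cpgs) regions.

    positions must be sorted. max_gap and min_cpgs are illustrative
    defaults, not values taken from any published method.
    """
    regions, current = [], []
    for pos, sig in zip(positions, significant):
        if sig and current and pos - current[-1] <= max_gap:
            current.append(pos)
            continue
        # Close the open run: a non-significant CpG or too large a gap.
        if len(current) >= min_cpgs:
            regions.append((current[0], current[-1], len(current)))
        current = [pos] if sig else []
    if len(current) >= min_cpgs:
        regions.append((current[0], current[-1], len(current)))
    return regions


positions = [100, 150, 220, 900, 950, 1000, 2500, 2600]
significant = [True, True, False, True, True, True, False, True]
regions = group_significant_cpgs(positions, significant)
print(regions)  # [(100, 150, 2), (900, 1000, 3)]
```

The significance calls here are made per CpG, so nothing in this procedure states how often a whole reported region is a false discovery.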
Error rate not controlled at the region level
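[Figure: methylation difference (centered at 0) versus genomic location. Individual CpGs are marked True Positive (TP), False Positive (FP), or True Negative (TN), and the called regions are labeled DMR.]
False Discovery Rate: $\mathrm{FDR} = E\left[\frac{FP}{TP + FP}\right]$
$\mathrm{FDR}_{\mathrm{CpG}} = \frac{2}{8} = 0.25$
$\mathrm{FDR}_{\mathrm{DMR}} = \frac{1}{2} = 0.50$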
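A small sketch reproduces the two numbers. The layout is an assumption that matches the 2/8 and 1/2 above: six true-positive CpGs forming one true DMR and two false-positive CpGs forming one false DMR. The function computes the realized false discovery proportion, of which the FDR is the expectation.

```python
def false_discovery_proportion(is_true_signal):
    """FP / (TP + FP) among the discoveries; 0.0 when nothing was called."""
    if not is_true_signal:
        return 0.0
    false_pos = sum(1 for truth in is_true_signal if not truth)
    return false_pos / len(is_true_signal)


# Assumed layout: 8 significant CpGs, 2 of them false positives.
cpg_calls = [True] * 6 + [False] * 2
# The same calls grouped into 2 regions, 1 of them a false DMR.
region_calls = [True, False]

fdp_cpg = false_discovery_proportion(cpg_calls)
fdp_dmr = false_discovery_proportion(region_calls)
print(fdp_cpg, fdp_dmr)  # 0.25 0.5
```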
dmrseq: (1) Detect de novo candidate regions
− Genome-wide scan of CpG methylation difference
− Local likelihood smoother with coverage weights
− Lower bound for biological significance
[Figure: methylation difference (centered at 0) versus genomic location, with Candidate Regions highlighted.]
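The candidate-region step might be sketched like this. A coverage-weighted tricube kernel average stands in for the local likelihood smoother, which is a simplification; `bandwidth`, `cutoff`, and `min_cpgs` are hypothetical parameters, and none of this is the dmrseq code.

```python
def smooth_differences(positions, diffs, coverage, bandwidth=500):
    """Coverage-weighted tricube kernel average of per-CpG differences."""
    smoothed = []
    for p0 in positions:
        num = den = 0.0
        for p, d, n in zip(positions, diffs, coverage):
            u = abs(p - p0) / bandwidth
            if u < 1:
                w = n * (1 - u ** 3) ** 3  # more reads, more weight
                num += w * d
                den += w
        smoothed.append(num / den if den > 0 else 0.0)
    return smoothed


def candidate_regions(positions, smoothed, cutoff=0.10, min_cpgs=3):
    """Runs of consecutive CpGs whose smoothed difference exceeds the
    cutoff in the same direction, as (start, end, sign) tuples."""
    regions, run, run_sign = [], [], 0
    for pos, s in zip(positions, smoothed):
        sign = (s > cutoff) - (s < -cutoff)
        if sign != 0 and sign == run_sign:
            run.append(pos)
            continue
        if len(run) >= min_cpgs:
            regions.append((run[0], run[-1], run_sign))
        run, run_sign = ([pos], sign) if sign != 0 else ([], 0)
    if len(run) >= min_cpgs:
        regions.append((run[0], run[-1], run_sign))
    return regions


positions = [100, 200, 300, 400, 500, 600, 700]
example_smoothed = [0.02, 0.15, 0.20, 0.18, 0.05, -0.2, -0.3]
candidates = candidate_regions(positions, example_smoothed)
print(candidates)  # [(200, 400, 1)]
```

The cutoff plays the role of the lower bound for biological significance: regions are proposed from effect size alone, and significance is assessed afterwards.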
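dmrseq: (2) Assess region-level signal
[Figure: candidate regions in the observed data and in Permutation 1 and Permutation 2. The permutation (null) region statistics are pooled into one distribution, Null Region Statistics (Pooled), against which the Observed Statistic is compared.]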
− Formulate region-level summary statistic
− Compare region statistics against null permutation distribution to evaluate significance
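The comparison against the pooled null can be sketched as an empirical p-value. The two-sided use of absolute values and the (b + 1) / (B + 1) correction are common conventions assumed here for illustration; the slides do not give the exact formula.

```python
import bisect


def permutation_pvalues(observed_stats, null_stats):
    """Empirical p-value of each observed region statistic against the
    pooled null statistics, using (b + 1) / (B + 1) so no p-value is 0."""
    pooled = sorted(abs(s) for s in null_stats)
    total = len(pooled)
    pvals = []
    for stat in observed_stats:
        # Count pooled null statistics at least as extreme as the observed one.
        b = total - bisect.bisect_left(pooled, abs(stat))
        pvals.append((b + 1) / (total + 1))
    return pvals


# Made-up region statistics from two permutations, pooled into one null.
permutation_1 = [0.5, -1.2, 2.0]
permutation_2 = [-0.3, 1.1, 0.9]
observed = [3.0, 1.0]
pvals = permutation_pvalues(observed, permutation_1 + permutation_2)
print(pvals)  # [0.14285714285714285, 0.5714285714285714]
```

Pooling across all null regions is what lets a handful of permutations yield a usable null distribution even with two or three samples per condition.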
Region-level modeling
CpG level: $M_{ijr} \mid N_{ijr}, p_{ijr} \sim \mathrm{Bin}(N_{ijr}, p_{ijr})$, $p_{ijr} \sim \mathrm{Beta}(a_{irs}, b_{irs})$
$\pi_{irs} = \frac{a_{irs}}{a_{irs} + b_{irs}}$
$M_{ijr}$ = methylated read count, $N_{ijr}$ = total coverage, $p_{ijr}$ = methylation proportion, $\pi_{irs}$ = methylation proportion for condition $s$
Region level: $g(\pi_r) = X\beta_r = \sum_{l=1}^{L_r} \beta_{0lr} 1[i = l] + X_j \beta_{1r}$
$\beta_{0lr}$ = loci-specific intercept, $\beta_{1r}$ = condition effect
$H_0: \beta_{1r} = 0$
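The CpG-level model can be simulated in a few lines. The shape parameters a = 8 and b = 2, the coverages, and the seed are made-up values for illustration; the binomial draw is written as a sum of Bernoulli trials to stay within the standard library.

```python
import random


def mean_methylation(a, b):
    """pi = a / (a + b), the mean of a Beta(a, b) methylation proportion."""
    return a / (a + b)


def simulate_counts(coverages, a, b, rng):
    """Draw p ~ Beta(a, b) and then M ~ Bin(N, p) for each coverage N."""
    counts = []
    for n in coverages:
        p = rng.betavariate(a, b)
        m = sum(1 for _ in range(n) if rng.random() < p)
        counts.append(m)
    return counts


rng = random.Random(1)  # fixed seed so the sketch is repeatable
coverages = [10, 25, 40]
counts = simulate_counts(coverages, a=8.0, b=2.0, rng=rng)
pi = mean_methylation(8.0, 2.0)
print(pi, counts)
```

Because p is redrawn for every sample, the counts vary more than a plain binomial with proportion pi would allow, which is the between-sample variability the model is meant to capture.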
Region-level model fitting
Generalized Least Squares (GLS) with variance stabilizing transformation:
Arcsine link transformation (Park & Wu 2016): $Z_{ijr} = \arcsin(2 M_{ijr}/N_{ijr} - 1)$
$\mathrm{Var}(M_{ijr}/N_{ijr}) \propto p_{ijr}(1 - p_{ijr})$ (variance depends on mean), but $\mathrm{Var}(Z_{ijr}) \approx \frac{1 + (N_{ijr} - 1)\phi_{irs}}{N_{ijr}}$ (variance independent of mean)
$Z_r = X\beta_r + \epsilon_r$, where $E[\epsilon_r] = 0$ and $\mathrm{Var}[\epsilon_r] = V_r$
$\hat{\beta}_r = (X^T V_r^{-1} X)^{-1} X^T V_r^{-1} Z_r$
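A minimal sketch of the fitting step, under a strong simplifying assumption: V_r is taken to be diagonal with entries 1/N, so correlation between CpGs and the dispersion term are ignored. The counts are made up, and the helper names are hypothetical.

```python
import math


def arcsine_transform(m, n):
    """Z = arcsin(2 M / N - 1)."""
    return math.asin(2.0 * m / n - 1.0)


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    k = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, k):
            f = aug[r][col] / aug[col][col]
            for c in range(col, k + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * k
    for r in reversed(range(k)):
        s = sum(aug[r][c] * x[c] for c in range(r + 1, k))
        x[r] = (aug[r][k] - s) / aug[r][r]
    return x


def gls_diagonal(x_rows, z, variances):
    """beta_hat = (X^T V^-1 X)^-1 X^T V^-1 Z for a diagonal V (assumption)."""
    k = len(x_rows[0])
    xtvx = [[0.0] * k for _ in range(k)]
    xtvz = [0.0] * k
    for row, zi, v in zip(x_rows, z, variances):
        for a in range(k):
            xtvz[a] += row[a] * zi / v
            for b in range(k):
                xtvx[a][b] += row[a] * row[b] / v
    return solve(xtvx, xtvz)


# Two CpGs (loci-specific intercepts) and four samples; the last design
# column is the condition indicator X_j. Counts are made up.
meth = [[2, 3, 8, 9], [1, 2, 7, 9]]
cov = [[10, 10, 10, 10], [10, 10, 10, 10]]
cond = [0, 0, 1, 1]
x_rows, z, variances = [], [], []
for i in range(2):
    for j in range(4):
        x_rows.append([float(i == 0), float(i == 1), float(cond[j])])
        z.append(arcsine_transform(meth[i][j], cov[i][j]))
        variances.append(1.0 / cov[i][j])
beta_hat = gls_diagonal(x_rows, z, variances)
print(beta_hat[2])  # estimated condition effect on the arcsine scale
```

A full GLS fit would replace the diagonal V with the structured covariance for the region; only the construction of V changes, not the estimator.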
Account for variability across samples and locations
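(1) Correlation: Continuous Autoregressive (CAR) model
$\rho(\epsilon_{ijr}, \epsilon_{kjr}) = e^{-\phi_r |t_{ir} - t_{kr}|}$, where $t_{ir}$ = genomic location of CpG $i$
(2) Variability dependent on coverage
$\mathrm{Var}(\epsilon_{ijr}) \propto \frac{1}{N_{ijr}}$
(3) Within sample correlation
Independent samples: $\mathrm{Cov}(\epsilon_{ijr}, \epsilon_{ij^*r}) = 0$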
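Points (1) and (2) combine into a covariance block for one sample, sketched below. The proportionality constant `sigma2`, the value of `phi`, and the locations and coverages are illustrative assumptions. Under point (3), the full covariance for a region is block-diagonal with one such block per sample.

```python
import math


def car_covariance(locations, coverage, phi, sigma2=1.0):
    """Covariance block for one sample:
    V[i][k] = sigma2 * exp(-phi * |t_i - t_k|) / sqrt(N_i * N_k),
    so Var = sigma2 / N_i and Cor = exp(-phi * |t_i - t_k|)."""
    size = len(locations)
    v = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for k in range(size):
            rho = math.exp(-phi * abs(locations[i] - locations[k]))
            v[i][k] = sigma2 * rho / math.sqrt(coverage[i] * coverage[k])
    return v


locations = [100, 150, 400]  # genomic positions t_i (made up)
coverage = [4, 9, 16]        # read coverage N_i (made up)
V = car_covariance(locations, coverage, phi=0.01)
print(V[0][0])  # 0.25, i.e. 1 / N for the first CpG
```

Correlation decays with genomic distance rather than with index, so unevenly spaced CpGs are handled without binning.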
Example: highly ranked DMR across all methods
[Figure: methylation (axis 0.2 to 0.8) across three example regions, with per-method statistics. Korthauer et al., 2018]
(A) chr5: 135,027,359 − 135,030,207 (width = 2,849); samples: Left Ventricle, Small Intestine. dmrseq: qval=0.005; BSmooth: area rank=0.001, mean rank=0; DSS: area rank=0, mean rank=0.012; Metilene: qval=0
(B) chr16: 85,551,503 − 85,552,644 (width = 1,142); samples: Right Ventricle, Sigmoid Colon. dmrseq: qval=0.722; BSmooth: area rank=0.383, mean rank=0.029; DSS: area rank=0.118, mean rank=0.185; Metilene: qval=0.004
(C) chr10: 119,671,164 − 119,671,495 (width = 332); samples: Left Ventricle, Small Intestine. dmrseq: qval=0.51; BSmooth: area rank=0.01, mean rank=0.349; DSS: area rank=0.014, mean rank=0.474; Metilene: qval=1
Example: dmrseq accounts for sample variability
[Figure: the three example regions with their per-method statistics: (A) chr5: 135,027,359 − 135,030,207, dmrseq qval=0.005; (B) chr16: 85,551,503 − 85,552,644, dmrseq qval=0.722; (C) chr10: 119,671,164 − 119,671,495, dmrseq qval=0.51. Korthauer et al., 2018]
Accurate FDR control in simulation
[Figure: (B) FDR control by dmrseq. Observed FDR level versus specified FDR α level (both axes 0.0 to 0.4), with panels for the 2 vs 2 Comparison and the 3 vs 3 Comparison (Simulations D2 and D3).]
High sensitivity and specificity in simulation
[Figure: power versus observed FDR level (both axes 0.00 to 1.00) for BSmooth, dmrseq, DSS, and metilene. Panels: (A) Simulation D2, 2 vs 2 Comparison; (B) Simulation D3, 3 vs 3 Comparison.]
Significant DMRs enriched for biological signal
[Figure: expression change versus methylation change, with axis annotations "Methylation Increase →" and "← Expression Decrease".]
Data from Ford et al., 2017, bioRxiv
Increased biological signal in dmrseq DMRs
[Figure: percent of genes repressed (axis 80% to 100%) versus DMR rank (50, 100, 250, 500, 1000, 1500, 2000, 3000, 4000, 5000) for dmrseq and DSS.]
Data from Ford et al., 2017, bioRxiv
Summary
- dmrseq identifies and prioritizes DMRs from bisulfite sequencing experiments
- Models signal at the region level in order to account for sample and spatial variability
- Achieves accurate False Discovery Rate control by generating a null distribution that pools information across the genome
- Detailed in “Detection and accurate False Discovery Rate control of differentially methylated regions from Whole Genome Bisulfite Sequencing” (Biostatistics, 2018)
- dmrseq R package available on Bioconductor
Acknowledgements
Dana-Farber/Harvard Chan
Rafael Irizarry
Claire Duvallet
Stephanie Hicks
Patrick Kimes
Yered Pita-Juarez
Alejandro Reyes
Chinmay Shukla
Mingxiang Teng
Collaborators
Sutirtha Chakraborty
Yuval Benjamini
Contact
keegan@jimmy.harvard.edu
keegankorthauer
kkorthauer.org