De novo detection and accurate inference of differentially methylated regions Keegan Korthauer, PhD Joint Statistical Meetings, Vancouver, CA 29 July 2018


Page 2

Whole Genome Bisulfite Sequencing (WGBS)

Figure: extract DNA from many cells; bisulfite sequencing reads shown against the reference genome, covering CpG 1 to CpG 5 (legend: Methylated CpG, Unmethylated CpG).

Methylation Sequencing Data

                        CpG 1   CpG 2   CpG 3   CpG 4   CpG 5
Methylated Count (M)    0       3       3       1       0
Coverage (N)            1       3       4       4       2
Proportion (M/N)        0       1       0.75    0.25    0
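As a minimal sketch of the summary on this slide, the per-CpG proportion is the methylated count divided by the coverage. The variable names are illustrative, and returning None for a CpG with zero coverage is an assumption rather than something the slide specifies.

```python
# Counts for CpG 1..5, copied from the table on this slide.
methylated = [0, 3, 3, 1, 0]   # M: reads supporting methylation
coverage = [1, 3, 4, 4, 2]     # N: total reads covering the CpG


def methylation_proportion(m, n):
    """Return M/N, or None when the CpG has no coverage (assumed convention)."""
    return m / n if n > 0 else None


proportions = [methylation_proportion(m, n) for m, n in zip(methylated, coverage)]
print(proportions)
```

Note that CpG 1 and CpG 5 both have proportion 0, but they rest on 1 and 2 reads respectively, which is why coverage is carried along with the proportion.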

Page 3

Differentially Methylated Regions (DMRs)

Figure: methylation level (proportion) versus genomic location for two regions, Chromosome 8: 31,442,644 – 39,442,643 and Chromosome 1: 235,431,162 – 243,431,161.

Adapted from Hansen et al. 2014, Genome Research

Page 4

Previous methods: Grouping significant CpGs

Results from testing individual CpGs:

Figure: methylation difference versus genomic location, with each CpG marked Significant or Not Significant.

Examples:
− BSmooth (Hansen et al., 2012)
− DSS (Feng et al., 2014; Wu et al., 2015)
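The grouping idea can be sketched as follows: test each CpG on its own, then merge neighbouring significant CpGs into a region. This is an illustration only, not the BSmooth or DSS implementation; `max_gap` and `min_cpgs` are hypothetical parameters chosen for the example.

```python
def group_significant_cpgs(positions, significant, max_gap=300, min_cpgs=3):
    """Merge runs of significant CpGs into regions.

    positions:   sorted genomic coordinates of the CpGs
    significant: parallel list of booleans from per-CpG tests
    max_gap:     hypothetical largest distance allowed between neighbours
    min_cpgs:    hypothetical smallest number of CpGs kept as a region
    """
    regions, current = [], []
    for pos, sig in zip(positions, significant):
        # A non-significant CpG or a large gap closes the open run.
        if current and (not sig or pos - current[-1] > max_gap):
            if len(current) >= min_cpgs:
                regions.append((current[0], current[-1]))
            current = []
        if sig:
            current.append(pos)
    if len(current) >= min_cpgs:
        regions.append((current[0], current[-1]))
    return regions


positions = [100, 150, 210, 260, 900, 950, 1000, 1040, 1500]
significant = [True, True, True, False, True, True, True, True, True]
regions = group_significant_cpgs(positions, significant)
print(regions)
```

The regions inherit their boundaries from per-CpG decisions, so any error guarantee attached to those decisions applies to CpGs, not to the regions that come out.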

Page 5

Error rate not controlled at the region level

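Figure: methylation difference versus genomic location, with CpGs labeled True Positive (TP), False Positive (FP), or True Negative (TN), and the DMR marked.

False Discovery Rate: $\mathrm{FDR} = E\left[\frac{FP}{TP + FP}\right]$

$\widehat{\mathrm{FDR}}_{\mathrm{CpG}} = \frac{2}{8} = 0.25$ vs. $\widehat{\mathrm{FDR}}_{\mathrm{DMR}} = \frac{1}{2} = 0.50$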
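The gap between the two error rates is plain arithmetic. The sketch below uses the counts behind the slide's two estimates (2 false among 8 significant CpGs, 1 false among 2 called regions); the true-positive counts 6 and 1 are inferred from those ratios, and the function name is illustrative.

```python
def empirical_fdr(false_positives, true_positives):
    """Proportion of discoveries that are false; 0.0 when nothing is called."""
    discoveries = false_positives + true_positives
    return false_positives / discoveries if discoveries else 0.0


# CpG level: 8 significant CpGs, 2 of them false.
fdr_cpg = empirical_fdr(false_positives=2, true_positives=6)
# Region level: 2 called regions, 1 of them false.
fdr_dmr = empirical_fdr(false_positives=1, true_positives=1)
print(fdr_cpg, fdr_dmr)
```

The same set of calls gives 0.25 when the unit is a CpG and 0.50 when the unit is a region, because a false region built from few CpGs counts as much as a true region built from many.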

Page 6

dmrseq: (1) Detect de novo candidate regions

Figure: genome-wide scan of CpG methylation difference versus genomic location. Annotations: Local likelihood smoother with coverage weights; Lower bound for biological significance; Candidate Regions.
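A rough sketch of the candidate-region step, under stated assumptions: a coverage-weighted moving average stands in for the local likelihood smoother named on the slide, and `bandwidth`, `cutoff`, and `min_cpgs` are hypothetical parameters. It is not the dmrseq implementation.

```python
def smooth_differences(positions, diffs, coverage, bandwidth=100):
    """Coverage-weighted average of differences within `bandwidth` bases.

    Stand-in for a local likelihood smoother (assumption of this sketch).
    """
    smoothed = []
    for center in positions:
        num = den = 0.0
        for pos, d, w in zip(positions, diffs, coverage):
            if abs(pos - center) <= bandwidth:
                num += w * d
                den += w
        smoothed.append(num / den if den else 0.0)
    return smoothed


def candidate_regions(positions, smoothed, cutoff=0.15, min_cpgs=3):
    """Group consecutive CpGs whose smoothed difference exceeds the lower
    bound `cutoff` in absolute value and shares the same sign."""
    regions, run, run_sign = [], [], 0
    for pos, s in zip(positions, smoothed):
        sign = (s > cutoff) - (s < -cutoff)  # +1, -1, or 0 (below the bound)
        if sign != run_sign:
            if run_sign != 0 and len(run) >= min_cpgs:
                regions.append((run[0], run[-1]))
            run, run_sign = [], sign
        if sign != 0:
            run.append(pos)
    if run_sign != 0 and len(run) >= min_cpgs:
        regions.append((run[0], run[-1]))
    return regions


positions = [100, 200, 300, 400, 500, 600, 700, 800]
diffs = [0.0, 0.0, 0.3, 0.4, 0.4, 0.3, 0.0, 0.0]
coverage = [10, 12, 8, 15, 9, 11, 10, 13]
smoothed = smooth_differences(positions, diffs, coverage)
regions = candidate_regions(positions, smoothed)
print(regions)
```

No significance test is involved at this stage; the cutoff only screens out differences too small to matter, and inference is deferred to the region level.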

Page 7

dmrseq: (2) Assess region-level signal

− Formulate region-level summary statistic

− Compare region statistics against null permutation distribution to evaluate significance

Figure: candidate regions in the Observed data and in Permutation 1 and Permutation 2 (Null); the Observed Statistic is compared against the Null Region Statistics (Pooled).
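The comparison against the pooled null can be sketched as an empirical p-value. This is a minimal illustration with hypothetical statistics; the two-sided comparison and the +1 correction are common conventions assumed here, not details taken from the slides.

```python
def permutation_p_value(observed, null_stats):
    """Share of pooled null statistics at least as extreme as `observed`.

    The +1 in numerator and denominator keeps the p-value above zero;
    it is an assumed convention for this sketch.
    """
    extreme = sum(1 for s in null_stats if abs(s) >= abs(observed))
    return (extreme + 1) / (len(null_stats) + 1)


# Hypothetical region statistics pooled over all permutations.
null_stats = [0.4, -1.1, 0.9, 2.3, -0.2, 1.7, -0.6, 0.1, 1.2]
p_strong = permutation_p_value(4.8, null_stats)
p_weak = permutation_p_value(1.0, null_stats)
print(p_strong, p_weak)
```

Pooling null regions from every permutation gives a much larger reference set than the number of distinct permutations alone, which matters when only a few samples per condition are available.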

Page 8

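Region-level modeling

CpG level:
$$M_{ijr} \mid N_{ijr}, p_{ijr} \sim \mathrm{Bin}(N_{ijr}, p_{ijr}), \qquad p_{ijr} \sim \mathrm{Beta}(a_{irs}, b_{irs})$$
$$\pi_{irs} = \frac{a_{irs}}{a_{irs} + b_{irs}}$$

$M_{ijr}$ = methylated read count
$N_{ijr}$ = total coverage
$p_{ijr}$ = methylation proportion
$\pi_{irs}$ = methylation proportion for condition $s$

Region level:
$$g(\pi_r) = X\beta_r = \underbrace{\sum_{l=1}^{L_r} \beta_{0lr}\, 1_{[i = l]}}_{\text{loci-specific intercept}} + \underbrace{X_j \beta_{1r}}_{\text{condition effect}}$$

$$H_0: \beta_{1r} = 0$$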
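To make the CpG-level hierarchy concrete, the sketch below simulates a methylated count by first drawing a proportion from a Beta distribution and then drawing reads from a Binomial. The parameter values, seed, and function name are hypothetical and serve only to illustrate the two-stage structure.

```python
import random


def simulate_methylated_count(n_reads, a, b, rng):
    """Draw p ~ Beta(a, b), then M | N, p ~ Bin(N, p)."""
    p = rng.betavariate(a, b)
    # Binomial draw written as a sum of Bernoulli trials.
    return sum(1 for _ in range(n_reads) if rng.random() < p)


rng = random.Random(2018)        # fixed seed so the sketch is repeatable
a, b = 6.0, 2.0                  # hypothetical Beta parameters
pi = a / (a + b)                 # mean methylation proportion, here 0.75
coverage = [12, 8, 20, 15]       # hypothetical N for four samples
counts = [simulate_methylated_count(n, a, b, rng) for n in coverage]
print(pi, counts)
```

Because each sample gets its own draw of p, the counts vary more across samples than a single Binomial with p fixed at 0.75 would allow; that extra spread is the biological variability the Beta layer represents.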

Page 9

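Region-level model fitting

Generalized Least Squares (GLS) with variance stabilizing transformation:

arcsine link transformation (Park & Wu 2016)
$$Z_{ijr} = \arcsin(2 M_{ijr}/N_{ijr} - 1)$$

Variance depends on mean: $\mathrm{Var}(M_{ijr}/N_{ijr}) \propto \pi_{ijr}(1 - \pi_{ijr})$
but
Variance independent of mean: $\mathrm{Var}(Z_{ijr}) \approx \frac{1 + (N_{ijr} - 1)\phi_{irs}}{N_{ijr}}$

$$Z_r = X\beta_r + \epsilon_r, \quad \text{where } E[\epsilon_r] = 0 \text{ and } \mathrm{Var}[\epsilon_r] = V_r$$
$$\hat{\beta}_r = (X^T V_r^{-1} X)^{-1} X^T V_r^{-1} Z_r$$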
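The arcsine transformation itself is a one-liner. The sketch below applies it to a few proportions at a fixed, hypothetical coverage of 4; the function name is illustrative.

```python
import math


def arcsine_transform(m, n):
    """Z = arcsin(2 * M / N - 1); maps proportions in [0, 1] to [-pi/2, pi/2]."""
    return math.asin(2.0 * m / n - 1.0)


# Proportions 0, 0.5, 0.75 and 1 at coverage 4.
z_values = [arcsine_transform(m, 4) for m in (0, 2, 3, 4)]
print(z_values)
```

A proportion of 0.5 maps to 0 and the extremes map to plus or minus pi/2, so the transformed scale is symmetric around a half-methylated CpG.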

Page 10

Account for variability across samples and locations

(1) Correlation: Continuous Autoregressive (CAR) model
$$\mathrm{Cor}(Z_{ijr}, Z_{kjr}) = e^{-\phi_r |t_{ir} - t_{kr}|}$$
$t_{ir}$ = genomic location of CpG $i$

(2) Variability dependent on coverage
$$\mathrm{Var}(Z_{ijr}) \propto \frac{1}{N_{ijr}}$$

(3) Within sample correlation
Independent samples: $\mathrm{Cov}(Z_{ijr}, Z_{ij^*r}) = 0$
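One way to picture how the first two ingredients fit together is to build the covariance matrix for a single sample within a region. This is a sketch under assumptions: correlation decays exponentially with genomic distance, each CpG's variance is proportional to 1/N, and the two are combined as sd_i * sd_k * correlation. The names `decay` and `scale` are hypothetical, and the construction is not taken from the dmrseq source.

```python
import math


def car_covariance(locations, coverage, decay, scale=1.0):
    """Covariance of transformed values for one sample within a region.

    Correlation between CpGs i and k is exp(-decay * |t_i - t_k|), and the
    variance of CpG i is scale / N_i. Multiplying standard deviations by the
    correlation is an assumption of this sketch.
    """
    sds = [math.sqrt(scale / n) for n in coverage]
    size = len(locations)
    return [
        [sds[i] * sds[k] * math.exp(-decay * abs(locations[i] - locations[k]))
         for k in range(size)]
        for i in range(size)
    ]


# Hypothetical region: four CpGs with their locations (bp) and coverage.
locations = [1000, 1050, 1200, 2000]
coverage = [10, 20, 5, 8]
V = car_covariance(locations, coverage, decay=0.002)
for row in V:
    print([round(v, 4) for v in row])
```

Since samples are treated as independent, the full covariance for a region would be block diagonal, with one such block per sample.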

Page 11

Example: highly ranked DMR across all methods

Figure: methylation (axis 0.2 to 0.8) versus genomic location for three example regions.

(A) chr5: 135,027,359 − 135,030,207 (width = 2,849); samples: Left Ventricle, Small Intestine
dmrseq: qval=0.005
BSmooth: area rank=0.001, mean rank=0
DSS: area rank=0, mean rank=0.012
Metilene: qval=0

(B) chr16: 85,551,503 − 85,552,644 (width = 1,142); samples: Right Ventricle, Sigmoid Colon
dmrseq: qval=0.722
BSmooth: area rank=0.383, mean rank=0.029
DSS: area rank=0.118, mean rank=0.185
Metilene: qval=0.004

(C) chr10: 119,671,164 − 119,671,495 (width = 332); samples: Left Ventricle, Small Intestine
dmrseq: qval=0.51
BSmooth: area rank=0.01, mean rank=0.349
DSS: area rank=0.014, mean rank=0.474
Metilene: qval=1

Korthauer et al., 2018

Page 12

Example: dmrseq accounts for sample variability

Figure: methylation (axis 0.2 to 0.8) versus genomic location for three example regions.

(A) chr5: 135,027,359 − 135,030,207 (width = 2,849); samples: Left Ventricle, Small Intestine
dmrseq: qval=0.005
BSmooth: area rank=0.001, mean rank=0
DSS: area rank=0, mean rank=0.012
Metilene: qval=0

(B) chr16: 85,551,503 − 85,552,644 (width = 1,142); samples: Right Ventricle, Sigmoid Colon
dmrseq: qval=0.722
BSmooth: area rank=0.383, mean rank=0.029
DSS: area rank=0.118, mean rank=0.185
Metilene: qval=0.004

(C) chr10: 119,671,164 − 119,671,495 (width = 332); samples: Left Ventricle, Small Intestine
dmrseq: qval=0.51
BSmooth: area rank=0.01, mean rank=0.349
DSS: area rank=0.014, mean rank=0.474
Metilene: qval=1

Korthauer et al., 2018

Page 13

Accurate FDR control in simulation

Figure: FDR control by dmrseq. Observed FDR level versus Specified FDR α level (both axes 0.0 to 0.4) for Simulation D2 and D3. Panels: 2 vs 2 Comparison, 3 vs 3 Comparison.

Page 14

High sensitivity and specificity in simulation

Figure: Power versus Observed FDR level (both axes 0.00 to 1.00). Panels: (A) Simulation D2, 2 vs 2 Comparison; (B) Simulation D3, 3 vs 3 Comparison. Methods: BSmooth, dmrseq, DSS, metilene.

Page 15

Significant DMRs enriched for biological signal

Figure: axes labeled "Methylation Increase →" and "← Expression Decrease".

Data from Ford et al., 2017, bioRxiv


Page 19

Increased biological signal in dmrseq DMRs

Figure: Percent of genes repressed (axis 80% to 100%) versus DMR Rank (50 to 5000). Methods: dmrseq, DSS.

Data from Ford et al., 2017, bioRxiv

Page 20

Summary

- dmrseq identifies and prioritizes DMRs from bisulfite sequencing experiments

- Models signal at the region level in order to account for sample and spatial variability

- Achieves accurate False Discovery Rate control by generating a null distribution that pools information across the genome

- Detailed in “Detection and accurate False Discovery Rate control of differentially methylated regions from Whole Genome Bisulfite Sequencing” (Biostatistics, 2018)

- dmrseq R package available on Bioconductor

Page 21

Acknowledgements

Dana-Farber/Harvard Chan
Rafael Irizarry
Claire Duvallet
Stephanie Hicks
Patrick Kimes
Yered Pita-Juarez
Alejandro Reyes
Chinmay Shukla
Mingxiang Teng

Collaborators
Sutirtha Chakraborty
Yuval Benjamini

Contact
[email protected]
keegankorthauer
kkorthauer.org