Removing Unwanted Variation from methylation array data… All you need is RUV! - Jovana Maksimovic



The power of RUV

To assess the performance of RUV, we need to know some “truth”.

• To define “truth”, we performed differential methylation analysis using limma on 450k ageing data (20 newborns vs. 20 centenarians; Heyn et al., PNAS, 2012 Jun 26;109(26):10522-7). This is the truth set (Fig. 5a).
• We then simulated a batch effect in the same data by modifying a random subset of 18 newborn and 4 centenarian samples.
• The batch effect was introduced by scaling the green channel intensities of the selected samples.
• The scaling was performed for every green channel intensity within a sample using values randomly sampled from a normal distribution (mean=10, SD=100). This was repeated for each of the selected samples, creating additional variation within the batch (e.g. sample quality differences), which cannot be modelled. This is the test set (Fig. 5c).
• We compared the performance of 2 RUV flavours (RUV-inverse, RUV-ridge inverse) to other adjustment methods in a differential methylation analysis of the test set.
• The competing methods were selected as they are routinely used to adjust for batch effects and other unwanted variation in differential analyses.
• Using a receiver operating characteristic (ROC) analysis (Fig. 6), both RUV flavours significantly outperform other methods.
• RUV-inverse with empirical controls performed the best (Fig. 6).
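The simulated batch effect described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the poster: it assumes the scaling is multiplicative, that a separate factor is drawn for each intensity, and that intensities are held as plain lists per sample. The function name `add_batch_effect`, the sample names and the toy values are all hypothetical.

```python
import random

def add_batch_effect(green, batch_samples, mean=10.0, sd=100.0, seed=0):
    """Return a copy of `green` with the chosen samples' intensities rescaled.

    green: dict mapping sample name -> list of green channel intensities.
    batch_samples: names of the samples placed in the simulated batch.
    Assumption: each intensity is multiplied by its own N(mean, sd) draw.
    """
    rng = random.Random(seed)
    out = {}
    for name, values in green.items():
        if name in batch_samples:
            out[name] = [v * rng.gauss(mean, sd) for v in values]
        else:
            out[name] = list(values)  # untouched samples are copied as-is
    return out

# Toy data: 3 samples, 4 intensities each (values are made up).
green = {
    "newborn_1": [1200.0, 800.0, 1500.0, 950.0],
    "newborn_2": [1100.0, 820.0, 1480.0, 990.0],
    "centenarian_1": [1300.0, 790.0, 1510.0, 900.0],
}
test_set = add_batch_effect(green, batch_samples={"newborn_1"})
```

Because each intensity gets its own random factor, the modified samples differ from one another as well as from the unmodified ones, which is the within-batch variation the poster describes.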

(2) MDS Plot: Batch effects in microarray data (batch 1 vs. batch 2). Adapted from Lazar C et al., Brief Bioinform, 2012.

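Y = M-values = log2(Meth. intensity / Unmeth. intensity)
X = factors of interest, e.g. disease state, treatment, etc.
Z = observed covariates, e.g. sex, ethnicity
W = unobserved unwanted factors (estimated using negative controls)

$$Y_{m \times n} = X_{m \times p}\,\beta_{p \times n} + Z_{m \times q}\,\gamma_{q \times n} + W_{m \times k}\,\alpha_{k \times n} + \varepsilon_{m \times n}$$

with m samples, n CpG probes, p factors of interest, q observed covariates and k unwanted factors.

(3) RUV Linear Model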
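One way to read the role of the negative controls in this model is sketched below. It is an illustration, not a statement of the exact RUV-inverse estimator. The subscript c, denoting the columns of Y that belong to the negative control probes, is notation introduced here. If those probes are unaffected by the factors of interest, their coefficients are zero, and (ignoring Z for simplicity) the model restricted to the control columns reduces to:

```latex
\beta_{c} = 0
\quad\Longrightarrow\quad
Y_{c} = W_{m \times k}\,\alpha_{c} + \varepsilon_{c}
```

Under this assumption the systematic variation seen in the control probes is attributable to W alone, so W can be estimated from the controls and then included alongside X in the differential comparison.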


Conclusion

1. Simply the best! RUV performs better in a differential methylation analysis of 450k array data than all other methods tested
2. RUV-inverse performs the best using an appropriately defined set of empirical controls
3. The RUV analysis pipeline is easily implemented in R for 450k data

Aim

1. Compare the new crazy little thing called RUV to existing methods for removing unwanted variation from 450k array data
2. Devise a “best practice” pipeline for removing unwanted variation from 450k data

What is RUV?

RUV = Remove Unwanted Variation

• A family of methods (Gagnon-Bartsch et al., manuscript in preparation) that extend the framework developed for RUV2 (Gagnon-Bartsch & Speed, Biostatistics, 2012 Jul;13(3):539-52) for correcting for unwanted variation
• The RUV methods use negative control genes/probes to infer unwanted factors from the data
• Negative controls should not be associated with the factor of interest, but should capture unwanted variation
• Adjustment is applied at the stage of the differential comparison via the linear modelling framework (Fig. 3)

Let’s get it started…

Illumina Infinium HumanMethylation450 BeadChip

• The 50bp Infinium methylation probes query a [C/T] polymorphism created by bisulfite conversion of unmethylated cytosines in the genome
• The 450k array covers 485,577 CpG sites
• Infinium I probe type (~25%) (Fig. 1a)
• Infinium II probe type (~75%) (Fig. 1b)

(1a) Infinium I: schematic of a DNA sample after bisulfite conversion, with separate methylated (M) and unmethylated (U) probes
(1b) Infinium II: schematic of a DNA sample after bisulfite conversion, with a single probe querying the G[C/T] site

The good…

450k arrays are popular for large differential methylation studies

• Genome-wide
• High-throughput
• Single-nucleotide resolution
• Relatively inexpensive compared to other methods, e.g. sequencing
• Great for large studies with many samples

…the bad & the ugly

Large studies are particularly susceptible to unwanted variation

• Good experimental design can mitigate effects of unwanted variation
• But, factors causing unwanted variation can be unknown
• And, unwanted factors can sometimes be the largest source of variation (Fig. 2)
• In large studies, batch effects are often unavoidable due to limitations on how many samples can be processed at any one time

Jovana Maksimovic (1), Terry Speed (2), Alicia Oshlack (1)
(1) Murdoch Childrens Research Institute, RCH, Melbourne, Australia; (2) Walter and Eliza Hall Institute, Melbourne, Australia

(4) RUV differential methylation analysis pipeline. Step 1: M-values and Illumina negative control probes are passed to RUV, giving a ranked list of DM CpGs. Step 2*: empirical controls drawn from that list are passed to RUV, giving a new ranked list of DM CpGs.
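The M-values that enter this pipeline are defined in the linear model as log2 of the methylated over the unmethylated intensity. A minimal sketch of that calculation follows. The function name `m_value`, the `offset` parameter and the toy intensities are assumptions made for illustration. The offset is added only to guard against division by zero and is not part of the poster's definition.

```python
import math

def m_value(meth, unmeth, offset=1.0):
    """M-value = log2(methylated intensity / unmethylated intensity).

    `offset` is an assumption added here to avoid division by zero and
    log of zero; it is not part of the poster's definition.
    """
    return math.log2((meth + offset) / (unmeth + offset))

# Toy intensities for three CpG probes in one sample (made-up values).
meth = [3000.0, 500.0, 1000.0]
unmeth = [500.0, 3000.0, 1000.0]
m_values = [m_value(m, u) for m, u in zip(meth, unmeth)]
```

Positive M-values indicate a mostly methylated site, negative values a mostly unmethylated one, and a value of zero means equal signal from both.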

That’s the way RUV goes

Using RUV in 450k differential methylation (DM) analysis

• RUV relies on negative controls to adjust for unwanted variation
• Illumina includes 614 negative control probes (neg) on the 450k array for background correction. We can use these as negative controls in RUV DM analysis (Fig. 4, Step 1) as they capture some technical variation between arrays (Fig. 5b and 5d)
• However, the “negs” only produce a background-level signal and cannot capture unwanted biological variation
• But we can use the ranked list of CpGs generated by RUV DM analysis with “negs” to empirically identify CpG probes not associated with the factor of interest (e.g. bottom 50% of list) that are a more representative set of negative controls (Fig. 4, Step 2). These are called empirical controls (emp)
• The results from RUV DM analysis with “emp” can then be used to further refine the set of empirical controls, if necessary (Fig. 4, Step 2)
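The selection of empirical controls described above can be sketched as follows. This is a minimal illustration that assumes the ranked list is available as a plain list of probe IDs, ordered from most to least evidence of differential methylation. The function name `empirical_controls` and the probe IDs are hypothetical.

```python
def empirical_controls(ranked_cpgs, fraction=0.5):
    """Return the least-associated CpGs from a ranked list.

    ranked_cpgs: probe IDs ordered from most to least evidence of
    differential methylation (e.g. the output of RUV step 1).
    fraction: share of the list, taken from the bottom, to keep.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    n_keep = int(len(ranked_cpgs) * fraction)
    # Slice from an explicit start index so that n_keep == 0 gives [].
    return ranked_cpgs[len(ranked_cpgs) - n_keep:]

# Hypothetical probe IDs, ranked from most to least differentially methylated.
ranked = ["cg01", "cg02", "cg03", "cg04", "cg05", "cg06"]
emp = empirical_controls(ranked)
```

The returned probes would then serve as the negative controls for the next RUV run, and the same function can be applied again to that run's ranked list if further refinement is needed.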

(5a) MDS Plot: Truth set
(5b) Truth set: neg. control probes
(5c) MDS Plot: Test set
(5d) Test set: neg. control probes
(6) ROC Curve: Performance of RUV vs. other methods (x-axis: False positive rate; y-axis: True positive rate)
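The ROC comparison can be sketched as follows, assuming each method yields a ranked list of CpGs from the test set and that the truth set is available as a set of probe IDs. This is an illustration of a standard ROC calculation, not the poster's own code. The function names `roc_points` and `auc` and the probe IDs are hypothetical.

```python
def roc_points(ranked_cpgs, truth):
    """Return (false positive rate, true positive rate) pairs.

    ranked_cpgs: probe IDs ordered from most to least significant in the
    test-set analysis. truth: set of probe IDs called differentially
    methylated in the truth set.
    """
    n_pos = sum(1 for cpg in ranked_cpgs if cpg in truth)
    n_neg = len(ranked_cpgs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one true and one false CpG")
    points = [(0.0, 0.0)]
    tp = fp = 0
    for cpg in ranked_cpgs:  # lower the cut-off one probe at a time
        if cpg in truth:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical example: 2 of the 4 probes are truly differentially methylated.
truth = {"cg01", "cg03"}
ranked = ["cg01", "cg02", "cg03", "cg04"]
points = roc_points(ranked, truth)
area = auc(points)
```

A method whose ranking places every truth-set CpG ahead of all others gives an area of 1, so curves closer to the top-left corner indicate better recovery of the truth set.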

* This step can be performed zero or more times