Post on 03-Jan-2016
Design of microarray gene expression profiling experiments
Peter-Bram ’t Hoen
2
Lay-out
• Practical considerations
• Pooling
• Randomization
• One-color vs Two-colors
• Two-color hybridization designs
• Ratio-based vs Intensity-based analysis
3
Think before you start
• research question
• choice of technology
• controls and replicates
Ref: Churchill. 2002. Nature Genetics Supplement 32: 490-495
4
Research question
• Limit your (initial) number of questions / conditions
• choose best timepoint for mRNA regulation
• can be different from protein/activity
• pilots using RT-qPCR
• experimental follow-up
• what will you do with the data?
• verification of differential gene expression
• in vitro experiments to study mechanism
• "in vivo" verification in tissue sections
5
Choice of technology
• What is affordable?
• Do a pilot to estimate the variance for your samples,
experimental set-up and platform
• Calculate your power: what is the smallest effect size that
you can detect?
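The power calculation in the last bullet can be sketched as follows. This is a minimal sketch, assuming a two-group comparison of log2 expression values with equal group sizes, a per-sample standard deviation taken from the pilot, and a normal approximation. The function name, the default significance level and the default power are illustrative assumptions. The sketch ignores the multiple-testing correction that a microarray experiment needs.

```python
from math import sqrt
from statistics import NormalDist


def smallest_detectable_effect(sd, n_per_group, alpha=0.05, power=0.80):
    """Smallest true difference in mean log2 expression between two groups
    that a two-sided z-test detects with the requested power.

    sd          -- per-sample standard deviation (log2 scale), from the pilot
    n_per_group -- number of biological replicates in each group
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # critical value of the test
    z_power = z.inv_cdf(power)           # quantile for the desired power
    standard_error = sd * sqrt(2 / n_per_group)
    return (z_alpha + z_power) * standard_error


if __name__ == "__main__":
    for n in (3, 4, 6, 12):
        delta = smallest_detectable_effect(sd=0.5, n_per_group=n)
        print(f"n = {n:2d} per group: log2 difference >= {delta:.2f}")
```

A stricter alpha (as needed when testing thousands of genes) raises the detectable effect size, so the values printed here are optimistic lower bounds.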
6
Controls
• positive: genes whose regulation is known
• check on biological experiment & data analysis
• positive: spikes in mRNA and/or hyb mix
• check labeling procedure and hybridization
• detection range (sensitivity) and dynamic range
• "landing lights" for gridding software
• negative controls: non-specific binding
• check cross-hybridization: buffer, non-homologous DNA
7
Spikes
Spiked 2-fold change (copies/cell)
RCA     Cab     rbcL    LTP4     LTP6
2:1     10:5    60:30   100:50   300:150
Spiked 3-fold change (copies/cell)
XCP2    RPC1    NAC1    TIM      PRK
3:1     15:5    60:20   150:50   300:100
[Figure: spikes are added to the test RNA and the reference RNA, followed by cDNA probe synthesis and hybridization to an array containing DNA controls]
8
Spikes
Van de Peppel et al. EMBO Reports 4, 387 (2003)
10
Replicates
• Include sufficient replicates, based on pilot experiment
• Biological replicates are preferred over technical replicates
• Control experimental variables with possible unintended
effects
• genetic background
• gender
• age
11
Randomization
• Randomize samples with respect to experimental influences
• experimenter
• day of hybridization
• batch of arrays
• dye
• etc
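One way to randomize samples over a nuisance factor such as the day of hybridization is sketched below. It is a minimal sketch, assuming that every experimental group should be spread as evenly as possible over the batches. The function name and the sample identifiers are hypothetical.

```python
import random
from collections import defaultdict


def randomize_over_batches(samples_by_group, n_batches, seed=None):
    """Assign samples to batches (e.g. hybridization days) at random while
    spreading every experimental group as evenly as possible over batches.

    samples_by_group -- dict mapping group name to a list of sample ids
    Returns a dict mapping batch number (0-based) to a list of sample ids.
    """
    rng = random.Random(seed)
    batches = defaultdict(list)
    for samples in samples_by_group.values():
        shuffled = samples[:]
        rng.shuffle(shuffled)                # random order within the group
        offset = rng.randrange(n_batches)    # random starting batch
        for i, sample in enumerate(shuffled):
            batches[(offset + i) % n_batches].append(sample)
    for members in batches.values():
        rng.shuffle(members)                 # random processing order
    return dict(batches)


if __name__ == "__main__":
    groups = {"wt": ["wt1", "wt2", "wt3", "wt4"],
              "ko": ["ko1", "ko2", "ko3", "ko4"]}
    for day, members in sorted(randomize_over_batches(groups, 2, seed=1).items()):
        print("day", day + 1, members)
```

The same function can be applied again with a different factor (experimenter, batch of arrays) if those should be balanced as well.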
12
Pooling
• Often done because of lack of sufficient amounts of RNA, but
good amplification protocols are available
• Advantages:
• dampening of individual variation, may increase statistical power
• Generally not recommended:
• outliers in the population may result in large and significant
effects
• information on the differences between individuals in the population
is lost, although it is probably biologically relevant
• in fact, it is an artificial way to increase the significance of your
findings
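The outlier problem can be illustrated with made-up numbers. This is a minimal sketch, assuming that a pool of equal RNA amounts yields the arithmetic mean of the individual expression levels. The expression values are hypothetical.

```python
from statistics import mean, median

# Hypothetical expression levels (arbitrary linear units) of one gene
control = [100, 110, 90, 105, 95]
treated = [100, 110, 90, 105, 1000]   # one outlier individual


def pooled_signal(individuals):
    """Signal expected from a pool of equal RNA amounts: the arithmetic
    mean of the individual expression levels (idealized assumption)."""
    return mean(individuals)


# Ratio seen when each group is hybridized as a single pool
pooled_ratio = pooled_signal(treated) / pooled_signal(control)
# Ratio of the typical individuals, available only without pooling
median_ratio = median(treated) / median(control)

print(f"pooled ratio: {pooled_ratio:.2f}")
print(f"median ratio: {median_ratio:.2f}")
```

With pooled samples the single outlier produces an almost 3-fold apparent change, and nothing in the data reveals that four of the five individuals are unchanged.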
13
Hybridization design
• One color: not many difficulties expected
• Two color: what to hybridize with what in which color?
• Reference design
• Paired design
• Loop design
• Mixed design
Read: Yang & Speed (2002). Design issues for cDNA
microarray experiments. Nature Reviews Genetics 3, 579-588
14
Hybridization design: general issues
• Comparisons on the same array are more precise than
comparisons on different arrays
• Identify most important comparisons
• Hybridize those on the same slide
• Dye swap
• A dye effect is always present
• Balance designs with respect to dye (exception: some common
reference designs)
15
Common reference vs direct hybridizations
• Direct (A and B hybridized together on two slides)
Variance[ log(A/B) ] for one slide = s^2
then the variance of the average of the two measurements is s^2 / 2
• Common reference (A and B each hybridized against R)
log(A/B) = log(A/R) - log(B/R)
and the variance of log(A/B) is
variance[ log(A/R) ] + variance[ log(B/R) ] = s^2 + s^2 = 2 s^2
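The factor of four between the two designs can be checked by simulation. This is a minimal sketch, assuming independent, normally distributed measurement errors with the same variance s^2 on every slide and no dye effect. The true log ratio and the number of trials are arbitrary choices.

```python
import random
from statistics import pvariance


def simulate(n_trials=20000, s=1.0, true_log_ratio=1.0, seed=0):
    """Return the empirical variances of the log(A/B) estimate for the
    direct design and the common reference design, two slides each."""
    rng = random.Random(seed)
    direct, reference = [], []
    for _ in range(n_trials):
        # Direct: A vs B measured on two slides, the two estimates averaged.
        slide1 = true_log_ratio + rng.gauss(0, s)
        slide2 = true_log_ratio + rng.gauss(0, s)
        direct.append((slide1 + slide2) / 2)
        # Reference: log(A/R) and log(B/R) measured on one slide each.
        # log(B/R) is set to 0, so log(A/R) equals the true log(A/B).
        log_a_r = true_log_ratio + rng.gauss(0, s)
        log_b_r = 0.0 + rng.gauss(0, s)
        reference.append(log_a_r - log_b_r)
    return pvariance(direct), pvariance(reference)


if __name__ == "__main__":
    v_direct, v_reference = simulate()
    print(f"direct:    {v_direct:.2f} (theory 0.5 s^2)")
    print(f"reference: {v_reference:.2f} (theory 2 s^2)")
```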
16
More samples
• Loop (6 arrays)
Estimate of log(A/B) = 2/3 log(A/B) + 1/3 { log(A/C) - log(B/C) }
Assuming that all variances are equal:
Variance[ log(A/B) ] = 4/9 (s^2 / 2) + 1/9 (s^2) = 1/3 s^2
• Reference (6 arrays)
Variance[ log(A/B) ] = Variance[ log(A/C) ] = Variance[ log(B/C) ] = 0.5 s^2 + 0.5 s^2 = s^2
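Variances like these can be computed for any two-color design by ordinary least squares. The sketch below is one possible implementation, assuming one log ratio per array, independent errors with equal variance s^2, and no dye effect in the model. The function name and the convention of listing each array as (Cy5 sample, Cy3 sample) are illustrative assumptions.

```python
from fractions import Fraction


def contrast_variance(arrays, samples, a, b):
    """Variance of the least-squares estimate of log(a/b), in units of s^2.

    arrays  -- list of (cy5_sample, cy3_sample) pairs, one per array
    samples -- all sample names; the first one serves as baseline
    """
    base = samples[0]
    params = [s for s in samples if s != base]
    index = {s: i for i, s in enumerate(params)}
    p = len(params)

    def as_row(plus, minus):
        # +1 for the numerator sample, -1 for the denominator sample
        row = [Fraction(0)] * p
        if plus != base:
            row[index[plus]] += 1
        if minus != base:
            row[index[minus]] -= 1
        return row

    # Normal-equations matrix X^T X, accumulated one array at a time
    xtx = [[Fraction(0)] * p for _ in range(p)]
    for cy5, cy3 in arrays:
        row = as_row(cy5, cy3)
        for i in range(p):
            for j in range(p):
                xtx[i][j] += row[i] * row[j]

    # Solve (X^T X) z = c by Gauss-Jordan elimination; variance = c . z
    c = as_row(a, b)
    aug = [xtx[i] + [c[i]] for i in range(p)]
    for col in range(p):
        pivot = next((r for r in range(col, p) if aug[r][col] != 0), None)
        if pivot is None:
            raise ValueError("design does not connect all samples")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [v / pv for v in aug[col]]
        for r in range(p):
            if r != col:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    z = [aug[i][p] for i in range(p)]
    return sum(ci * zi for ci, zi in zip(c, z))


if __name__ == "__main__":
    loop = [("A", "B"), ("B", "A"), ("B", "C"), ("C", "B"), ("C", "A"), ("A", "C")]
    ref = [("A", "R"), ("R", "A"), ("B", "R"), ("R", "B"), ("C", "R"), ("R", "C")]
    print("loop, 6 arrays:     ", contrast_variance(loop, ["A", "B", "C"], "A", "B"))
    print("reference, 6 arrays:", contrast_variance(ref, ["R", "A", "B", "C"], "A", "B"))
```

Exact fractions are used so that the results can be compared directly with the hand calculations (1/3 s^2 for the loop, s^2 for the reference design).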
17
Common reference vs direct hybridizations
Theoretical Considerations
• A design is optimal when it minimizes the variance of the effect of interest
• Look for designs leading to small variance of log(A/B)
Practical considerations
• Common reference may be desired when experiment is extended in the future or when a lot of different conditions have to be compared
• Choose a biologically relevant common reference (e.g. your control sample). In that case, the ratios themselves are of interest and easier to interpret
18
Time-course designs
Take 4 time points
T1 T2 T3 T4
The best choice of design depends on the comparisons of
interest and on the number of slides available
19
Time-course designs
Using 3 slides:
T1 T2 T3 T4
which is the best to estimate changes relative to the initial
time point: T2 / T1, T3 / T1, T4 / T1
20
Time-course designs
• Using 3 slides:
T1 T2 T3 T4
which is the best to estimate relative changes between
successive time points: T2 / T1, T3 / T2, T4 / T3
21
Time course designs
• Using 4 slides:
T1 T2 T3 T4
R
which is the reference design;
All comparisons have equal precision
22
Time course design
• Using 4 slides:
T1 T2 T3 T4
which is the loop design, balanced wrt dye
Distant comparisons have lower precision
23
Time course designs
• Using 4 slides:
T1 T2 T3 T4
also uses exactly 2 hybridizations per treatment,
balanced wrt dye.
Most precise estimates: T1/T2, T1/T3, T2/T4, T3/T4
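The precision statements for the 4-slide designs can be quantified with a small calculation. This is a minimal sketch, assuming one array per hybridization, independent errors with equal variance s^2, and no dye effect. In a single loop of n arrays, two time points that are k arrays apart are connected by two independent routes (k arrays one way, n - k the other way). Weighting the two routes by inverse variance gives a variance of k(n - k)/n s^2. The function names are illustrative. The last design above is treated as a loop in the order T1, T2, T4, T3, which is an assumption based on its list of most precise estimates.

```python
from fractions import Fraction


def loop_variance(n_samples, steps):
    """Variance of log(Ti/Tj), in units of s^2, in a single loop of
    n_samples arrays when Ti and Tj are `steps` arrays apart."""
    if not 0 <= steps <= n_samples:
        raise ValueError("steps must lie between 0 and n_samples")
    # Inverse-variance combination of the two routes around the loop
    return Fraction(steps * (n_samples - steps), n_samples)


def reference_variance():
    """Variance of log(Ti/Tj), in units of s^2, when every sample is
    hybridized once against a common reference."""
    return Fraction(2)


if __name__ == "__main__":
    print("reference, any pair:     ", reference_variance())
    print("loop, neighbours:        ", loop_variance(4, 1))
    print("loop, opposite in loop:  ", loop_variance(4, 2))
```

Under these assumptions both loop-type designs beat the reference design for every comparison, and they differ only in which pairs of time points are neighbours in the loop.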
24
Factorial designs
• Designs for studies which involve factors as explanatory
variables
• Age group
• gender
• Cell line
• Tumor types
25
Factorial designs
Glonek & Solomon (2004)
• Admissible design: using the same number of arrays, there
are no other designs yielding smaller variances of all
parameters
Glonek et al. Biostatistics 5, 89-111 (2004)
26
Factorial design; example
• Time
• 0h
• 24h
• Cell lines
• I (non-leukaemic)
• II (leukaemic)
• Find genes differentially expressed at 24h but not at 0h: interaction
between time and cell line
27
Factorial design; possible samples
• All combinations of factor levels. In this case, 4 are possible:
                 Time
                 0       24
cell line I      I,0     I,24
cell line II     II,0    II,24
28
Factorial design: analysis model
• (log-)linear model is used
• experimental conditions correspond to parameter
combinations as in:
I,0 II,0 I,24 II,24
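A common way to parametrize such a 2x2 model is sketched below. The parameter names (mu, cell, time, interaction) and the choice of I,0 as baseline are assumptions made for illustration, not necessarily the parametrization used on the slide. The input values are hypothetical mean log2 expression levels of one gene.

```python
def factorial_effects(i_0, ii_0, i_24, ii_24):
    """Solve the 2x2 model for mean log expression values

        I,0   = mu
        II,0  = mu + cell
        I,24  = mu + time
        II,24 = mu + cell + time + interaction

    and return the four parameters as a dict.
    """
    return {
        "mu": i_0,
        "cell": ii_0 - i_0,      # cell-line effect at 0h
        "time": i_24 - i_0,      # time effect in cell line I
        # difference between the time effects of the two cell lines
        "interaction": (ii_24 - ii_0) - (i_24 - i_0),
    }


if __name__ == "__main__":
    effects = factorial_effects(i_0=8.0, ii_0=8.5, i_24=9.0, ii_24=11.5)
    for name, value in effects.items():
        print(f"{name:12s}{value:5.1f}")
```

A gene whose response to 24h differs between the two cell lines shows up as a non-zero interaction term, which is the effect of interest in this example.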
29
Factorial design; possible arrays
[Figure: the four samples I,0, I,24, II,0 and II,24, with the six possible pairwise hybridizations between them numbered (1) to (6)]
30
Optimal admissible design
• Designs that are not worse than others, and for which the
variance of the parameter of interest is (one of the) smallest
• In the example: wish to find admissible designs for which the
interaction term has one of the smallest variances
31
Glonek et al. Biostatistics 5, 89-111 (2004)
32
Optimal admissible design
Glonek et al. Biostatistics 5, 89-111 (2004)
33
Factorial designs: conclusions
• Design with all pairwise comparisons is not the best in this
case
• Best design can only be found with respect to a model
• if model does not fit the data well, design choice may not be the
best
• make sure model chosen is adequate
34
How to compare efficiently many different conditions?
• Common reference: not efficient
• Loop and mixed designs: not all
comparisons have equal precision
GA Churchill, Nat Genet. 2002 Dec;32 Suppl:490-5
35
Possible solution
• Randomized design
• Intensity-based rather than ratio-based
calculations
Requires:
• Hybridization of two samples independent; no competition for binding sites
• Absence of large spot and array effects
To be tested for each platform
36
Our favourite platform
• Spotted collection of 65-mer oligonucleotides (Sigma-
Compugen collection)
• 22K
37
Design used to demonstrate independent hyb
‘t Hoen et al. Nucleic Acids Res. 32:e41 (2004)
38
Distribution of signal intensities is similar
‘t Hoen et al. Nucleic Acids Res. 32:e41 (2004)
39
Correlation of intensities is high
‘t Hoen et al. Nucleic Acids Res. 32:e41 (2004)
[Figure legend: R > 0.95; 0.90 < R < 0.95; R < 0.90]
40
Effect of addition of unlabelled target
[Figure axes: single target on microarray vs. two targets on microarray]
‘t Hoen et al. Nucleic Acids Res. 32:e41 (2004)
41
Correlation of ratios calculated from different hyb designs
‘t Hoen et al. Nucleic Acids Res. 32:e41 (2004)
42
Intensity-based analysis
• Hybridizations of two targets on the array are independent
• No saturation and no competition
• Intensity readings show high inter-array correlation
• Comparisons on the same array have highest precision and
all other comparisons have equal precision
‘t Hoen et al. Nucleic Acids Res. 32:e41 (2004)
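The idea of an intensity-based analysis can be sketched as follows. This is a minimal sketch, assuming that each channel of each array gives an independent, already normalized log2 intensity, so that groups can be compared even when they were never hybridized on the same array. The array names, group names and values are hypothetical, and no array or dye effects are modelled.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical normalized log2 intensities of one gene:
# (array, dye, group, log2 intensity)
readings = [
    ("arr1", "Cy3", "A", 10.2), ("arr1", "Cy5", "B", 9.1),
    ("arr2", "Cy3", "B", 9.0),  ("arr2", "Cy5", "A", 10.4),
    ("arr3", "Cy3", "C", 9.9),  ("arr3", "Cy5", "D", 8.0),
    ("arr4", "Cy3", "D", 8.2),  ("arr4", "Cy5", "C", 9.7),
]


def group_means(readings):
    """Average the log2 intensities per group, ignoring array partners."""
    by_group = defaultdict(list)
    for _array, _dye, group, value in readings:
        by_group[group].append(value)
    return {group: mean(values) for group, values in by_group.items()}


def log_ratio(means, a, b):
    """Log2 ratio of group a over group b from the group means."""
    return means[a] - means[b]


if __name__ == "__main__":
    means = group_means(readings)
    # A and C never share an array, yet they can be compared directly
    print(f"log2(A/C) = {log_ratio(means, 'A', 'C'):.2f}")
```

A full analysis would typically fit a linear model with array and dye terms. The sketch only shows why any pair of groups can be compared once intensities, rather than within-array ratios, are the unit of analysis.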
43
Example of randomized design
Turk et al. FASEB J 20, 127-129 (2006)
• Mouse models for muscular dystrophy
Histopathological parameters at 8 weeks
Disease | Model | Affected gene | Age of onset | Skeletal dystrophy | Inflammation | Central nuclei | DGC loss | SGC loss | Reference
DMD | mdx | Dystrophin | 2-3 wks | Severe | + | + | y | y | [43]
DMD | mdx3cv | Dystrophin | 2-3 wks | Severe | + | + | y | y | [25]
LGMD2D | Sgca-null | alpha-Sarcoglycan | 1 wk | Severe | + | + | n | highly reduced | [26]
LGMD2E | Sgcb-null | beta-Sarcoglycan | at least 4 wks | Severe | + | + | n | y | [27,28]
LGMD2C | Sgcg-null | gamma-Sarcoglycan | 2 wks | Severe | + | + | n | highly reduced | [29]
LGMD2F | Sgcd-null | delta-Sarcoglycan | 2 wks | Severe | + | + | n | highly reduced | [30]
LGMD2B | Dysf-null | Dysferlin | 8 wks | Mild/moderate | - | +/- | n | n | [11]
LGMD2B | SjlDysf | Dysferlin | 3 wks | Mild/moderate | n/a | +/- | n/a | n/a | [24]
not known | Sspn-null | Sarcospan | None | None | - | - | n | n | [31]
44
Our design
• Randomly assign samples to the arrays, avoiding co-hybridization
of samples from the same group
• 2 biological replicates
• 4 technical replicates (dye-swap + replicate spotting)
HYB ID   Cy3: Model Individual   Cy5: Model Individual
MD1 3cv B Bl10 B
MD2 Sgcb2 B Mdx B
MD3 Bl10 A Sgcb2 A
MD4 Sspn B 3cv A
MD5 Bl6 B Bl10 A
MD6 Sgcb A dysf A
MD7 Sgcd B Sgcb A
MD8 Sgcg A Sspn A
MD9 Mdx A 3cv B
MD10 Bl6 A Mdx A
MD11 Sjl B Bl6 A
MD12 Sgca A Bl6 B
MD13 3cv A hDMD B
MD14 Mdx B Sgca A
MD15 Bl10 B Sgcg B
MD16 Sspn A dysf B
MD17 Sjl A Sgca B
MD18 Sgcb2 A Sgcd B
MD19 dysf A Sgcg A
MD20 Sgcd A Sgcb2 B
MD21 Sgca B Sgcd A
MD22 dysf B hDMD A
MD23 Sgcg B Sjl A
MD24 Sgcb B Sspn B
MD25 hDMD A Sgcb B
MD26 hDMD B Sjl B
Turk et al. FASEB J 20, 127-129 (2006)
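A design of this kind can be generated by simple rejection sampling. The sketch below is one possible procedure, not necessarily the one used for the published experiment. It assumes that every sample is labelled once with each dye (as appears to be the case in the table above) and that two samples from the same group must never share an array. The function name and the example group names are illustrative.

```python
import random


def randomized_design(samples, seed=None, max_tries=10000):
    """Return a list of (cy3_sample, cy5_sample) pairs, one per array.

    samples -- list of (group, individual) tuples
    Every sample is used once per dye, and no array carries two samples
    from the same group.
    """
    rng = random.Random(seed)
    cy3 = samples[:]
    rng.shuffle(cy3)
    cy5 = samples[:]
    for _ in range(max_tries):
        rng.shuffle(cy5)
        # Accept the pairing only if no array combines one group with itself
        if all(a[0] != b[0] for a, b in zip(cy3, cy5)):
            return list(zip(cy3, cy5))
    raise RuntimeError("no valid design found; increase max_tries")


if __name__ == "__main__":
    samples = [(group, individual)
               for group in ("wt", "mdx", "Sgca", "Sgcb")
               for individual in ("A", "B")]
    for number, (cy3, cy5) in enumerate(randomized_design(samples, seed=42), 1):
        print(f"array {number}: Cy3 = {cy3}, Cy5 = {cy5}")
```

Rejection sampling is adequate here because a sizeable fraction of random pairings is valid. For designs with many more constraints a constructive algorithm would be the better choice.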
45
Intensity-based analysis can go wrong
Vinciotti et al. Bioinformatics 21:492-501 (2005)
46
Intensity-based analysis can go wrong
Vinciotti et al. Bioinformatics 21:492-501 (2005)
47
Some guidelines
• First determine the main question, pointing out the effect of
interest
• log[A/B]
• Then choose analysis model, so that effect variance can
be computed
• VAR { log[A/B] }
• Practical constraints: amount of RNA available, number of
hybridizations, number of slides
• A good design measures the effect of interest as
accurately as possible
• small VAR { log[A/B] }
48
Some useful links
• http://dial.liacs.nl/Courses/CMSB%20Courses.html
• http://www.brc.dcs.gla.ac.uk/~rb106x/microarray_tips.htm
• http://exgen.ma.umist.ac.uk/course/notes/WitDesignLecture.pdf
• http://discover.nci.nih.gov/microarrayAnalysis/Experimental.Design.jsp
49
Acknowledgements
Human and Clinical Genetics, LUMC: Judith Boer, Renée de Menezes, Rolf Turk, Ellen Sterrenburg, Johan den Dunnen, Gertjan van Ommen
Microarray facility: Leiden Genome Technology Center
50
Case study
• Two genetically-modified zebrafish strains and one wild-type
• Defects mainly in muscle development
• Apparent at 12-48 hours of development; early death
• Question: which biological pathways are affected and
responsible for defective myogenesis?
51
Possible platforms and budget
• Affymetrix (1-color): 500 euro per chip;
• variance for ratio of two samples on two chips: s^2
• Homespotted arrays (2-color): 100 euro per chip
• variance for ratio of two samples on one chip: 2 s^2
• Budget: 12,000 euro
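As a starting point for the discussion, the numbers above can be turned into a rough comparison. This is a deliberately simplistic sketch: it assumes that the whole budget is spent on chips for a single two-sample comparison, that replicate comparisons are independent, and that sample preparation costs nothing. The function name is illustrative.

```python
from fractions import Fraction

BUDGET = 12000  # euro


def replicate_variance(price_per_chip, chips_per_comparison,
                       variance_per_comparison):
    """Return (number of chips, variance of the mean log ratio in units
    of s^2) when the whole budget goes into replicate comparisons."""
    chips = BUDGET // price_per_chip
    comparisons = chips // chips_per_comparison
    return chips, Fraction(variance_per_comparison, comparisons)


if __name__ == "__main__":
    # Affymetrix: two chips per comparison, variance s^2
    print("Affymetrix: ", replicate_variance(500, 2, 1))
    # Homespotted: one chip per comparison, variance 2 s^2
    print("Homespotted:", replicate_variance(100, 1, 2))
```

The case study involves three strains and a limited number of animals, so the real design question is how to distribute these chips over comparisons and biological replicates.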
52
Questions
• Isolation of specific compartments / whole animal lysates?
• Pooling?
• How many replicates?
• Which hybridization design?
• What is the variance of the most important comparisons?