Design of Experiments
• Problem formulation
• Setting up the experiment
• Analysis of data
Panu Somervuo, March 20, 2007
Problem formulation
• what is the biological question?
• how to answer that?
• what is already known?
• what information is missing?
• problem formulation → model of the biological system
Setting up an experiment
• what kind of data is needed to answer the question?
• how to collect the data?
• how much data is needed?
• biological and technical replicates
• pooling
• how to carry out the experiment (sample preparation, measurements)?
[Figure: Control and Test samples]
Analysis of data
• preprocessing
• filtering & outlier removal
• normalization
• statistical model fitting
• hypothesis testing
• reporting the results, documentation
Everything depends on everything
• problem formulation (model of the system)
• setting up the experiment (number of samples)
• analysis of data (statistical tests)
Practical guidelines
• blocking unwanted effects (e.g. dye effect)
• randomization (avoid systematic bias by randomizing e.g. the order of sample preparations)
• replication (replicate measurements can be averaged to reduce the effect of random errors)
[Figure: two ways of assigning the dyes cy3 and cy5 to group1 and group2; in the second layout both dyes are used in both groups]
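The blocking and randomization guidelines can be sketched in R. This is only an illustration under assumptions: the sample ids, the group sizes, and the column names `prep_order` and `dye` are hypothetical and do not come from the slides.

```r
# Hypothetical experiment: 4 control and 4 test samples.
samples <- data.frame(
  id    = paste0("s", 1:8),
  group = rep(c("control", "test"), each = 4)
)

set.seed(2007)  # fixed seed only so that the plan can be reproduced

# Randomization: prepare the samples in a random order.
samples$prep_order <- sample(nrow(samples))

# Blocking the dye effect: within each group, half of the samples
# get cy3 and half get cy5, assigned at random.
samples$dye <- NA_character_
for (g in unique(samples$group)) {
  rows <- which(samples$group == g)
  samples$dye[rows] <- sample(rep(c("cy3", "cy5"), length.out = length(rows)))
}

# Show the plan in preparation order.
samples[order(samples$prep_order), ]
```

Because each dye appears equally often in each group, a dye effect cannot be mistaken for a group effect.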
[Figure: Control and Test samples → log transform, normalization → y = µ + F1 + F2 + ... + error]
Pairwise sample comparison vs modeling
• pairwise sample comparison is easy and straightforward
• instead of comparing samples as such, we can construct a model for the measurements and then perform comparisons
[Figure: Control and Test samples]
Mathematical model of data
• try to capture the essence of a (biological) phenomenon in mathematical terms
• here we concentrate on linear models: each observation is the sum of the effects of one or more factors and a random error
• a factor may have several levels (e.g. the factor sex has two levels, male and female)
Examples of models
• single factor: y = µ + gene + error
• two factors: y = µ + treatment + gene + error
• two factors including interaction term: y = µ + treatment + gene + treatment.gene + error
• four factors: y = µ + treatment + gene + dye + array + error
(annotation on the slide: normalization, log transform)
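In R, the example models map onto model formulas. A minimal sketch, assuming the factors are stored in columns named `gene`, `treatment`, `dye` and `array` (hypothetical names); the intercept µ and the error term are implicit in an R formula:

```r
f_single      <- y ~ gene
f_two         <- y ~ treatment + gene
f_interaction <- y ~ treatment + gene + treatment:gene  # same as y ~ treatment * gene
f_four        <- y ~ treatment + gene + dye + array

# Helper (introduced here for illustration): list the terms a formula expands to.
term_labels <- function(f) attr(terms(f), "term.labels")

term_labels(f_interaction)
term_labels(y ~ treatment * gene)
```

The interaction term written as treatment.gene on the slide corresponds to `treatment:gene` in R.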
From model to experimental design
y = µ + drug + sex + drug.sex + error
factor 1, drug: 3 levels
factor 2, sex: 2 levels
3x2 factorial design:

| | M | F |
|---|---|---|
| no treatment | y111, y112, y113, y114 | y121, y122, y123, y124 |
| treatment A | y211, y212, y213, y214 | y221, y222, y223, y224 |
| treatment B | y311, y312, y313, y314 | y321, y322, y323, y324 |
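The layout of such a factorial design can be generated with `expand.grid`. This is a sketch; the column names and the four replicates per cell follow the table above, while the `label` column is added here only to mirror the y_ijk notation.

```r
# 3 drug levels x 2 sex levels x 4 replicates per cell.
design <- expand.grid(
  replicate = 1:4,
  sex       = c("M", "F"),
  drug      = c("no treatment", "treatment A", "treatment B")
)

# Label each observation y_ijk: i = drug level, j = sex level, k = replicate.
design$label <- paste0("y", as.integer(design$drug),
                       as.integer(design$sex), design$replicate)

# Number of observations in every drug x sex cell.
table(design$drug, design$sex)
```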
Analysis of variance
• ANOVA can be used to analyse factorial designs
y = µ + drug + sex + drug.sex + error

| | M | F |
|---|---|---|
| no treatment | 1.0, 1.1, 0.9, 1.3 | 0.7, 0.5, 0.6, 0.8 |
| treatment A | 1.1, 1.2, 0.8, 1.3 | 0.7, 0.8, 0.6, 0.9 |
| treatment B | 2.1, 1.9, 1.7, 2.0 | 1.5, 1.3, 1.4, 1.1 |

    summary(aov(y~drug*sex,data=data))

              Df  Sum Sq Mean Sq F value    Pr(>F)
    drug       2 2.86750 1.43375 51.3582 3.644e-08 ***
    sex        1 1.26042 1.26042 45.1493 2.673e-06 ***
    drug:sex   2 0.06583 0.03292  1.1791    0.3302
    Residuals 18 0.50250 0.02792
    ---
    Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
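The slide does not show how the data frame `data` is built. One possible construction is sketched below, assuming the drug levels are coded "0", "A" and "B" and that the values are entered row by row from the table:

```r
data <- data.frame(
  drug = factor(rep(c("0", "A", "B"), each = 8), levels = c("0", "A", "B")),
  sex  = factor(rep(rep(c("M", "F"), each = 4), times = 3), levels = c("M", "F")),
  y    = c(1.0, 1.1, 0.9, 1.3,  0.7, 0.5, 0.6, 0.8,   # no treatment: M, F
           1.1, 1.2, 0.8, 1.3,  0.7, 0.8, 0.6, 0.9,   # treatment A:  M, F
           2.1, 1.9, 1.7, 2.0,  1.5, 1.3, 1.4, 1.1)   # treatment B:  M, F
)

# Two-way ANOVA with interaction.
fit <- aov(y ~ drug * sex, data = data)
anova_table <- summary(fit)[[1]]
anova_table
```

With this coding the sums of squares agree with the table printed on the slide.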
Multiple pairwise comparisons
• ANOVA tells us that at least one drug treatment has an effect; to find out which one, we perform all pairwise comparisons:

| | M | F |
|---|---|---|
| no treatment | 1.0, 1.1, 0.9, 1.3 | 0.7, 0.5, 0.6, 0.8 |
| treatment A | 1.1, 1.2, 0.8, 1.3 | 0.7, 0.8, 0.6, 0.9 |
| treatment B | 2.1, 1.9, 1.7, 2.0 | 1.5, 1.3, 1.4, 1.1 |

    TukeyHSD(aov(y~drug*sex,data=data),"drug")

      Tukey multiple comparisons of means
        95% family-wise confidence level
        factor levels have been ordered

    Fit: aov(formula = y ~ drug * sex, data = data)

    $drug
          diff        lwr       upr
    A-0 0.0625 -0.1507113 0.2757113
    B-0 0.7625  0.5492887 0.9757113
    B-A 0.7000  0.4867887 0.9132113
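A pairwise difference is significant at the family-wise 5% level when its confidence interval excludes zero. The sketch below rebuilds the data (same assumed coding of the drug levels as "0", "A", "B") and flags those comparisons; the variable `excludes_zero` is introduced here for illustration.

```r
data <- data.frame(
  drug = factor(rep(c("0", "A", "B"), each = 8), levels = c("0", "A", "B")),
  sex  = factor(rep(rep(c("M", "F"), each = 4), times = 3), levels = c("M", "F")),
  y    = c(1.0, 1.1, 0.9, 1.3,  0.7, 0.5, 0.6, 0.8,
           1.1, 1.2, 0.8, 1.3,  0.7, 0.8, 0.6, 0.9,
           2.1, 1.9, 1.7, 2.0,  1.5, 1.3, 1.4, 1.1)
)

# Matrix of pairwise differences with confidence limits for the drug factor.
tuk <- TukeyHSD(aov(y ~ drug * sex, data = data), "drug")$drug

# TRUE where the interval [lwr, upr] does not contain zero.
excludes_zero <- tuk[, "lwr"] > 0 | tuk[, "upr"] < 0
excludes_zero
```

Here treatment B differs from both no treatment and treatment A, while treatment A does not differ from no treatment.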
Benefits of (good) models
• after fitting the model to the data, the model can be used to answer questions, e.g.:
– is there a dye effect?
– is the difference of gene expression levels in two conditions statistically significant?
– is there interaction between gene and another factor?
• simple pairwise sample comparisons cannot give answers to all of these questions simultaneously
[Figure: Control and Test samples → y = µ + F1 + F2 + ... + error]
What is a good model?
• a good model allows us to get more detailed results
• the best model and parametrization are application specific
• simple vs complex model: y = µ + F1 + F2 + F3 + ... + error
• there should be a balance between model complexity and the amount of data

| | dye1 | dye2 |
|---|---|---|
| control | y111, y112, y113 | y121, y122, y123 |
| treatment A | y211, y212, y213 | y221, y222, y223 |
| treatment B | y311, y312, y313 | y321, y322, y323 |
How does the number of samples affect the confidence of our results?
• measurement error is always present; see the example of a self-self hybridization:
[Figure: self-self hybridization example]
How does the number of samples affect the confidence of our results?
• let’s compute the mean expression level of a gene
• how accurate is this value?
• variance(mean) = variance(error)/number of samples
• example: samples from a normal distribution (mean 0, sd 1)
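The relation variance(mean) = variance(error)/number of samples can be checked by simulation. This is a sketch, not the computation behind the original figure; the sample sizes and the number of repetitions are arbitrary choices.

```r
set.seed(1)
n_values <- c(2, 5, 10, 50)   # number of samples averaged
n_rep    <- 10000             # number of simulated experiments per n

# For each n, draw n values from N(0, 1) many times and record
# the variance of the resulting sample means.
simulated <- sapply(n_values, function(n) {
  means <- replicate(n_rep, mean(rnorm(n, mean = 0, sd = 1)))
  var(means)
})

theoretical <- 1 / n_values   # variance(error) / number of samples

round(cbind(n = n_values, simulated, theoretical), 4)
```

The simulated variances should lie close to the theoretical ones, so quadrupling the number of samples halves the standard error of the mean.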
Theoretical sample size calculations
• for each statistical test, there is a (test-specific) relation between:
– power of a test: 1 – probability(type II error)
– significance level: probability(type I error)
– error variance
– mean difference needed to be detected
– number of samples
| | actual situation: drug has effect | actual situation: drug has no effect |
|---|---|---|
| our conclusion: drug has effect | correct conclusion (true positive); probability = power | type I error (false positive); probability = significance level |
| our conclusion: drug has no effect | type II error (false negative); probability = 1 – power | correct conclusion (true negative); probability = 1 – significance level |
How many samples are needed to detect a difference of 1 unit between sample means?
R function power.t.test:

    > power.t.test(delta=1,power=0.95,sd=1,sig.level=0.05)

         Two-sample t test power calculation

                  n = 26.98922
              delta = 1
                 sd = 1
          sig.level = 0.05
              power = 0.95
        alternative = two.sided

    NOTE: n is number in *each* group
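Since `power.t.test` returns a fractional n, the value has to be rounded up before planning the experiment. A short sketch (the variable names are chosen here for illustration):

```r
# Solve for the sample size per group; n is left unspecified on purpose.
res <- power.t.test(delta = 1, sd = 1, sig.level = 0.05, power = 0.95)

# Round up: a fractional sample is not possible, and rounding down
# would give slightly less than the requested power.
n_per_group <- ceiling(res$n)
n_total     <- 2 * n_per_group

c(per_group = n_per_group, total = n_total)
```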
What is the power of the test when using 10 samples?
R function power.t.test:

    > power.t.test(n=10,delta=1,sd=1,sig.level=0.05)

         Two-sample t test power calculation

                  n = 10
              delta = 1
                 sd = 1
          sig.level = 0.05
              power = 0.5619846
        alternative = two.sided

    NOTE: n is number in *each* group
How small a difference between sample means are we able to detect using 10 samples?
R function power.t.test:

    > power.t.test(n=10,power=0.95,sd=1,sig.level=0.05)

         Two-sample t test power calculation

                  n = 10
              delta = 1.706224
                 sd = 1
          sig.level = 0.05
              power = 0.95
        alternative = two.sided

    NOTE: n is number in *each* group
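The same function can be evaluated over a range of sample sizes to see how power grows with n. A sketch for delta = 1 and sd = 1; the list of n values is an arbitrary choice.

```r
n_values <- c(5, 10, 20, 27, 40)   # samples per group

# Power of the two-sample t test for each sample size.
power <- sapply(n_values, function(n) {
  power.t.test(n = n, delta = 1, sd = 1, sig.level = 0.05)$power
})

data.frame(n = n_values, power = round(power, 3))
```

Power increases with n, and with 27 samples per group it reaches the 0.95 requested in the first calculation.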
Two kinds of replicates
• biological replicates: biological variability
• technical replicates: measurement accuracy
• most statistical programs assume independent samples
[Figure: samples A1 to A3, B1 to B3, C1 to C3, and D1 to D3]
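Technical replicates of the same biological sample are not independent, so one common approach is to average them per biological sample before testing. The sketch below uses made-up values and hypothetical column names (`group`, `sample`, `y`); it is not taken from the slides.

```r
# Two groups, three biological samples each, two technical replicates per sample.
meas <- data.frame(
  group  = rep(c("A", "B"), each = 6),
  sample = rep(c("A1", "A2", "A3", "B1", "B2", "B3"), each = 2),
  y      = c(1.0, 1.2, 0.8, 0.9, 1.1, 1.0,    # made-up values, group A
             1.9, 2.1, 1.6, 1.8, 2.2, 2.0)    # made-up values, group B
)

# Average the technical replicates: one value per biological sample.
per_sample <- aggregate(y ~ group + sample, data = meas, FUN = mean)

# The test now sees 3 independent observations per group instead of 6.
t.test(y ~ group, data = per_sample)
```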
Pooling
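[Figure: individual samples A1, A2, A3 and B1, B2, B3 hybridized separately, compared with pooled samples (A1, A2, A3 in one pool; B1, B2, B3 in another)]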
Pooling
• ok when the interest is not in the individual, but in common patterns across individuals (population characteristics)
• results in averaging → reduces variability → substantive features are easier to find
• recommended when fewer than 3 arrays are used in each condition
• beneficial when many subjects are pooled
• one pool vs independent samples in multiple pools
C. Kendziorski, R. A. Irizarry, K.-S. Chen, J. D. Haag, and M. N. Gould, "On the utility of pooling biological samples in microarray experiments", PNAS, March 2005, 102(12), 4252-4257
inference for most genes was not affected by pooling
How to allocate the samples to microarrays?
• which samples should be hybridized on the same slide?
• different experimental designs
• reference design, loop design
• what is the optimal design?
[Figure: samples A, B, C, and D]
Example of four-array experiment
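| array | cy3 | cy5 | log(cy5/cy3) |
|---|---|---|---|
| 1 | A | B | log(B) – log(A) |
| 2 | A | B | log(B) – log(A) |
| 3 | B | A | log(A) – log(B) |
| 4 | B | A | log(A) – log(B) |

[Figure: samples A and B connected by arrays 1 to 4; the cy3 and cy5 labels are swapped between arrays 1, 2 and arrays 3, 4]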
Reference design
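[Figure: samples A, B, C, and D each hybridized against a common reference (Ref) on arrays 1 to 4]

| array | cy3 | cy5 | log(cy5/cy3) |
|---|---|---|---|
| 1 | Ref | A | log(A) – log(Ref) |
| 2 | Ref | B | log(B) – log(Ref) |
| 3 | Ref | C | log(C) – log(Ref) |
| 4 | Ref | D | log(D) – log(Ref) |

log(C/A) = log(C) – log(A)
= log(C) – log(Ref) + log(Ref) – log(A)
= (log(C) – log(Ref)) – (log(A) – log(Ref))
= logratio(array3) – logratio(array1)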
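The identity can be checked numerically. The sketch assumes base-2 logarithms (the slides do not state the base) and uses made-up intensities for a single gene.

```r
# Made-up intensities for one gene (arbitrary units).
intensity <- c(Ref = 200, A = 100, B = 400, C = 800, D = 50)

# Log ratio measured on each array: sample (cy5) against reference (cy3).
logratio <- log2(intensity[c("A", "B", "C", "D")]) - log2(intensity[["Ref"]])
names(logratio) <- paste0("array", 1:4)

# Indirect comparison of C and A through the common reference.
indirect <- logratio[["array3"]] - logratio[["array1"]]

# Direct value, for comparison.
direct <- log2(intensity[["C"]] / intensity[["A"]])

c(indirect = indirect, direct = direct)
```

The reference cancels out, so any two samples can be compared even though they were never hybridized on the same array.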
Loop design
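[Figure: samples A, B, C, and D connected in a loop by arrays 1 to 4]

| array | cy3 | cy5 | log(cy5/cy3) |
|---|---|---|---|
| 1 | A | B | log(B) – log(A) |
| 2 | B | C | log(C) – log(B) |
| 3 | C | D | log(D) – log(C) |
| 4 | D | A | log(A) – log(D) |

log(C/A) = log(C) – log(B) + log(B) – log(A) = logratio(array2) + logratio(array1)
log(C/A) = log(C) – log(D) + log(D) – log(A) = – logratio(array3) – logratio(array4)
averaging the two estimates: log(C/A) = (logratio(array1) + logratio(array2) – logratio(array3) – logratio(array4))/2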
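The two paths around the loop and their average can be written out as follows. This is a sketch with made-up intensities and made-up measurement errors, again assuming base-2 logarithms.

```r
# Made-up true intensities for one gene.
lg <- log2(c(A = 100, B = 400, C = 800, D = 50))

# Error-free log ratios, array by array (cy5 minus cy3).
m <- c(array1 = lg[["B"]] - lg[["A"]],
       array2 = lg[["C"]] - lg[["B"]],
       array3 = lg[["D"]] - lg[["C"]],
       array4 = lg[["A"]] - lg[["D"]])

# Add small made-up measurement errors to each array.
m_noisy <- m + c(0.10, -0.05, 0.08, -0.02)

# Two estimates of log(C/A), one for each way around the loop.
path_via_B <- m_noisy[["array1"]] + m_noisy[["array2"]]
path_via_D <- -m_noisy[["array3"]] - m_noisy[["array4"]]

# Combined estimate: the mean of the two paths.
estimate <- (path_via_B + path_via_D) / 2
c(via_B = path_via_B, via_D = path_via_D, combined = estimate)
```

Without measurement error both paths give exactly log(C/A); with error, averaging the two paths reduces the variance of the estimate.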
Comparing the designs
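[Figure: reference design (A, B, C each against Ref), reference design with replicates, and loop design (A, B, C)]

| | reference design | reference design with replicates | loop design |
|---|---|---|---|
| number of arrays | 3 | 6 | 3 |
| amount of RNA required per sample | 1+Ref | 2+Ref | 2 |
| error | 2.0 | 1.0 | 0.67 |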
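The slides do not say how the error row is defined. One reading that reproduces the three numbers is the variance of the least-squares estimate of a pairwise log ratio such as log(B/A), in units of the error variance of a single array. The sketch below makes that assumption; `contrast_var` is a helper introduced here.

```r
# Variance of an estimated contrast c'beta under least squares with
# unit error variance: c' (X'X)^-1 c.
contrast_var <- function(X, contrast) {
  drop(t(contrast) %*% solve(crossprod(X)) %*% contrast)
}

# Reference design: parameters are log(A/Ref), log(B/Ref), log(C/Ref),
# one array per parameter.
X_ref <- diag(3)

# Reference design with replicates: every array is done twice.
X_ref_rep <- rbind(diag(3), diag(3))

# Loop design A -> B -> C -> A: parameters are b = log(B/A), c = log(C/A).
X_loop <- rbind(c( 1,  0),   # array 1: log(B) - log(A) = b
                c(-1,  1),   # array 2: log(C) - log(B) = c - b
                c( 0, -1))   # array 3: log(A) - log(C) = -c

errors <- c(
  reference            = contrast_var(X_ref,     c(-1, 1, 0)),  # log(B/A)
  reference_replicated = contrast_var(X_ref_rep, c(-1, 1, 0)),
  loop                 = contrast_var(X_loop,    c(1, 0))
)
round(errors, 2)
```

Under this reading, the loop design uses the same three arrays as the plain reference design but estimates pairwise differences with one third of the variance.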
Design with all direct pairwise comparisons
[Figure: design in which every pair of samples is compared directly, arrays numbered 1 to 6]
Example: examining genotype, phenotype, and environment
[Figure: four conditions (Parental - stressed, Parental - unstressed, Derived - stressed, Derived - unstressed) arranged along the Environment and Genotype axes, with a Reference Sample and Assay Variation also indicated]
Optimal design
• maximize the accuracy of parameters of interest
• procedure: enumerate all possible designs, calculate the parameter accuracy for each of them and select the best design
• optimal design is model specific
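The enumeration procedure can be sketched for a small case: three samples A, B, C and three two-colour arrays. The assumptions here are not from the slides: dye direction is ignored, accuracy is measured as the average least-squares variance of the three pairwise log ratios, and `score` is a helper introduced for illustration.

```r
# Each array type compares two samples directly; its row expresses the
# measured log ratio in terms of b = log(B/A) and c = log(C/A).
pair_rows <- rbind(AB = c( 1,  0),   # log(B) - log(A) = b
                   BC = c(-1,  1),   # log(C) - log(B) = c - b
                   CA = c( 0, -1))   # log(A) - log(C) = -c

# Enumerate all designs: 3 arrays chosen from the 3 array types,
# repetition allowed, order of arrays irrelevant.
idx <- expand.grid(1:3, 1:3, 1:3)
idx <- unique(t(apply(idx, 1, sort)))

# Accuracy criterion: mean variance of the three pairwise comparisons.
score <- function(rows) {
  X   <- pair_rows[rows, , drop = FALSE]
  XtX <- crossprod(X)
  if (qr(XtX)$rank < 2) return(Inf)   # some sample is not connected
  V <- solve(XtX)
  mean(c(V[1, 1],                           # var of log(B/A)
         V[2, 2],                           # var of log(C/A)
         V[1, 1] + V[2, 2] - 2 * V[1, 2]))  # var of log(C/B)
}

scores <- apply(idx, 1, score)
best   <- idx[which.min(scores), ]

rownames(pair_rows)[best]   # array types in the best design
min(scores)
```

With this criterion the loop design (one array for each pair) comes out best; a different model or criterion can lead to a different optimum, which is the point of the last bullet.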
About the nature of microarray data
• microarray data can suggest hypotheses to be tested further
• results from microarray analysis should be verified by other means (qPCR, ...)
• quality of microarray data depends on samples, probes, hybridization, lab work
• data pre-processing, normalization, and outlier detection are as important as good experimental design
More about statistics
• M.J. Crawley: "Statistics – An Introduction using R", John Wiley & Sons, 2005
• S.A. Glantz: "Primer of Biostatistics", McGraw-Hill, 5th ed., 2002
• D.C. Montgomery: "Design and Analysis of Experiments", John Wiley & Sons, 5th ed., 2001