Preprocessing of Microarray Data
Normalization and Statistical Analysis
(Working with Noise...)
Microarray Processing Pipeline
• Question/Experimental Design
• Array Design/Probe Design, or Buy Chip or Array
• Sample Preparation/Hybridization
• Image Analysis
• Normalization (Scaling)
• Expression Index Calculation
• Comparable Gene Expression Data
• Statistical Analysis
• Advanced Data Analysis: Clustering, PCA, Classification, Promoter Analysis, Regulatory Network
Question/Experimental Design
• Read papers!
• Formulate a detailed hypothesis
• Design experiment accordingly!
• Take our extended 27612 course ☺
Setup at DTU/CBS—The GCS3000
• >50,000 probe sets per chip
• Can use the newest series chips
• Automation of routine procedures
- better reproducibility
- lighter workload
- faster scans
The CBS Lab
Conducts minor experiments/validations
~200 Chips scanned last year by CBS
Two kinds of variation
Global variation (systematic)
• Amount of RNA in the sample
• Efficiencies of:
- RNA extraction
- Reverse transcription
- Amplification
- Labeling
- Photodetection
Gene-specific variation (stochastic)
• Spotting efficiency
- Spot size
- Spot shape
• Cross-/unspecific hybridization
• Biological variation
- RNA degradation
- Noise
Sources of variation
Systematic: global variation
• Similar effect on many measurements
• Corrections can be estimated from data
• Handled by normalization
Stochastic: gene-specific variation
• Too random to be explicitly accounted for (“noise”)
• Handled by statistical testing
The R Project for Statistical Computing
Bioconductor and R
R is flexible and community-oriented
Bioconductor is a repository of R packages for microarray analysis
Web: http://www.bioconductor.org
The CBS Workhorse
Qspline – Normalization developed at CBS
Qspline performs a high-quality non-linear normalization
The Quantile and Qspline method
• From the empirical distribution, a number of quantiles are calculated for each of the channels to be normalized (one channel shown in red) and for the reference distribution (shown in black).
• A QQ-plot is made, and a normalization curve is constructed by fitting a cubic spline function.
• As reference, one can use an artificial “median array” for a set of arrays, or a log-normal distribution, which is a good approximation.
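The reference-distribution idea can be sketched in a few lines. Note that this is plain quantile normalization against a "median array", not Qspline itself: Qspline fits a cubic spline through a limited number of quantile pairs, whereas this sketch maps every rank directly. The function names and the toy intensities are made up for the example, and the arrays are assumed to have equal length.

```python
from statistics import median

def median_array(arrays):
    """Reference distribution: median of the k-th smallest value across arrays."""
    sorted_arrays = [sorted(a) for a in arrays]
    return [median(column) for column in zip(*sorted_arrays)]

def quantile_normalize(arrays):
    """Replace each value by the reference value of the same rank."""
    reference = median_array(arrays)
    normalized = []
    for a in arrays:
        # order[k] is the index of the k-th smallest value in this array
        order = sorted(range(len(a)), key=lambda i: a[i])
        out = [0.0] * len(a)
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]
        normalized.append(out)
    return normalized

# Toy intensities (hypothetical): three arrays, four probes each
arrays = [
    [5.0, 2.0, 3.0, 4.0],
    [4.0, 1.0, 4.5, 2.0],
    [3.0, 4.0, 6.0, 8.0],
]
normalized = quantile_normalize(arrays)
```

After this step every array has the same empirical distribution, while the rank order of the probes within each array is preserved.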
Visualizing data
MVA plot
Lowess Normalization
One of the most commonly utilized normalization techniques is the LOcally Weighted Scatterplot Smoothing (LOWESS) algorithm.
[Figure: MVA plot of M against A]
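A minimal sketch of the idea behind an MVA plot and a LOWESS-style correction, assuming a two-channel array with red and green intensities: M is the log ratio, A is the mean log intensity, and the smooth trend of M over A is subtracted. The smoother below is a single-pass local linear fit with tricube weights; full LOWESS adds robustness iterations, which are left out here. Function names and the toy data are hypothetical.

```python
import math

def ma_values(red, green):
    """M = log2 ratio, A = mean log2 intensity, for two channels."""
    m = [math.log2(r) - math.log2(g) for r, g in zip(red, green)]
    a = [(math.log2(r) + math.log2(g)) / 2 for r, g in zip(red, green)]
    return m, a

def local_linear_fit(x, y, frac=2 / 3):
    """Fit y at each x by weighted linear regression on the nearest points."""
    n = len(x)
    k = min(n, max(3, math.ceil(frac * n)))  # neighbourhood size
    fitted = []
    for x0 in x:
        dists = sorted(abs(xi - x0) for xi in x)
        h = dists[k - 1] or 1e-12  # bandwidth: distance to the k-th nearest point
        # Tricube weights: 1 at x0, falling to 0 at distance h
        w = [max(0.0, 1 - (abs(xi - x0) / h) ** 3) ** 3 for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted

# Toy data (hypothetical): a constant two-fold dye bias in the red channel
green = [100.0, 200.0, 400.0, 800.0, 1600.0, 3200.0]
red = [2 * g for g in green]
m, a = ma_values(red, green)
fit = local_linear_fit(a, m)
m_normalized = [mi - fi for mi, fi in zip(m, fit)]
```

With this toy bias every M equals 1 before normalization and is close to 0 afterwards; on real data the fitted curve usually bends with A.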
Invariant set normalization (Li and Wong)
• An invariant set of probes is used
- Probes that do not change intensity rank between arrays
• A piecewise linear median line is calculated
• This curve is used for normalization
Comparison of Normalization Algorithms
[Figure panels: Qspline, Loess, Linear, Inv. Set]
(From Workman et al., 2002)
Calibration = Normalization = Scaling
Expression Index Calculation
Some microarrays have multiple probes addressing the expression of the same gene
- Affymetrix GeneChips have 11-20 probe pairs per gene
- Perfect Match (PM)
- MisMatch (MM)
PM: CGATCAATTGCACTATGTCATTTCT
MM: CGATCAATTGCAGTATGTCATTTCT
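The two 25-mers above differ only at the 13th base (C in PM, G in MM). A small sketch of how an MM probe can be derived from its PM partner, assuming the middle base is replaced by its complement; the function name is made up for the example.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def mismatch_probe(pm):
    """Return the MM probe: the PM sequence with its middle base complemented."""
    middle = len(pm) // 2  # index 12, i.e. the 13th base of a 25-mer
    return pm[:middle] + COMPLEMENT[pm[middle]] + pm[middle + 1:]

pm = "CGATCAATTGCACTATGTCATTTCT"
mm = mismatch_probe(pm)
# Positions (0-based) at which PM and MM differ
differences = [i for i, (p, q) in enumerate(zip(pm, mm)) if p != q]
```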
Comparison of Methods
Reproducibility of Genetic Expression from 20 Replicates:
Blue and Red RMA; Black Li & Wong dChip; Green MAS 5.0
(By Terry Speed)
Asking questions of your microarray data
• Typically we want to identify differentially expressed genes
• Example: alcohol dehydrogenase is expressed at a higher level when alcohol is added to the media
[Figure: alcohol dehydrogenase expression without alcohol and with alcohol]
Significance Testing
Conditions: wildtype, wildtype + alcohol, mutant, mutant + alcohol
Alas, the data contain stochastic noise
He’s going to say it
There is no way around it...
Statistics
You can think of statistics as a black box...
...but, you still need to perform the test and understand how to interpret the results
Noisy measurements → statistic → p-value
What's inside the black box of ‘statistics’?
t-tests, ANOVAs and Volcanoes
The notorious p-value
p-value = the chance of rejecting the null hypothesis by coincidence, i.e. the probability of seeing a result at least this extreme when the null hypothesis is true
For gene expression analysis we can say: the chance that a gene is categorized as differentially expressed by mistake
The t-test
The t statistic is based on the sample mean and variance
Then a lookup of t in a table of the t-distribution finds the p-value
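A minimal sketch of the first step, assuming the two-sample Student's t with pooled (equal) variance; the slides do not say which variant is meant. The expression values are invented for illustration. The p-value lookup itself needs the t-distribution, which the Python standard library does not provide, so in practice a routine such as `scipy.stats.ttest_ind` would be used.

```python
import math
from statistics import mean, variance

def t_statistic(group_a, group_b):
    """Two-sample Student's t (equal-variance form) and its degrees of freedom."""
    na, nb = len(group_a), len(group_b)
    df = na + nb - 2
    # Pooled variance: weighted average of the two sample variances
    pooled_var = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / df
    t = (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var * (1 / na + 1 / nb))
    return t, df

# Hypothetical log2 expression values for one gene
with_alcohol = [9.1, 9.4, 8.9, 9.6]
without_alcohol = [7.0, 7.3, 6.8, 7.1]
t, df = t_statistic(with_alcohol, without_alcohol)
```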
ANOVA = ANalysis Of VAriance
• Very similar to the t-test, but can test multiple categories
• Ex: is gene x differentially expressed between wt, mutant 1 and mutant 2
• Advantage: it has more ‘power’ than the t-test
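A sketch of the one-way ANOVA F statistic for a single gene measured in several categories, using only the standard library. The function name and the test values are hypothetical; the p-value would again come from a lookup, this time in the F-distribution.

```python
from statistics import mean

def one_way_anova_f(groups):
    """F statistic: between-group mean square over within-group mean square."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between, df_within = k - 1, n - k
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

# Hypothetical values for gene x in wt, mutant 1 and mutant 2
wt = [1.0, 2.0, 3.0]
mutant1 = [4.0, 5.0, 6.0]
mutant2 = [7.0, 8.0, 9.0]
f, df_between, df_within = one_way_anova_f([wt, mutant1, mutant2])
```

With only two categories the F statistic equals the square of the pooled-variance t statistic, which is one way to see how closely the two tests are related.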
The two way ANOVA
[Figure: 2 x 2 design with the groups wildtype, wildtype + alcohol, mutant and mutant + alcohol; elements labelled 1, 2 and 3]
The fine print...
We can rank the genes according to the p-value
But, we really can’t trust the p-value, in the strict statistical way!
We rarely fulfill all the assumptions of the statistical tests...
And we should take multiple testing into account
The Problem of Multiple Testing
1981
• Prince Charles gets married
• Liverpool wins the European Championship League
• Pope dies
2005
• Prince Charles gets married
• Liverpool wins the European Championship League
• Pope dies
The Problem of Multiple Testing
In a typical microarray analysis we test thousands of genes
If we use a significance level of 0.05 and test 1000 genes, we would expect 50 genes to be significant by chance
1000 x 0.05 = 50
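The same arithmetic can be checked with a small simulation, assuming that none of the genes is truly differentially expressed, so that all p-values are uniformly distributed between 0 and 1. The function name and the seed are arbitrary choices for the sketch.

```python
import random

def count_false_positives(n_genes=1000, alpha=0.05, seed=0):
    """Count null genes whose p-value falls below alpha purely by chance."""
    rng = random.Random(seed)
    # Under the null hypothesis, p-values are uniform on [0, 1)
    p_values = [rng.random() for _ in range(n_genes)]
    return sum(p < alpha for p in p_values)

expected = 1000 * 0.05          # the expectation from the slide
observed = count_false_positives()  # one simulated experiment
```

The simulated count fluctuates around the expected 50 from run to run (change the seed to see this).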
Correction for multiple testing
Bonferroni (confidence level of 99%):
$P \le \frac{0.01}{N}$
Benjamini-Hochberg:
$P \le \frac{i}{N} \cdot 0.01$
N = number of genes
i = ranking number of gene
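Both cutoffs can be sketched as follows, assuming that genes are ranked by increasing p-value and that Benjamini-Hochberg is applied in its usual step-up form (all genes up to the largest rank that passes are called significant). The p-values are invented for illustration.

```python
def bonferroni(p_values, alpha=0.01):
    """Flag genes with p <= alpha / N."""
    n = len(p_values)
    return [p <= alpha / n for p in p_values]

def benjamini_hochberg(p_values, alpha=0.01):
    """Flag genes up to the largest rank i with p <= (i / N) * alpha."""
    n = len(p_values)
    order = sorted(range(n), key=lambda j: p_values[j])  # gene indices by p-value
    cutoff_rank = 0
    for rank, j in enumerate(order, start=1):
        if p_values[j] <= rank / n * alpha:
            cutoff_rank = rank
    significant = [False] * n
    for j in order[:cutoff_rank]:
        significant[j] = True
    return significant

# Hypothetical p-values for five genes
p_values = [0.0001, 0.004, 0.0015, 0.03, 0.2]
by_bonferroni = bonferroni(p_values)
by_bh = benjamini_hochberg(p_values)
```

On these toy values Bonferroni keeps two genes and Benjamini-Hochberg keeps three, which illustrates that Bonferroni is the stricter of the two.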
But really, those methods are not the answer
What we can trust is the ability of a statistical test to rank genes according to their reliability
The number of genes that are needed or can be handled in downstream processes can be used to set the cutoff
If we permute the samples, we can then get an estimate of the False Discovery Rate (FDR) in this set
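One possible way to do this, sketched under several assumptions: genes are scored by the absolute difference between the two group means (a stand-in for the t statistic), the cutoff is the score of the k-th ranked gene, and the FDR is estimated as the average number of genes passing that cutoff after shuffling the sample labels, divided by k. Function names, the simulated data and the number of permutations are all hypothetical.

```python
import random
from statistics import mean

def score(values, labels):
    """Absolute difference between the two group means (stand-in for t)."""
    a = [v for v, lab in zip(values, labels) if lab == 1]
    b = [v for v, lab in zip(values, labels) if lab == 0]
    return abs(mean(a) - mean(b))

def permutation_fdr(data, labels, top_k, n_perm=200, seed=0):
    """Estimate the FDR among the top_k ranked genes by permuting sample labels."""
    rng = random.Random(seed)
    observed = sorted((score(row, labels) for row in data), reverse=True)
    threshold = observed[top_k - 1]  # score of the last gene in the selected set
    false_counts = []
    for _ in range(n_perm):
        shuffled = labels[:]
        rng.shuffle(shuffled)
        # Genes that pass the cutoff with meaningless labels
        false_counts.append(sum(score(row, shuffled) >= threshold for row in data))
    return min(1.0, mean(false_counts) / top_k)

# Simulated data (hypothetical): 50 genes, 8 samples, the first 5 genes shifted
rng = random.Random(1)
labels = [1, 1, 1, 1, 0, 0, 0, 0]
data = [[rng.gauss(0, 1) + (2.0 if g < 5 and lab == 1 else 0.0) for lab in labels]
        for g in range(50)]
fdr = permutation_fdr(data, labels, top_k=5)
```

Varying `top_k` shows how the estimated FDR grows as more genes are included in the list.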
...but then what do we do?
[Figure: volcano plot of P-value against log2 fold change (M)]
Conclusions
• Array data contains stochastic noise
- Statistics is needed to conclude on differential expression
• We can’t trust the p-value
• But the statistics can rank genes
- The capacity/needs of downstream processes can be used to set cutoff
• FDR can be estimated
- e.g. Volcano Plots
• t-test is used for two category tests
• ANOVA is used for multiple categories
The “Result”—A Ranked Gene List
Relapse: Cluster Analysis
Clustering of 19 class predictive genes
Sensitivity:
- Relapse: 87%
- CCR: 74%
Specificity:
- Relapse: 69%
- CCR: 92%
[Figure: heatmap of the 19 genes listed below; sample order: P30.R, P18.C, P35.C, P52.C, P33.C, P13.C, P57.C, P7.C, P28.C, P37.C, P34.C, P22.C, P19.C, P55.C, P29.C, P5.R, P17.R, P21.R, P9.C, P27.R, P50.C, P53.R, P48.C, P20.R, P24.R, P51.R, P25.R, P47.C]
204812_at ZW10
203073_at LDLC
202958_at PTPN9
202772_at HMGCL
218896_s_at HSA277841
204798_at MYB
206586_at CNR2
208594_x_at ILT8
204708_at MAPK4
221746_at UBL4
222140_s_at SH120
217942_at MDS023
202322_s_at GGPS1
222103_at ATF1
202630_at APPBP2
200050_at ZNF146
204327_s_at ZNF202
202535_at FADD
201746_at TP53
[Heatmap colour scale: -0.48, 0.00, 0.48]