Normalization and Statistical Analysis - CBS€¦ · Preprocessing of Microarray Data Normalization...

Post on 09-Jun-2020

7 views 0 download

Transcript of Normalization and Statistical Analysis - CBS€¦ · Preprocessing of Microarray Data Normalization...

Preprocessing of Microarray Data

Normalization and Statistical Analysis

(Working with Noise...)

Microarray Processing Pipeline

Question/Experimental Design

Sample Preparation/Hybridization

Image Analysis

Array Design/Probe Design

Normalization (Scaling)

Comparable Gene Expression Data

Statistical Analysis

Buy Chip or Array

Advanced Data Analysis:Clustering PCA Classification Promoter Analysis Regulatory Network

Expression Index Calculation

Question/Experimental Design

• Read papers!

• Formulate a detailed hypothesis

• Design experiment accordingly!

• Take our extended 27612 course ☺

Microarray Processing Pipeline

Question/Experimental Design

Sample Preparation/Hybridization

Image Analysis

Array Design/Probe Design

Normalization (Scaling)

Comparable Gene Expression Data

Statistical Analysis

Buy Chip or Array

Advanced Data Analysis:Clustering PCA Classification Promoter Analysis Regulatory Network

Expression Index Calculation

Setup at DTU/CBS—The GCS3000

• >50.000 probe sets pr. chip

• Can use the newest series chips

• Automation of routine procedures– better reproducibility– lighter workload– faster scans

The CBS Lab

Conducts minor experiments/validations

~200 Chips scanned last year by CBS

Microarray Processing Pipeline

Question/Experimental Design

Sample Preparation/Hybridization

Image Analysis

Array Design/Probe Design

Normalization (Scaling)

Comparable Gene Expression Data

Statistical Analysis

Buy Chip or Array

Advanced Data Analysis:Clustering PCA Classification Promoter Analysis Regulatory Network

Expression Index Calculation

Two kinds of variation

Global variation

Amount of RNA in the sample

Efficiencies of:– RNA extraction– Reverse transcription– amplification– Labeling– Photodetection

Systematic

Gene-specific variation

Spotting efficiency,– Spot size– Spot shape

Cross-/unspecific hybridization

Biological variation– RNA degradation– Noise

Stochastic

Gene-specific variation:

Too random to be explicitly accounted for“noise”

Global variation:

Similar effect on many

measurementsCorrections can be estimated from data

Normalization Statistical testing

Sources of variation

Systematic Stochastic

The R Project for Statistical Computing

BioConductor and R

R is flexible and community oriented

BioConductor is a repository for R methods for microarray analysis

Web: http://www.bioconductor.org

The CBS Workhorse

Qspline – Normalization developed at CBS

Qspline performs a high-quality non-linear normalization

The Quantile and Qspline method

From the empirical distribution, a number of quantiles are calculated for each of the channels to be normalized (one channel shown in red) and for the reference distribution (shown in black)A QQ-plot is made and a normalization curve is constructed by fitting a cubic spline functionAs reference one can use an artificial “median array” for a set of arrays or use a log-normal distribution, which is a good approximation.

Visualizing data

MVA plot

Lowess Normalization

One of the most commonly utilized normalization techniques is the LOcally Weighted Scatterplot Smoothing (LOWESS) algorithm.

M

A

* * * *** *

Invariant set normalization (Li and Wong)

A invariant set of probes is usedProbes that does does not change intensity rank between arraysA piecewise linear median line is calculatedThis curve is used for normalization

Comparison of Normalization Algorithms

QsplineLoessLinearInv. Set

(From Workman et al., 2002)

Calibration = Normalization = Scaling

Microarray Processing Pipeline

Question/Experimental Design

Sample Preparation/Hybridization

Image Analysis

Array Design/Probe Design

Normalization (Scaling)

Comparable Gene Expression Data

Statistical Analysis

Buy Chip or Array

Advanced Data Analysis:Clustering PCA Classification Promoter Analysis Regulatory Network

Expression Index Calculation

Expression Index Calculation

Some microarrays have multiple probes addressing the expression of the same gene

- Affymetrix GeneChips have 11-20 probe pairs pr. gene

PM: CGATCAATTGCACTATGTCATTTCTMM: CGATCAATTGCAGTATGTCATTTCT

-Perfect Match (PM)-MisMatch (MM)

Comparison of Methods

Reproducibility of Genetic Expression from 20 Replicates:

Blue and Red RMA; Black Li & Wong dChip; Green MAS 5.0

(By Terry Speed)

Microarray Processing Pipeline

Question/Experimental Design

Sample Preparation/Hybridization

Image Analysis

Array Design/Probe Design

Normalization (Scaling)

Comparable Gene Expression Data

Statistical Analysis

Buy Chip or Array

Advanced Data Analysis:Clustering PCA Classification Promoter Analysis Regulatory Network

Expression Index Calculation

Gene-specific variation:

Too random to be explicitly accounted for“noise”

Global variation:

Similar effect on many

measurementsCorrections can be estimated from data

Normalization Statistical testing

Sources of variation

Systematic Stochastic

Asking questions of your microarray data

• Typically we want to identify differentially expressed genes

• Example: • Alcohol dehydrogenase is expressed at a higher level

when alcohol is added to the media

without alcohol with alcohol

alcohol dehydrogenase

Significance Testing

wildtype+alcohol

wildtype

mutant +alcohol

mutant

Alas, the data contain stochastic noise

He’s going to say it

There is no way around it...

Statistics

You can think of statistics as a black box...

...but, you still need to perform the test and understand how to interpret the results

Noisymeasurements p-value

statistic

What's inside the black box of ‘statistics’?

t-tests, ANOVAs and Volcanoes

The notorious p-value

p-value = The chance of rejecting the null

hypothesis by coincidence----------------------------For gene expression analysis we can

say: The chance that a gene is categorized as

differential expressed by mistake

The t-test

The t statistic is based on the sample meanand variance

Then a lookup of t in a table of the t-distribution finds the p-value

ANOVA = ANalysis Of VAriance

• Very similar to the t-test, but can test multiple categories

• Ex: is gene x differentially expressed between wt, mutant 1 and mutant 2

• Advantage: it has more ‘power’ than the t-test

The two way ANOVA

wildtype+alcohol

wildtype

mutant +alcohol

mutant

2

2

1 13 3

3

The fine print...

We can rank the genesaccording to the p-value

But, we really can’t trust the p-value, in the strict statistical way!

We rarely fulfill all the assumptions of the statistical tests...

And we should take multiple testing into account

The Problem of Multiple Testing

1981• Prince Charles gets married• Liverpool wins the European

Championship League• Pope dies

2005• Prince Charles gets married• Liverpool wins the European

Championship League• Pope dies

The Problem of Multiple Testing

In a typical microarray analysis we test thousands of genes

If we use a significance level of 0.05 and we test 1000 genes. We would expect 50 genes to be significant by chance

1000 x 0.05 = 50

Correction for multiple testing

Bonferroni:

P ≤Confidence level of 99%

0.01N

Benjamini-Hochberg:

P ≤ iN

0.01

N = Number of genesi = Ranking number of gene

But really, those methods are not the answer

What we can trust is the ability of statistic test to rank genes accordingly to their reliability

The number of genes that are needed or can be handled in downstream processes can be used to set the cutoff

If we permute the samples we then can get an estimate of the False Discovery Rate (FDR) in this set

...but then what do we do?

log2 fold change (M)

P-value

Conclusions

Array data contains stochastic noise– Statistics is needed to

conclude on differential expression

We can’t trust the p-value

But the statistics can rank genes– The capacity/needs of

downstream processes can be used to set cutoff

FDR can be estimated– e.g. Volcano Plots

t-test is used for two category tests

ANOVA is used for multiple categories

The “Result”—A Ranked Gene List

Relapse: Cluster Analysis

Clustering of 19 class predictive genes

Sensitivity:– Relapse: 87%– CCR: 74%

Specificity:– Relapse: 69%– CCR: 92%

P30.R

P18.C

P35.C

P52.C

P33.C

P13.C

P57.C

P7.C

P28.C

P37.C

P34.C

P22.C

P19.C

P55.C

P29.C

P5.R

P17.R

P21.R

P9.C

P27.R

P50.C

P53.R

P48.C

P20.R

P24.R

P51.R

P25.R

P47.C

204812_at ZW10

203073_at LDLC

202958_at PTPN9

202772_at HMGCL

218896_s_at HSA277841

204798_at MYB

206586_at CNR2

208594_x_at ILT8

204708_at MAPK4

221746_at UBL4

222140_s_at SH120

217942_at MDS023

202322_s_at GGPS1

222103_at ATF1

202630_at APPBP2

200050_at ZNF146

204327_s_at ZNF202

202535_at FADD

201746_at TP53

-0.48 0.00 0.48