Normalization and Statistical Analysis - CBS€¦ · Preprocessing of Microarray Data Normalization...

Preprocessing of Microarray Data

Normalization and Statistical Analysis

(Working with Noise...)

Microarray Processing Pipeline

Question/Experimental Design

Sample Preparation/Hybridization

Image Analysis

Array Design/Probe Design

Normalization (Scaling)

Comparable Gene Expression Data

Statistical Analysis

Buy Chip or Array

Advanced Data Analysis:Clustering PCA Classification Promoter Analysis Regulatory Network

Expression Index Calculation

• Read papers!

• Formulate a detailed hypothesis

• Design experiment accordingly!

• Take our extended 27612 course ☺

Image Analysis

Buy Chip or Array

Setup at DTU/CBS—The GCS3000

• >50.000 probe sets pr. chip

• Can use the newest series chips

• Automation of routine procedures– better reproducibility– lighter workload– faster scans

The CBS Lab

Conducts minor experiments/validations

~200 Chips scanned last year by CBS

Image Analysis

Buy Chip or Array

Two kinds of variation

Global variation

Amount of RNA in the sample

Efficiencies of:– RNA extraction– Reverse transcription– amplification– Labeling– Photodetection

Systematic

Gene-specific variation

Spotting efficiency,– Spot size– Spot shape

Cross-/unspecific hybridization

Biological variation– RNA degradation– Noise

Stochastic

Gene-specific variation:

Too random to be explicitly accounted for“noise”

Global variation:

Similar effect on many

measurementsCorrections can be estimated from data

Normalization Statistical testing

Sources of variation

Systematic Stochastic

The R Project for Statistical Computing

BioConductor and R

R is flexible and community oriented

BioConductor is a repository for R methods for microarray analysis

Web: http://www.bioconductor.org

The CBS Workhorse

Qspline – Normalization developed at CBS

Qspline performs a high-quality non-linear normalization

The Quantile and Qspline method

From the empirical distribution, a number of quantiles are calculated for each of the channels to be normalized (one channel shown in red) and for the reference distribution (shown in black)A QQ-plot is made and a normalization curve is constructed by fitting a cubic spline functionAs reference one can use an artificial “median array” for a set of arrays or use a log-normal distribution, which is a good approximation.

Visualizing data

MVA plot

Lowess Normalization

One of the most commonly utilized normalization techniques is the LOcally Weighted Scatterplot Smoothing (LOWESS) algorithm.

* * * *** *

Invariant set normalization (Li and Wong)

A invariant set of probes is usedProbes that does does not change intensity rank between arraysA piecewise linear median line is calculatedThis curve is used for normalization

Comparison of Normalization Algorithms

QsplineLoessLinearInv. Set

(From Workman et al., 2002)

Calibration = Normalization = Scaling

Image Analysis

Buy Chip or Array

Some microarrays have multiple probes addressing the expression of the same gene

- Affymetrix GeneChips have 11-20 probe pairs pr. gene

PM: CGATCAATTGCACTATGTCATTTCTMM: CGATCAATTGCAGTATGTCATTTCT

-Perfect Match (PM)-MisMatch (MM)

Comparison of Methods

Reproducibility of Genetic Expression from 20 Replicates:

Blue and Red RMA; Black Li & Wong dChip; Green MAS 5.0

(By Terry Speed)

Image Analysis

Buy Chip or Array

Gene-specific variation:

Too random to be explicitly accounted for“noise”

Global variation:

Similar effect on many

measurementsCorrections can be estimated from data

Normalization Statistical testing

Sources of variation

Systematic Stochastic

Asking questions of your microarray data

• Typically we want to identify differentially expressed genes

• Example: • Alcohol dehydrogenase is expressed at a higher level

when alcohol is added to the media

without alcohol with alcohol

alcohol dehydrogenase

Significance Testing

wildtype+alcohol

wildtype

mutant +alcohol

mutant

Alas, the data contain stochastic noise

He’s going to say it

There is no way around it...

Statistics

You can think of statistics as a black box...

...but, you still need to perform the test and understand how to interpret the results

Noisymeasurements p-value

statistic

What's inside the black box of ‘statistics’?

t-tests, ANOVAs and Volcanoes

The notorious p-value

p-value = The chance of rejecting the null

hypothesis by coincidence----------------------------For gene expression analysis we can

say: The chance that a gene is categorized as

differential expressed by mistake

The t-test

The t statistic is based on the sample meanand variance

Then a lookup of t in a table of the t-distribution finds the p-value

ANOVA = ANalysis Of VAriance

• Very similar to the t-test, but can test multiple categories

• Ex: is gene x differentially expressed between wt, mutant 1 and mutant 2

• Advantage: it has more ‘power’ than the t-test

The two way ANOVA

wildtype+alcohol

wildtype

mutant +alcohol

mutant

1 13 3

The fine print...

We can rank the genesaccording to the p-value

But, we really can’t trust the p-value, in the strict statistical way!

We rarely fulfill all the assumptions of the statistical tests...

And we should take multiple testing into account

The Problem of Multiple Testing

1981• Prince Charles gets married• Liverpool wins the European

Championship League• Pope dies

2005• Prince Charles gets married• Liverpool wins the European

Championship League• Pope dies

The Problem of Multiple Testing

In a typical microarray analysis we test thousands of genes

If we use a significance level of 0.05 and we test 1000 genes. We would expect 50 genes to be significant by chance

1000 x 0.05 = 50

Correction for multiple testing

Bonferroni:

P ≤Confidence level of 99%

Benjamini-Hochberg:

P ≤ iN

N = Number of genesi = Ranking number of gene

But really, those methods are not the answer

What we can trust is the ability of statistic test to rank genes accordingly to their reliability

The number of genes that are needed or can be handled in downstream processes can be used to set the cutoff

If we permute the samples we then can get an estimate of the False Discovery Rate (FDR) in this set

...but then what do we do?

log2 fold change (M)

P-value

Conclusions

Array data contains stochastic noise– Statistics is needed to

conclude on differential expression

We can’t trust the p-value

But the statistics can rank genes– The capacity/needs of

downstream processes can be used to set cutoff

FDR can be estimated– e.g. Volcano Plots

t-test is used for two category tests

ANOVA is used for multiple categories

The “Result”—A Ranked Gene List

Relapse: Cluster Analysis

Clustering of 19 class predictive genes

Sensitivity:– Relapse: 87%– CCR: 74%

Specificity:– Relapse: 69%– CCR: 92%

204812_at ZW10

203073_at LDLC

202958_at PTPN9

202772_at HMGCL

218896_s_at HSA277841

204798_at MYB

206586_at CNR2

208594_x_at ILT8

204708_at MAPK4

221746_at UBL4

222140_s_at SH120

217942_at MDS023

202322_s_at GGPS1

222103_at ATF1

202630_at APPBP2

200050_at ZNF146

204327_s_at ZNF202

202535_at FADD

201746_at TP53

-0.48 0.00 0.48

Normalization and Statistical Analysis - CBS€¦ · Preprocessing of Microarray Data Normalization...

Documents

Transcript of Normalization and Statistical Analysis - CBS€¦ · Preprocessing of Microarray Data Normalization...

Microarray Data Normalization and Analysiscamda2009.bioinformatics.northwestern.edu/.../quackenbush/presen… · Microarray Data Normalization and Analysis John Quackenbush CAMDA

Database and R Interfacing for Annotated Microarray Data · PDF fileDBMS and the statistical environment. ... preprocessing and normalization, ... For the data-type handling issue

Lecture 9 Microarray experiments MA plots Normalization of microarray data Tests for differential expression of genes Multiple testing and FDR.

Quality control and normalization of microarray data · 2009-03-25 · Quality control and normalization of microarray data Lara Lusa Istituto Nazionale per lo Studio e la Cura dei

Microarray Technology Types Normalization Microarray Technology Microarray: –New Technology (first paper: 1995) Allows study of thousands of genes at.

Statistical Methods for Analyzing Ordered Gene Expression Microarray Data

Statistical Analysis of Microarray Data · Clustering Statistical Analysis of Microarray Data Jacques van Helden Jacques.van-Helden@univ-amu.fr Aix-Marseille Université, France Technological

Normalization - cbs.dtu.dk · Normalization Image analysis The DNA Array Analysis Pipeline. DNA Microarray Bioinformatics - #27612 Expression index value Some microarrays have multiple

Statistical Issues in the Design of Microarray Experiments

Chapter 5: Microarray Techniques - Columbia University · Chapter 5: Microarray Techniques 5.2 Analysis of Microarray Data 2 Overview Normalization Clustering. 2 3 Processing Microarray

Statistical Analysis of DNA Microarray.

Statistical normalization techniques for magnetic ...

Lecture 6 Introduction to transcriptional networks Microarray experiments MA plots Normalization of microarray data Tests for differential expression.

Microarray Normalization Xiaole Shirley Liu STAT115 / STAT215.

Microarray Data Analysis Data quality assessment and normalization for affymetrix chips.

Statistical Analysis of Microarray Data

Statistical Methods for Analyzing Tissue Microarray Datavnminin.github.io/papers/TissueMicroarray.pdf · Statistical Methods for Analyzing Tissue Microarray Data Xueli Liu1; 2, Vladimir

Microarray Normalization

Filtering and Normalization of Microarray Gene Expression Data

Course on Microarray Gene Expression Analysis ...bioinfo.cnio.es/...microarrays/...Normalization.pdf · Preprocess. Normalization Methods and Data Preprocessing Global Lowess Normalization