Henrik Bengtsson [email protected] Bioinformatics Group

Mathematical Statistics, Centre for Mathematical Sciences

Lund University

cDNA Microarrays: an introduction

Outline

• The Genomic Code

• The Central Dogma of Biology

• The cDNA Microarray Technique

• Data Analysis of cDNA Microarray Data

• Statistical Problems

• Take-home message

The Genomic Code

3 180 000 000 bp

120.000 genes ? 80.000 genes ? 35.000 genes ?

or ?

22+1 chromosome pairs

The Central Dogma of Biology

DNA: CCTGAGCCAACTATTGATGAA

↓ transcription

RNA: CCUGAGCCAACUAUUGAUGAA

↓ translation

Protein: PEPTIDE
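The slide's example can be reproduced with a short Python sketch. The codon table below is deliberately partial: it holds only the codons that occur in the example sequence, and the function names `transcribe` and `translate` are illustrative, not from the slides. The DNA string is assumed to be the coding strand, so transcription reduces to replacing T with U.

```python
# Partial codon table: only the codons present in the slide's example
# sequence (an assumption made to keep the sketch short).
CODON_TABLE = {
    "CCU": "P",  # proline
    "CCA": "P",  # proline
    "GAG": "E",  # glutamate
    "GAA": "E",  # glutamate
    "ACU": "T",  # threonine
    "AUU": "I",  # isoleucine
    "GAU": "D",  # aspartate
}


def transcribe(dna: str) -> str:
    """Coding-strand DNA -> mRNA: replace thymine (T) with uracil (U)."""
    return dna.upper().replace("T", "U")


def translate(rna: str) -> str:
    """Read the mRNA three bases at a time and look up each codon."""
    usable = len(rna) - len(rna) % 3  # ignore a trailing partial codon
    codons = [rna[i:i + 3] for i in range(0, usable, 3)]
    return "".join(CODON_TABLE[codon] for codon in codons)


dna = "CCTGAGCCAACTATTGATGAA"
rna = transcribe(dna)
protein = translate(rna)
print(rna)      # CCUGAGCCAACUAUUGAUGAA
print(protein)  # PEPTIDE
```

A full translation routine would need the complete 64-codon table and handling of start and stop codons, which this sketch omits.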

The cDNA Microarray Technique

• High-throughput measuring
  - 5000-20000 gene expressions at the same time

• Identify genes that behave differently in different cell populations
  - tumor cells vs healthy cells
  - brain cells vs liver cells
  - same tissue, different organisms

• Time series experiments
  - gene expressions over time after treatment

• ...

Example of a cDNA Microarray

Overview

[Figure: overview of the technique. cDNA clones (probes) → PCR product amplification and purification → printing (0.1 nl/spot) → microarray. RNA from the tumor sample and from the reference sample → cDNA → hybridize to the microarray → scanning (excitation with red and green lasers, emission) → overlay images and normalise → analysis.]

Creating the slides

RNA Extraction & Hybridization

[Figure: RNA from the tumor sample and from the reference sample is converted to cDNA and hybridized to the array.]

Scanning & Image Analysis

Data Output

[Figure: flowchart of a microarray study. Biological question (differentially expressed genes, sample class prediction, etc.) → Experimental design → Microarray experiment → 16-bit TIFF files → Image analysis → (Rfg, Rbg), (Gfg, Gbg) → Normalization → R, G → Estimation, Testing, Clustering, Discrimination → Biological verification and interpretation.]

Data Transformation

“Observed” data $\{(R, G)\}_{n=1,\dots,5184}$:

R = red channel signal

G = green channel signal

(background corrected or not)

Transformed data $\{(M, A)\}_{n=1,\dots,5184}$:

$M = \log_2(R/G)$ (ratio),

$A = \log_2 (R \cdot G)^{1/2} = \tfrac{1}{2} \log_2(R \cdot G)$ (intensity signal)

Inverse transformation:

$R = (2^{2A+M})^{1/2}$, $G = (2^{2A-M})^{1/2}$
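The transformation between (R, G) and (M, A) is easy to check numerically. The following Python sketch implements both directions; the function names `to_ma` and `from_ma` and the example intensities are illustrative assumptions, not taken from the slides.

```python
import math


def to_ma(r: float, g: float) -> tuple[float, float]:
    """(R, G) -> (M, A): log-ratio and average log-intensity."""
    m = math.log2(r / g)
    a = 0.5 * math.log2(r * g)
    return m, a


def from_ma(m: float, a: float) -> tuple[float, float]:
    """(M, A) -> (R, G): inverse of to_ma."""
    r = math.sqrt(2 ** (2 * a + m))
    g = math.sqrt(2 ** (2 * a - m))
    return r, g


# Hypothetical spot: red signal four times stronger than green.
m, a = to_ma(1024.0, 256.0)
print(m, a)           # 2.0 9.0
print(from_ma(m, a))  # (1024.0, 256.0)
```

Both signals must be strictly positive for the logarithms to exist, which is one reason background-corrected spots with non-positive intensities need special handling.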

Normalization

Biased towards the green channel & Intensity dependent artifacts

Replicated measurements

Scaled print-tip normalization

Median Absolute Deviation (MAD) Scaling

Averaging
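As a rough illustration of MAD scaling, the Python sketch below rescales the M values of several groups (for example print-tip groups) so that all groups end up with the same median absolute deviation. It uses one common form of the scale factor, each group's MAD divided by the geometric mean of all MADs; whether this is exactly the variant used for the slides is an assumption. It also assumes the M values have already been centred within each group and that every MAD is non-zero. The names `mad` and `mad_scale` and the data are hypothetical.

```python
import math
from statistics import median


def mad(values: list[float]) -> float:
    """Median absolute deviation from the median (no consistency constant)."""
    centre = median(values)
    return median([abs(v - centre) for v in values])


def mad_scale(groups: dict[str, list[float]]) -> dict[str, list[float]]:
    """Divide each group's M values by MAD_i / geometric mean of all MADs."""
    mads = {name: mad(values) for name, values in groups.items()}
    log_mean = sum(math.log(s) for s in mads.values()) / len(mads)
    geo_mean = math.exp(log_mean)
    return {
        name: [v * geo_mean / mads[name] for v in values]
        for name, values in groups.items()
    }


# Hypothetical M values from two print-tip groups with different spread.
groups = {
    "tip1": [-2.0, -1.0, 0.0, 1.0, 2.0],
    "tip2": [-1.0, -0.5, 0.0, 0.5, 1.0],
}
scaled = mad_scale(groups)
print(mad(scaled["tip1"]), mad(scaled["tip2"]))  # equal after scaling
```

Using the geometric mean keeps the overall scale of the slide roughly unchanged while removing differences in spread between groups.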

Identification of differentially expressed genes

Extreme in T values?

Extreme in M values? ...or extreme in some other statistic?

List of genes that the biologist can understand and verify with other experiments

Gene   Mavg   Aavg      T     SE
2341  -0.86   10.9  -18.0  0.125
6412  -0.75   11.1  -14.7  0.102
6123  -0.70    9.8  -12.2  0.121
 102   0.65   10.3  -14.5  0.136
2020   0.64    9.3  -11.9  0.118
3132   0.62    9.9  -14.4  0.090
4439  -0.62    9.7  -14.6  0.088
2031  -0.61   10.7  -13.7  0.087
 657  -0.60    9.2  -13.6  0.094
 502   0.58   10.0  -12.7  0.101
1239  -0.58    9.8  -11.4  0.103
5392  -0.57    9.9  -20.7  0.057
3921   0.52   11.3   13.5  0.083

...
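To make the contrast between ranking by M and ranking by T concrete, here is a small Python sketch. It assumes each gene has several replicated M values and uses a plain one-sample statistic, T = mean(M) / SE with SE = sd / sqrt(n). This definition is an assumption for illustration and is not necessarily the statistic behind the table above; the gene names and values are made up.

```python
import math
from statistics import mean, stdev


def summarize(replicates: list[float]) -> tuple[float, float, float]:
    """Return (average M, standard error, T) for one gene's replicates."""
    m_avg = mean(replicates)
    se = stdev(replicates) / math.sqrt(len(replicates))
    return m_avg, se, m_avg / se


def rank_genes(data: dict[str, list[float]], by: str = "t") -> list[str]:
    """Order gene names by decreasing |T| (by='t') or |average M| (by='m')."""
    index = 2 if by == "t" else 0
    stats = {gene: summarize(values) for gene, values in data.items()}
    return sorted(stats, key=lambda gene: abs(stats[gene][index]), reverse=True)


# Hypothetical replicated log-ratios for three genes.
data = {
    "geneA": [-0.90, -0.80, -0.85, -0.95],  # moderate change, very consistent
    "geneB": [0.10, -0.20, 0.05, 0.00],     # essentially unchanged
    "geneC": [1.50, 0.20, 2.00, 0.10],      # large average change, but noisy
}
print(rank_genes(data, by="m"))  # ['geneC', 'geneA', 'geneB']
print(rank_genes(data, by="t"))  # ['geneA', 'geneC', 'geneB']
```

The two rankings disagree on the top gene: the noisy gene has the largest average log-ratio, while the consistent gene has the largest |T|. With few replicates the standard error is itself unstable, which is part of why the choice of statistic is listed as an open question.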

Time Course Gene Expression Profiles

Statistical Problems

1. Image analysis
- what is foreground?
- what is background?

2. Quality
- which spots can we trust?
- which slides can we trust?

3. Artifacts from preparing the RNA, the printing, the scanning etc.

4. Data cleanup

5. Normalization within an experiment:
- when few genes change.
- when many genes change.
- dye-swap to minimize dye effects.

6. Normalization between experiments:
- location and scale effects.

7. What is noise and what is variability?

10. Which genes are actually up- and down-regulated?

11. P-values.

12. Planning of experiments:
- what is the best design?
- what is an optimal sample size?

13. Classification:
- of samples.
- of genes.

14. Clustering:
- of samples.
- of genes.

15. Time course experiments.

16. Gene networks.
- identification of pathways

17. ...
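The dye-swap idea mentioned in the list can be sketched in a few lines of Python. The assumptions here are loud ones: the dye bias is taken to be additive on the log-ratio scale and identical in both hybridizations, and M = log2(R/G) is computed the same way on both slides while the samples are swapped between the channels. The function name `dye_swap_average` and the numbers are hypothetical.

```python
def dye_swap_average(m_forward: list[float], m_swapped: list[float]) -> list[float]:
    """Combine a dye-swap pair spot by spot.

    forward slide:  M = true log-ratio + dye bias
    swapped slide:  M = -(true log-ratio) + dye bias
    Half the difference cancels the bias and keeps the true log-ratio.
    """
    return [(mf - ms) / 2 for mf, ms in zip(m_forward, m_swapped)]


# Hypothetical example: three genes, constant dye bias of 0.3.
true_m = [1.0, -0.5, 0.0]
bias = 0.3
m_forward = [t + bias for t in true_m]
m_swapped = [-t + bias for t in true_m]
print(dye_swap_average(m_forward, m_swapped))  # close to [1.0, -0.5, 0.0]
```

If the bias depends on intensity or differs between the two slides, it no longer cancels exactly, which is why dye-swap is usually combined with, rather than substituted for, within-slide normalization.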

Total microarray articles indexed in Medline

[Figure: bar chart of the number of papers per year, 1995-2001 (2001 projected); vertical axis from 0 to 600.]

Acknowledgments/Collaborators

Statistics Dept, UC Berkeley:

Sandrine Dudoit

Terry Speed

Yee Hwa Yang

CSIRO Image Analysis Group, Melbourne:

Michael Buckley

Oncology Dept, Lund University:

Pär-Ola Bendahl

Åke Borg

Johan Vallon-Christersson

Lawrence Berkeley National Laboratory:

Saira Mian

Matt Callow

Endocrinology, Lund University, Malmö:

Leif Groop

Peter Almgren

Mathematical Statistics, Chalmers University:

Olle Nerman

Staffan Nilsson

Dragi Anevski

Ernest Gallo Research Inst., California:

Monica Moore

Karen Berger

Take-home message

• Bioinformatics is the future!

• More educated people are needed!

• Statistics is fun when it is applied!

• Master’s thesis project? Talk to us!

http://www.maths.lth.se/matstat/bioinformatics/

Finding genes in DNA sequence

“This is one of the most challenging and interesting problems in computational biology at the moment. With so many genomes being sequenced so rapidly, it remains important to begin by identifying genes computationally.” – Terry Speed.

The Central Dogma of Biology

DNA → (transcription) → RNA → (translation) → Protein

Challenges:

DNA:
- Sequencing
- Fragment assembly
- Gene finding
- Linkage analysis etc.
- Homology searches
- Annotation

RNA:
- Isolation
- Sequencing
- RNA structure prediction
- Gene expression: microarrays etc.

Protein:
- Protein structure prediction
- Protein folding
- Homology searches
- Functional pathways
- Annotation
