Henrik Bengtsson (MSc CS, PhD Statistics) Dept of Statistics, UC Berkeley
Henrik Bengtsson
Bioinformatics Group
Mathematical Statistics, Centre for Mathematical Sciences
Lund University
cDNA Microarrays: an introduction
Outline
• The Genomic Code
• The Central Dogma of Biology
• The cDNA Microarray Technique
• Data Analysis of cDNA Microarray Data
• Statistical Problems
• Take-home message
The Genomic Code
3 180 000 000 bp
120 000 genes? 80 000 genes? 35 000 genes? Or ...?
22+1 chromosome pairs
The Central Dogma of Biology
DNA: CCTGAGCCAACTATTGATGAA
- transcription
RNA: CCUGAGCCAACUAUUGAUGAA
- translation
Protein: PEPTIDE
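The example sequences on this slide can be checked with a short Python sketch. It assumes, as the slide does, that transcription can be modelled as replacing T with U in the coding strand, and that translation reads the standard genetic code from the first base. The function names are illustrative only.

```python
# Standard genetic code in the usual compact 64-letter layout:
# codons are enumerated with bases in the order U, C, A, G; '*' is a stop codon.
BASES = "UCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def transcribe(dna):
    """Simplified transcription: coding-strand DNA to mRNA (T becomes U)."""
    return dna.upper().replace("T", "U")


def translate(rna):
    """Translate mRNA codon by codon, stopping at the first stop codon."""
    protein = []
    usable = len(rna) - len(rna) % 3  # ignore an incomplete trailing codon
    for pos in range(0, usable, 3):
        amino_acid = CODON_TABLE[rna[pos:pos + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


rna = transcribe("CCTGAGCCAACTATTGATGAA")
print(rna)             # CCUGAGCCAACUAUUGAUGAA
print(translate(rna))  # PEPTIDE
```

The one-letter amino acid codes of the translated sequence spell out the word shown on the slide.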
The cDNA Microarray Technique
• High-throughput measurement
  - 5000-20000 gene expressions at the same time
• Identify genes that behave differently in different cell populations
  - tumor cells vs healthy cells
  - brain cells vs liver cells
  - same tissue, different organisms
• Time series experiments
  - gene expressions over time after treatment
• ...
Example of a cDNA Microarray
Overview
- Microarray: cDNA clones (probes), PCR product amplification and purification, printing (0.1 nl/spot)
- Samples: RNA from the tumor sample and from the reference sample, each converted to cDNA, then hybridized to the microarray
- Scanning: excitation with a red laser and a green laser, emission
- Analysis: overlay images and normalise
Creating the slides
RNA Extraction & Hybridization
Tumor sample: RNA -> cDNA
Reference sample: RNA -> cDNA
Hybridize
Scanning & Image Analysis
Data Output
Biological question (differentially expressed genes, sample class prediction, etc.)
-> Experimental design
-> Microarray experiment: 16-bit TIFF files
-> Image analysis: (Rfg, Rbg), (Gfg, Gbg)
-> Normalization: R, G
-> Estimation, Testing, Clustering, Discrimination
-> Biological verification and interpretation
Data Transformation
“Observed” data $\{(R, G)\}_{n=1..5184}$:
R = red channel signal
G = green channel signal
(background corrected or not)
Transformed data $\{(M, A)\}_{n=1..5184}$:
$M = \log_2(R/G)$ (ratio),
$A = \log_2\big((R \cdot G)^{1/2}\big) = \tfrac{1}{2} \log_2(R \cdot G)$ (intensity signal)
Inverse: $R = (2^{2A+M})^{1/2}$, $G = (2^{2A-M})^{1/2}$
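The transformation and its inverse can be sketched in a few lines of Python. This is a minimal illustration for a single spot, assuming strictly positive R and G signals (the logarithm is undefined otherwise). The function names are made up for this example.

```python
import math


def to_ma(r, g):
    """Map channel signals (R, G) to the log-ratio M and log-intensity A."""
    m = math.log2(r / g)          # M = log2(R/G)
    a = 0.5 * math.log2(r * g)    # A = 1/2 * log2(R*G)
    return m, a


def to_rg(m, a):
    """Inverse mapping: recover (R, G) from (M, A)."""
    r = math.sqrt(2.0 ** (2.0 * a + m))   # R = (2^(2A+M))^(1/2)
    g = math.sqrt(2.0 ** (2.0 * a - m))   # G = (2^(2A-M))^(1/2)
    return r, g


m, a = to_ma(4096.0, 1024.0)
print(m, a)         # 2.0 11.0
print(to_rg(m, a))  # (4096.0, 1024.0)
```

A spot with R four times as large as G gets M = 2, and swapping the two channels only flips the sign of M while leaving A unchanged.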
Normalization
Biased towards the green channel & Intensity dependent artifacts
Replicated measurements
Scaled print-tip normalization
Median Absolute Deviation (MAD) Scaling
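As a rough sketch of the idea behind MAD scaling: the log-ratios of each group (for instance each print-tip group) are rescaled so that all groups end up with the same median absolute deviation. The version below divides each group by its MAD relative to the geometric mean of all group MADs. It assumes that location normalization has already been applied and that no group has a MAD of zero. The function names and the dictionary layout are assumptions made for this example, not taken from the slides.

```python
import math
from statistics import median


def mad(values):
    """Median absolute deviation about the median (no consistency constant)."""
    values = list(values)
    center = median(values)
    return median(abs(v - center) for v in values)


def mad_scale(groups):
    """Rescale each group's M values so that all groups share one MAD.

    groups: dict mapping a group label (e.g. a print-tip id) to a list of
    log-ratios M. Each group i is divided by
    a_i = MAD_i / geometric_mean(MAD_1, ..., MAD_I).
    """
    mads = {label: mad(ms) for label, ms in groups.items()}
    if any(value == 0 for value in mads.values()):
        raise ValueError("a group with MAD 0 cannot be rescaled")
    geo_mean = math.exp(sum(math.log(v) for v in mads.values()) / len(mads))
    return {
        label: [m * geo_mean / mads[label] for m in ms]
        for label, ms in groups.items()
    }


# Hypothetical log-ratios from two print-tip groups with different spread.
scaled = mad_scale({"tip1": [-1.0, 0.0, 1.0], "tip2": [-4.0, 0.0, 4.0, 8.0]})
print({label: mad(ms) for label, ms in scaled.items()})
```

Using the geometric mean keeps the overall scale of the data roughly unchanged, since the scale factors multiply to one.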
Averaging
Identification of differentially expressed genes
Extreme in T values?
Extreme in M values? ... or extreme in some other statistic?
List of genes that the biologist can understand and verify with other experiments
Gene   Mavg   Aavg      T     SE
2341  -0.86   10.9  -18.0  0.125
6412  -0.75   11.1  -14.7  0.102
6123  -0.70    9.8  -12.2  0.121
 102   0.65   10.3  -14.5  0.136
2020   0.64    9.3  -11.9  0.118
3132   0.62    9.9  -14.4  0.090
4439  -0.62    9.7  -14.6  0.088
2031  -0.61   10.7  -13.7  0.087
 657  -0.60    9.2  -13.6  0.094
 502   0.58   10.0  -12.7  0.101
1239  -0.58    9.8  -11.4  0.103
5392  -0.57    9.9  -20.7  0.057
3921   0.52   11.3   13.5  0.083
...
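One way such a ranked list could be produced from replicated log-ratios is sketched below. It assumes that T is the ordinary one-sample t-statistic, Mavg divided by the standard error of the mean. The slides do not state how T and SE were computed, so this is not claimed to reproduce the numbers in the table. The gene names and values are hypothetical.

```python
import math
from statistics import mean, stdev


def summarize(m_values):
    """Return (Mavg, SE, T) for the replicated log-ratios of one gene."""
    n = len(m_values)
    if n < 2:
        raise ValueError("at least two replicates are needed")
    m_avg = mean(m_values)
    se = stdev(m_values) / math.sqrt(n)   # standard error of the mean
    if se == 0:
        raise ValueError("identical replicates give an undefined T")
    return m_avg, se, m_avg / se


def rank_by_abs_t(replicates):
    """Sort genes by decreasing |T|; replicates maps gene -> list of M."""
    rows = []
    for gene, m_values in replicates.items():
        m_avg, se, t = summarize(m_values)
        rows.append((gene, m_avg, se, t))
    return sorted(rows, key=lambda row: abs(row[3]), reverse=True)


# Hypothetical log-ratios for three made-up genes, four replicates each.
example = {
    "geneA": [-0.9, -0.8, -0.85, -0.9],
    "geneB": [0.1, -0.2, 0.3, -0.1],
    "geneC": [0.6, 0.7, 0.5, 0.65],
}
for gene, m_avg, se, t in rank_by_abs_t(example):
    print(f"{gene}: Mavg={m_avg:+.2f} SE={se:.3f} T={t:+.1f}")
```

Sorting by |Mavg| instead of |T| only requires changing the sort key, which makes it easy to compare the two questions posed on the slide.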
Time Course Gene Expression Profiles
Statistical Problems
1. Image analysis
   - what is foreground?
   - what is background?
2. Quality
   - which spots can we trust?
   - which slides can we trust?
3. Artifacts from preparing the RNA, the printing, the scanning etc.
4. Data cleanup
5. Normalization within an experiment:
   - when few genes change.
   - when many genes change.
   - dye-swap to minimize dye effects.
6. Normalization between experiments:
   - location and scale effects.
7. What is noise and what is variability?
10. Which genes are actually up- and downregulated?
11. P-values.
12. Planning of experiments:
   - what is the best design?
   - what is an optimal sample size?
13. Classification:
   - of samples.
   - of genes.
14. Clustering:
   - of samples.
   - of genes.
15. Time course experiments.
16. Gene networks.
   - identification of pathways
17. ...
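The dye-swap idea mentioned under normalization within an experiment can be illustrated with a small sketch. It assumes a simple additive model in which each spot's log-ratio is the true effect plus a dye bias, and in which swapping the dyes flips the sign of the effect but not of the bias. The function name and the example values are hypothetical.

```python
def dye_swap_combine(m_forward, m_swapped):
    """Combine spot-wise log-ratios from a dye-swap pair of slides.

    Assumed model for each spot:
        m_forward =  effect + bias
        m_swapped = -effect + bias
    Half the difference estimates the effect, half the sum the dye bias.
    """
    if len(m_forward) != len(m_swapped):
        raise ValueError("both slides must contain the same spots")
    effect = [(f - s) / 2.0 for f, s in zip(m_forward, m_swapped)]
    bias = [(f + s) / 2.0 for f, s in zip(m_forward, m_swapped)]
    return effect, bias


# Two hypothetical spots with true effects 1.0 and -0.5 and a dye bias of 0.25.
effect, bias = dye_swap_combine([1.25, -0.25], [-0.75, 0.75])
print(effect)  # [1.0, -0.5]
print(bias)    # [0.25, 0.25]
```

Under this model the bias cancels exactly. With real data the cancellation is only approximate, because the bias can depend on intensity and differ between slides.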
Total microarray articles indexed in Medline
[Figure: number of papers (0 to 600) by year, 1995 to 2001 (projected)]
Acknowledgments/Collaborators
Statistics Dept, UC Berkeley:
Sandrine Dudoit
Terry Speed
Yee Hwa Yang
CSIRO Image Analysis Group, Melbourne:
Michael Buckley
Oncology Dept, Lund University:
Pär-Ola Bendahl
Åke Borg
Johan Vallon-Christersson
Lawrence Berkeley National Laboratory:
Saira Mian
Matt Callow
Endocrinology, Lund University, Malmö:
Leif Groop
Peter Almgren
Mathematical Statistics, Chalmers University:
Olle Nerman
Staffan Nilsson
Dragi Anevski
Ernest Gallo Research Inst., California:
Monica Moore
Karen Berger
Take-home message
• Bioinformatics is the future!
• More educated people are needed!
• Statistics is fun when it is applied!
• Master’s thesis project? Talk to us!
http://www.maths.lth.se/matstat/bioinformatics/
Finding genes in DNA sequence
“This is one of the most challenging and interesting problems in computational biology at the moment. With so many genomes being sequenced so rapidly, it remains important to begin by identifying genes computationally.” – Terry Speed.
The Central Dogma of Biology
Challenges:
DNA: Sequencing, Fragment assembly, Gene finding, Linkage analysis etc., Homology searches, Annotation
- transcription
RNA: Isolation, Sequencing, RNA structure prediction, Gene expression: microarrays etc.
- translation
Protein: Protein structure prediction, Protein folding, Homology searches, Functional pathways, Annotation