1 Microarrays Naomi Altman Dept. of Statistics and PSU March 7, 2007.


Microarrays

Naomi Altman

Dept. of Statistics and

PSU

March 7, 2007


Microarrays 100 A Statistician’s Simplification

A microarray is a piece of glass or polymer with several thousand spots, each of which contains thousands of copies of a short piece of 1 strand of the "double helix" of DNA or of cDNA (to be explained).

Each rung of the DNA ladder consists of 2 bound bases, designated C, G, A, T. These are called base pairs. Each spot on a microarray consists of a piece of one side of the ladder with the attached bases.

C binds only to G. A binds only to T.

A sample containing an unknown number of the complementary strands is labeled and hybridized to the array.

The response is a measure of the quantity of label for each spot, which should be proportional to the number of complementary strands in the sample.
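As a minimal sketch of the pairing rule on this slide (not part of the original talk), the Python below computes the strand that would bind to a probe and checks whether a sample strand hybridizes to it. The function names and the example probe are illustrative assumptions, and real hybridization tolerates some mismatches that this idealized check does not.

```python
# Watson-Crick pairing: C binds only to G, A binds only to T.
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(strand: str) -> str:
    """Return the strand that binds to `strand`, read in the opposite direction."""
    return "".join(PAIR[base] for base in reversed(strand))

def hybridizes(probe: str, target: str) -> bool:
    """True if `target` is exactly complementary to `probe` (no mismatches allowed)."""
    return target == reverse_complement(probe)

probe = "CCGTTCACA"
print(reverse_complement(probe))        # TGTGAACGG
print(hybridizes(probe, "TGTGAACGG"))   # True
```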

http://www.bioteach.ubc.ca/MolecularBiology/AMonksFlourishingGarden/


Agenda

1. DNA 100 - what we need to know to understand what a microarray can measure.

2. What can a microarray measure?

3. Where does the material printed on the microarray come from?

4. What does a microarray experiment "look like" and where do statistical methods fit in?

5. (Time permitting) Gene expression experiments and the p >> n problem.


DNA 100 A Statistician’s Simplification

Every cell in an organism has the same genetic material, stored in the double helix of DNA.

In a diploid organism, most cells have 2 copies of each chromosome.

Genes are the part of the DNA that code for proteins, but there are many other important features that interest biologists.

http://www.accessexcellence.org/RC/VL/GG/chromosome.html


Transcription (Making RNA)

www.csu.edu.au/faculty/health/biomed/subjects/molbol/basic.htm

•transcription factors bind to the promoter and bind RNA polymerase

•DNA strands separate and transcription is initiated

•transcription proceeds along the template strand in the 3'-5' direction (the RNA grows 5'-3') until a termination signal is reached

•The completed RNA strand is released for post-processing


Introns and Exons

In "higher" organisms, the gene contains noncoding regions, called introns, and coding regions called exons.

The introns are spliced out of the RNA transcript before translation into protein.

"Splicing variants" can be formed by the cell selecting combinations of the exons.

The resulting spliced strand is the mRNA.

We can "predict" exons using statistical algorithms, but the gold standard is that only exons match mRNA sequences

At each end of the mRNA is an untranslated region (UTR) which is unique to the gene.
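The splicing described on this slide can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: the sequence and exon coordinates are invented, and treating every non-empty subset of exons as a possible variant is an idealization (cells produce only some combinations).

```python
from itertools import combinations

def splice(pre_mrna, exons):
    """Join the exon segments (start, end) of a transcript, dropping the introns."""
    return "".join(pre_mrna[start:end] for start, end in exons)

def splice_variants(pre_mrna, exons):
    """All transcripts formed by keeping a non-empty subset of exons, in genomic order."""
    variants = set()
    for r in range(1, len(exons) + 1):
        for subset in combinations(exons, r):
            variants.add(splice(pre_mrna, subset))
    return variants

# Hypothetical transcript: uppercase = exon, lowercase = intron.
pre = "AUGGCguaagUUCAGcuagGGAUAA"
exons = [(0, 5), (10, 15), (19, 25)]
print(splice(pre, exons))                 # AUGGCUUCAGGGAUAA
print(len(splice_variants(pre, exons)))   # 7
```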

http://biology.unm.edu/ccouncil/Biology_124/Summaries/T&T.html

[Figure labels: chromosome, promoter]


cDNA

RNA is much less stable than DNA.

To preserve the exon sequence, and for printing microarrays, reverse transcription is used in the lab to convert the RNA into the complementary cDNA.

cDNA can be preserved by inserting it into the genome of a living microbe (cDNA library).


DNA 100 A Statistician’s Simplification

DNA is complicated stuff.

Protein-coding regions are called genes. There are also other functional parts to the DNA, some of which code for RNA and some of which are regulatory regions - i.e. they help control how the coding regions are used - e.g. promoters

The supercoiling of the DNA may also control how the coding regions are used.

As well, there is a lot of DNA which appears to be "junk" - i.e. to date no function is known. But we keep making new discoveries - e.g. some of the "junk" codes for small RNA pieces that are functional.


What can be measured on a microarray?

1. Amount of mRNA expressed by a gene: gene expression array, exon array, tiling array

2. Amount of mRNA expressed by an exon: exon array, tiling array

3. Amount of RNA expressed by a region of DNA: tiling array

4. Which strand of DNA is expressed: exon array, tiling array

5. Which of several similar DNA sequences is present in the genome: SNP array

6. How many copies of a gene are present in the genome: gene expression array, exon array, tiling array

7. Where a known protein has bound to the DNA (ChIP on chip): promoter array, tiling array


Types of Microarrays

A cDNA microarray can be made from the unsequenced cDNA library. All the other types require that the sequence be available.

[Diagram: probe types laid out along a gene]

Gene: UTR, Exon 1, Exon 2, Exon 3, UTR

cDNA: the cDNA sequence (exon, exon, exon)

oligo: short pieces of the exons

tile: consecutive tiles covering the chromosome sequence, e.g. CCGTTCACATTAGGATACCAGTTCAAGGCCGTTCACATTAGGATACCAGTTCAAGGAGGCCGTTCAGTTCACATTA

promoter: the promoter region

SNP: sequences differing at a single base, e.g. CCGTTCACA vs. CCGTGCACA and AAGGCCGTT vs. AAGGACGTT
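To make the tile and SNP probes concrete, here is a small Python sketch that uses the sequences shown on the slide. The helper names, the tile length and the step size are assumptions chosen for illustration; real tiling arrays use longer probes and may overlap or leave gaps.

```python
def tiles(sequence: str, length: int, step: int) -> list:
    """Cut `sequence` into probes of `length` bases, starting every `step` bases."""
    return [sequence[i:i + length]
            for i in range(0, len(sequence) - length + 1, step)]

def snp_positions(a: str, b: str) -> list:
    """0-based positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]

chromosome = "CCGTTCACATTAGGATACCAGTTCAAGG"   # first 28 bases from the slide
print(tiles(chromosome, length=7, step=7))
# ['CCGTTCA', 'CATTAGG', 'ATACCAG', 'TTCAAGG']
print(snp_positions("CCGTTCACA", "CCGTGCACA"))  # [4]
print(snp_positions("AAGGCCGTT", "AAGGACGTT"))  # [4]
```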


Print Technology

The cDNA or oligo can be:

1. Printed on the slide using an "arraying robot" which deposits a drop of liquid containing the material at each spot. (gene expression only)

40,000+ spots

2. Oligos (all the same length) can be synthesized on the slide using:

i) inkjet technology

ii) photolithography

1,000,000+ spots

3. There are other technologies that give similar types of results (e.g. "beads").


Spotted 2-Channel Array

http://www.anst.uu.se/frgra677/bilder/micro_method_large.jpg

Spotted arrays are printed on coated microscope slides.

2 RNA samples are converted to cDNA. Each is labelled with a different dye.


Format of an Affymetrix Array

http://cnx.rice.edu/content/m12388/latest/figE.JPG


Microarray experiments

Print or buy the microarray: obtain sequence info (sequencing error, assembly error, contamination), select oligos (unique, similar hybridization rates), print microarray


Microarray experiments

Print or buy the microarray: obtain sequence info, select oligos, print microarray

Create the labeled samples: obtain tissue sample, extract RNA, extract mRNA, label, normalize mRNA

Experimental design: number of biological replicates, technical replicates, blocks, sample pooling


Microarray experiments

Print or buy the microarray: obtain sequence info, select oligos, print microarray

Create the labeled samples: obtain tissue sample, extract RNA, extract mRNA, label, normalize mRNA

hybridize

hybridization design (multichannel)


Microarray experiments

hybridize, scan

detect spots: spot detection software

compute spot summary: pixel mean, median ...

detect background: background correction

detect bad spots: detection limit, background > foreground, badly printed spots, flaws

process image: using multiple scans

remove array specific noise: normalization
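The spot summary, background correction and bad-spot steps listed above can be sketched as follows. This is one simple choice under stated assumptions (median pixel intensity, subtraction of a local background, and a flag when background is at least as bright as foreground); the function name and pixel values are hypothetical, and image-analysis packages offer many alternatives.

```python
from statistics import median

def spot_summary(foreground, background):
    """Background-corrected spot intensity from pixel values.

    Uses the median of each pixel set (the mean is another common choice).
    Returns (corrected_intensity, flagged), where `flagged` marks a spot
    whose background is at least as bright as its foreground.
    """
    fg = median(foreground)
    bg = median(background)
    return fg - bg, bg >= fg

# Hypothetical pixel intensities; 2400 is a bright flaw the median ignores.
good = spot_summary([820, 790, 805, 2400, 810], [100, 95, 110, 105])
bad = spot_summary([90, 100, 95], [120, 130, 110])
print(good)  # (707.5, False)
print(bad)   # (-25, True)
```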


Spot Detection

Adaptive segmentation vs. fixed circle segmentation

Figure legend: GenePix, QuantArray, ScanAnalyze

Spot uses morphological opening

Slide credit: Rafael A Irizarry, Department of Biostatistics, http://www.biostat.jhsph.edu/~ririzarr, http://www.bioconductor.org (nci 2002)


Gene Expression Microarray experiments

obtain numerical summary for each gene or exon on each array: robust methods to downweight outliers, data imputation (if needed)

differential expression analysis: t-tests, ANOVA, Bayesian versions of above, Fourier analysis of time series, false discovery and nondiscovery rates

sample classification: discriminant analysis, support vector machines, supervised learning

clustering genes and samples: unsupervised learning, hierarchical clustering, k-means clustering, heatmaps


A heatmap

samples of different regions of the brain in humans and chimpanzees

sample clusters show that different regions of the brain cluster more closely than different species

gene clusters show that some genes differentiate among brain regions while others differentiate the 2 species ☺


SNP Microarray experiments

obtain numerical summary for each SNP

estimate SNP frequency: binomial distribution

haplotyping: determine which sets of SNPs come from each of the 2 chromosomes

association with subpopulation: association with disease, ecotype, etc.; multivariate analysis; mixture models
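As a hedged illustration of the binomial model for SNP frequency, the sketch below estimates an allele frequency with a normal-approximation (Wald) interval. The function name and the counts are assumptions; the Wald interval is one simple choice and behaves poorly for rare alleles or small samples.

```python
from math import sqrt

def allele_frequency(alt_count: int, n_chromosomes: int, z: float = 1.96):
    """Estimate an allele frequency and an approximate 95% interval.

    alt_count: number of sampled chromosomes carrying the variant allele,
    modeled as Binomial(n_chromosomes, p).
    """
    p_hat = alt_count / n_chromosomes
    se = sqrt(p_hat * (1 - p_hat) / n_chromosomes)
    return p_hat, (max(0.0, p_hat - z * se), min(1.0, p_hat + z * se))

# Hypothetical data: 50 diploid individuals = 100 chromosomes, 20 variant alleles.
p_hat, (low, high) = allele_frequency(20, 100)
print(round(p_hat, 3), round(low, 3), round(high, 3))  # 0.2 0.122 0.278
```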


Tiling Microarray experiments

obtain numerical summary for each tile on the chromosome

visualization

nonparametric smoothing
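A centered moving average is about the simplest nonparametric smoother and gives a feel for how noisy tile-level signal can be visualized. The sketch below is an assumption-laden example (invented intensities, arbitrary window width); kernel or local regression smoothers are common alternatives.

```python
def moving_average(values, half_width):
    """Smooth `values` with a centered window of up to 2*half_width + 1 points.

    Windows are truncated at the two ends of the sequence.
    """
    smoothed = []
    for i in range(len(values)):
        window = values[max(0, i - half_width): i + half_width + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed

# Hypothetical log-intensities along consecutive tiles, with an expressed region in the middle.
tile_signal = [0.1, 0.3, 0.0, 2.1, 1.8, 2.4, 2.0, 0.2, 0.1, 0.3]
print([round(v, 2) for v in moving_average(tile_signal, half_width=1)])
```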


p >> n

n=#samples

Usually, we have some type of response on the samples which may be quantitative (e.g. body mass index, HDL) or categorical (cancer type, growth stage ...)

p=#measurements

which may be the intensity per gene, exon, locus, promoter region ...


p >> n

n=#samples

Call the response Y, typically an n x 1 vector (e.g. BMI)

p=#measurements

Call the measurements X, an n x p matrix.

Typically we might think that Y is related to a small set of the X measures, e.g. k<<n of the p variables.

e.g. $Y = U\beta + \epsilon$

where U is the n x k matrix of measurements, $\beta$ is an unknown vector of constants and $\epsilon$ is random.

If we try to solve $Y = X\beta$ and X has rank n, we will always find an exact solution.

In fact, if we select any submatrix of columns of X with rank n, we will always find an exact solution, even if those columns are completely independent of U.
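The exact-fit phenomenon is easy to reproduce. In the sketch below, which is an illustration and not the speaker's code, both the response and the n columns are pure random noise, yet solving the n x n system reproduces Y essentially perfectly. The small Gaussian-elimination solver is included only to keep the example dependency-free.

```python
import random

def solve(a, y):
    """Solve the square system a @ beta = y by Gaussian elimination with partial pivoting."""
    n = len(y)
    m = [row[:] + [y[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    beta = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * beta[j] for j in range(i + 1, n))
        beta[i] = (m[i][n] - s) / m[i][i]
    return beta

random.seed(0)
n = 5
y = [random.gauss(0, 1) for _ in range(n)]                      # response: pure noise
x = [[random.gauss(0, 1) for _ in range(n)] for _ in range(n)]  # n unrelated "genes"
beta = solve(x, y)
fitted = [sum(x[i][j] * beta[j] for j in range(n)) for i in range(n)]
# Largest residual is at rounding-error level: a perfect but meaningless fit.
print(max(abs(f - v) for f, v in zip(fitted, y)))
```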


p >> n

Another approach is to try to predict using 1 column of X at a time.

If none of the columns are in U (so that the corresponding coefficients are 0), then, if we do any statistical test for $\beta = 0$ and reject for p-value $< \alpha$, we will reject about $\alpha p$ of the tests and wrongly conclude that the corresponding spots are associated with Y.

Because usually p>10000, we will make a lot of mistakes unless $\alpha$ is extremely small.
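A quick simulation shows the scale of the problem. It assumes every null hypothesis is true, so each p-value is uniform on (0, 1); the function name is hypothetical, and the last threshold is the Bonferroni-style choice of dividing the significance level by the number of tests, a technique not named on the slide.

```python
import random

def count_false_positives(p: int, alpha: float, seed: int = 0) -> int:
    """Simulate p tests in which every null hypothesis is true.

    Under a true null a p-value is uniform on (0, 1), so each test
    rejects with probability alpha.
    """
    rng = random.Random(seed)
    return sum(rng.random() < alpha for _ in range(p))

p = 10_000
for alpha in (0.05, 0.001, 0.05 / p):  # the last is a Bonferroni-style threshold
    print(alpha, "expected:", alpha * p, "simulated:", count_false_positives(p, alpha))
```

With 10,000 tests at a threshold of 0.05, roughly 500 spots are declared associated with Y even though none are.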


p >> n

All of the special methods for analysis of gene expression data are developed to solve the p >> n problem.


p >> n

e.g. 2-sample problem

2 conditions: e.g. cancer, normal

For gene (exon, locus ...) i, we have n samples with p genes and observe

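$Y_{ijk}$, where i = gene id, j = condition id, k = sample id

Usual method: 2-sample t-test:

$$t_i^* = \frac{\bar{Y}_{i,\text{cancer}} - \bar{Y}_{i,\text{normal}}}{\sqrt{\dfrac{S^2_{i,\text{cancer}}}{n} + \dfrac{S^2_{i,\text{normal}}}{n}}}$$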
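For a single gene, the 2-sample t-statistic (difference of the two condition means divided by the square root of the summed per-condition variance over sample size) can be computed as below. The expression values are invented, and the function accepts unequal group sizes as a small generalization.

```python
from math import sqrt
from statistics import mean, variance

def two_sample_t(cancer, normal):
    """Unpooled two-sample t-statistic for one gene."""
    se = sqrt(variance(cancer) / len(cancer) + variance(normal) / len(normal))
    return (mean(cancer) - mean(normal)) / se

# Hypothetical log-expression values for one gene, 4 samples per condition.
cancer = [8.1, 7.9, 8.4, 8.0]
normal = [7.0, 7.2, 6.9, 7.3]
print(round(two_sample_t(cancer, normal), 2))  # 7.07
```

In a real experiment this is repeated for each of the p genes, which is where the multiple-testing issue comes from.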


p >> n

e.g. 2-sample problem

Some ideas to improve selection of

differentially expressed genes:

1) Force all genes to have the same variability (2-way ANOVA) by the normalization step.

2) Assume that there is a distribution of gene means known in advance or estimated from the data (Bayesian or empirical Bayes methods).

3) Use the data to estimate the number of inference errors.

4) Force the data to be normally distributed (within gene) in the normalization step or use bootstrap or permutation methods (suitable for fairly large sample size).

$$t_i^* = \frac{\bar{Y}_{i,\text{cancer}} - \bar{Y}_{i,\text{normal}}}{\sqrt{\dfrac{S^2_{i,\text{cancer}}}{n} + \dfrac{S^2_{i,\text{normal}}}{n}}}$$
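Idea 4 mentions permutation methods. The sketch below is an exact permutation test for one gene that uses the difference in means as the statistic; the data and function name are illustrative assumptions. It also shows why the slide asks for fairly large samples: with 4 samples per condition there are only 70 relabelings, so the p-value can never fall below 2/70.

```python
from itertools import combinations
from statistics import mean

def permutation_p_value(cancer, normal):
    """Exact two-sided permutation p-value for a difference in means.

    Enumerates every way to relabel the pooled samples into two groups
    of the original sizes (feasible only for small samples).
    """
    pooled = cancer + normal
    observed = abs(mean(cancer) - mean(normal))
    indices = range(len(pooled))
    extreme = total = 0
    for group in combinations(indices, len(cancer)):
        chosen = set(group)
        a = [pooled[i] for i in chosen]
        b = [pooled[i] for i in indices if i not in chosen]
        total += 1
        # Small tolerance guards against floating-point ties.
        if abs(mean(a) - mean(b)) >= observed - 1e-12:
            extreme += 1
    return extreme / total

cancer = [8.1, 7.9, 8.4, 8.0]
normal = [7.0, 7.2, 6.9, 7.3]
print(permutation_p_value(cancer, normal))  # 2/70, about 0.0286
```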


Unsolved Problems

People are still working on normalization, differential expression analysis, clustering and classification for gene expression arrays. There are also problems in combining data from other sources including measurements from other platforms, meta-analysis, and data from the literature.

These problems are not dead, but it will be increasingly difficult to find new problems without a paradigm change.

The new arrays (exon, SNP, tiling) will need more new methodology.


Acknowledgements

Francesca Chiaromonte

Floral Genome Project

dePamphilis Lab

Ma Lab

Carlson Lab

McNellis Lab

Pugh Lab

Fedorov Lab

Tony Hua

Xianyun Mao

Bioinformatics Consulting Center, Huck Institute

Diya Zhang

Wenlei Liu

Qing Zhang

Allison Lab (UAB)