
1

Microarrays

Naomi Altman

Dept. of Statistics and

PSU

March 7, 2007

2

Microarrays 100: A Statistician’s Simplification

A microarray is a piece of glass or polymer with several thousand spots, each of which contains thousands of copies of a short piece of 1 strand of the "double helix" of DNA or of cDNA (to be explained).

Each rung of the DNA ladder consists of 2 bound bases (nucleotides), each designated C, G, A, or T. The bound pairs are called base pairs. Each spot on a microarray consists of a piece of one side of the ladder with the attached bases.

C binds only to G. A binds only to T.

A sample containing an unknown number of the complementary strands is labeled and hybridized to the array.

The response is a measure of the quantity of label for each spot, which should be proportional to the number of complementary strands in the sample.
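The pairing rule and the idea of hybridization can be sketched in a few lines of Python. This is only an illustration: the probe and sample sequences are made up, and real hybridization also tolerates partial matches, which this sketch ignores.

```python
# Watson-Crick pairing: C binds only to G, A binds only to T.
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(strand: str) -> str:
    """Return the strand that binds to `strand`, written 5'->3'."""
    return "".join(PAIR[base] for base in reversed(strand))


def count_binding_sites(probe: str, sample_strand: str) -> int:
    """Count positions where the sample strand is fully complementary to the probe."""
    target = reverse_complement(probe)
    return sum(
        sample_strand[i:i + len(target)] == target
        for i in range(len(sample_strand) - len(target) + 1)
    )


probe = "CCGTTCACA"                  # hypothetical sequence printed on one spot
sample = "AATGTGAACGGTTTGTGAACGGA"   # hypothetical labeled strand in the sample

print(reverse_complement(probe))           # TGTGAACGG
print(count_binding_sites(probe, sample))  # 2
```

The more complementary strands the sample contains, the more label ends up on the spot, which is what the scanner measures.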

http://www.bioteach.ubc.ca/MolecularBiology/AMonksFlourishingGarden/

3

Agenda

1. DNA 100 - what we need to know to understand what a microarray can measure.

2. What can a microarray measure?

3. Where does the material printed on the microarray come from?

4. What does a microarray experiment "look like" and where do statistical methods fit in?

5. (Time permitting) Gene expression experiments and the p >> n problem.

4

DNA 100: A Statistician’s Simplification

Every cell in an organism has the same genetic material, stored in the double helix of DNA.

In a diploid organism, most cells have 2 copies of each chromosome.

Genes are the parts of the DNA that code for proteins, but there are many other important features that interest biologists.

http://www.accessexcellence.org/RC/VL/GG/chromosome.html

5

Transcription (Making RNA)

www.csu.edu.au/faculty/health/biomed/subjects/molbol/basic.htm

• Transcription factors bind to the promoter and bind RNA polymerase.

• DNA strands separate and transcription is initiated.

• Transcription proceeds along the template strand in the 3'→5' direction (so the RNA grows 5'→3') until a termination signal is reached.

• The completed RNA strand is released for post-processing.

6

Introns and Exons

In "higher" organisms, the gene contains noncoding regions, called introns, and coding regions called exons.

The introns are spliced out of the primary transcript (pre-mRNA) before translation into protein.

"Splicing variants" can be formed by the cell selecting combinations of the exons.

The resulting spliced strand is the mRNA.

We can "predict" exons using statistical algorithms, but the gold standard is that only exons match mRNA sequences.

At each end of the mRNA is an untranslated region (UTR) which is unique to the gene.

http://biology.unm.edu/ccouncil/Biology_124/Summaries/T&T.html

[Figure labels: chromosome, promoter]

7

cDNA

RNA is much less stable than DNA. To preserve the exon sequence, and for printing microarrays, reverse transcription is used in the lab to convert the RNA into the complementary cDNA.

cDNA can be preserved by inserting it into the genome of a living microbe (cDNA library).
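As a rough sketch of what splicing and reverse transcription do to the sequence, the Python below joins two exons into an mRNA and then writes out the complementary DNA strand. The sequence and exon coordinates are invented for illustration; only the pairing rules (with U in RNA taking the place of T) are standard.

```python
# RNA base -> complementary DNA base (RNA uses U where DNA uses T).
RNA_TO_CDNA = {"A": "T", "U": "A", "C": "G", "G": "C"}


def splice(pre_mrna: str, exons: list) -> str:
    """Join the exon intervals (start, end), end exclusive, dropping the introns."""
    return "".join(pre_mrna[start:end] for start, end in exons)


def reverse_transcribe(mrna: str) -> str:
    """Return the first-strand cDNA complementary to the mRNA, written 5'->3'."""
    return "".join(RNA_TO_CDNA[base] for base in reversed(mrna))


# Hypothetical primary transcript: exon, intron, exon.
pre_mrna = "AUGGCC" + "GUAAGUCAG" + "UUCUAA"
exons = [(0, 6), (15, 21)]

mrna = splice(pre_mrna, exons)
cdna = reverse_transcribe(mrna)
print(mrna)  # AUGGCCUUCUAA
print(cdna)  # TTAGAAGGCCAT
```

Because the cDNA is copied from spliced mRNA, it carries only the exon sequence.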

8

DNA 100: A Statistician’s Simplification

DNA is complicated stuff.

Protein-coding regions are called genes. There are also other functional parts of the DNA. Some code for RNA, and some are regulatory regions (e.g. promoters) that help control how the coding regions are used.

The supercoiling of the DNA may also control how the coding regions are used.

As well, there is a lot of DNA which appears to be "junk" - i.e. to date no function is known. But we keep making new discoveries - e.g. some of the "junk" codes for small RNA pieces that are functional.

9

What can be measured on a microarray?

1. Amount of mRNA expressed by a gene.

2. Amount of mRNA expressed by an exon.

3. Amount of RNA expressed by a region of DNA.

4. Which strand of DNA is expressed.

5. Which of several similar DNA sequences is present in the genome.

6. How many copies of a gene are present in the genome.

7. Where a known protein has bound to the DNA. (ChIP on chip)

10

What can be measured on a microarray?

1. Amount of mRNA expressed by a gene.
   gene expression array, exon array, tiling array

2. Amount of mRNA expressed by an exon.
   exon array, tiling array

3. Amount of RNA expressed by a region of DNA.
   tiling array

4. Which strand of DNA is expressed.
   exon array, tiling array

5. Which of several similar DNA sequences is present in the genome.
   SNP array

6. How many copies of a gene are present in the genome.
   gene expression array, exon array, tiling array

7. Where a known protein has bound to the DNA. (ChIP on chip)
   promoter array, tiling array

11

Types of Microarrays

A cDNA microarray can be made from the unsequenced cDNA library. All the other types require that the sequence be available.

[Figure: a gene drawn as UTR, Exon 1, Exon 2, Exon 3, UTR, with the probe types marked against it: cDNA (cDNA sequence), oligo (exons), tiles (laid end to end along the chromosome sequence CCGTTCACATTAGGATACCAGTTCAAGG...), promoter, and SNP (CCGTTCACA vs. CCGTGCACA; AAGGCCGTT vs. AAGGACGTT).]
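Two of the probe types in the figure are easy to mimic in code. The sketch below cuts a sequence into end-to-end tiles and locates the single-base difference between two SNP variants. The sequences are the ones shown on the slide; the tile length of 7 is an arbitrary choice for the example, not a value from the talk.

```python
def tile(sequence: str, tile_length: int, step: int) -> list:
    """Return (start, tile) pairs laid along the sequence from left to right."""
    return [
        (start, sequence[start:start + tile_length])
        for start in range(0, len(sequence) - tile_length + 1, step)
    ]


def mismatch_positions(a: str, b: str) -> list:
    """Return the 0-based positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


chromosome = "CCGTTCACATTAGGATACCAGTTCAAGG"
tiles = tile(chromosome, tile_length=7, step=7)  # non-overlapping tiles
for start, probe in tiles:
    print(start, probe)

# The two SNP probe pairs from the slide each differ at one base.
print(mismatch_positions("CCGTTCACA", "CCGTGCACA"))
print(mismatch_positions("AAGGCCGTT", "AAGGACGTT"))
```

Setting `step` smaller than `tile_length` would give overlapping tiles instead.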

12

Print Technology

The cDNA or oligo can be:

1. Printed on the slide using an "arraying robot" which deposits a drop of liquid containing the material at each spot (gene expression only). 40,000+ spots

2. Synthesized on the slide (oligos, all the same length) using:
   i) inkjet technology
   ii) photolithography
   1,000,000+ spots

3. There are other technologies that give similar types of results (e.g. "beads").

13

Spotted 2-Channel Array

http://www.anst.uu.se/frgra677/bilder/micro_method_large.jpg

Spotted arrays are printed on coated microscope slides.

2 RNA samples are converted to cDNA. Each is labelled with a different dye.

14

Format of an Affymetrix Array

http://cnx.rice.edu/content/m12388/latest/figE.JPG

15

Microarray experiments

Print or buy the microarray

• Obtain sequence info: sequencing error, assembly error, contamination
• Select oligos: unique, similar hybridization rates
• Print microarray

Create the labeled samples

• Obtain tissue sample: experimental design (number of biological replicates, technical replicates, blocks, sample pooling)
• Extract RNA
• Extract mRNA
• Label
• Normalize mRNA

Hybridize

• Hybridization design (multichannel)

21

Microarray experiments

• Hybridize
• Scan
• Process image (using multiple scans)
  - Detect spots: spot detection software
  - Compute spot summary: pixel mean, median ...
  - Detect background: background correction
  - Detect bad spots: detection limit, background > foreground, badly printed spots, flaws
• Remove array-specific noise: normalization
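The spot summary, background correction and bad-spot steps can be sketched as follows. This is a simplified illustration, not what any particular image-analysis package does: it assumes the pixels have already been assigned to foreground or background, uses the median as the summary, and flags a spot only when the background is at least as bright as the foreground. The pixel values are invented.

```python
from statistics import median


def summarize_spot(foreground_pixels, background_pixels):
    """Summarize one spot from its foreground and local background pixel intensities."""
    fg = median(foreground_pixels)   # median resists a few saturated or dusty pixels
    bg = median(background_pixels)
    return {
        "foreground": fg,
        "background": bg,
        "signal": fg - bg,           # background-corrected intensity
        "bad_spot": bg >= fg,        # background >= foreground: nothing usable
    }


# Hypothetical pixel intensities for two spots.
good = summarize_spot([820, 790, 805, 1500, 810], [100, 120, 95, 110])
bad = summarize_spot([90, 100, 95], [110, 120, 105, 115])
print(good)
print(bad)
```

Using the mean instead of the median would let the single bright pixel (1500) pull the first spot's summary upward.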

23

Rafael A. Irizarry, Department of Biostatistics

http://www.biostat.jhsph.edu/~ririzarr

http://www.bioconductor.org

NCI 2002

Spot Detection

[Figure panels: adaptive segmentation, fixed circle segmentation; legend: GenePix, QuantArray, ScanAlyze]

Spot uses morphological opening




24

Gene Expression Microarray experiments

Obtain numerical summary for each gene or exon on each array: robust methods to downweight outliers, data imputation (if needed)

• Sample classification: discriminant analysis, support vector machines, supervised learning
• Clustering genes and samples: unsupervised learning, hierarchical clustering, k-means clustering, heatmaps
• Differential expression analysis: t-tests, ANOVA, Bayesian versions of above, Fourier analysis of time series, false discovery and nondiscovery rates

26

A heatmap

samples of different regions of the brain in humans and chimpanzees

sample clusters show that different regions of the brain cluster more closely than different species

gene clusters show that some genes differentiate among brain regions while others differentiate the 2 species ☺

27

SNP Microarray experiments

Obtain numerical summary for each SNP

• Estimate SNP frequency: binomial distribution
• Haplotyping: determine which sets of SNPs come from each of the 2 chromosomes
• Association with subpopulation: association with disease, ecotype, etc.; multivariate analysis; mixture models
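For the "estimate SNP frequency" step, a minimal binomial sketch is shown below. It assumes the sampled chromosomes are independent draws from the population (so the allele count is binomial), which is an assumption of this example rather than something stated on the slide. The genotype counts are invented.

```python
from math import sqrt


def allele_frequency(copies_of_allele: int, n_chromosomes: int):
    """Binomial estimate of an allele frequency and its standard error."""
    p_hat = copies_of_allele / n_chromosomes
    std_err = sqrt(p_hat * (1 - p_hat) / n_chromosomes)
    return p_hat, std_err


# Hypothetical genotype counts at one SNP in 50 diploid individuals.
genotypes = {"AA": 30, "AG": 15, "GG": 5}
n_chromosomes = 2 * sum(genotypes.values())          # 2 chromosomes per individual
copies_of_g = genotypes["AG"] + 2 * genotypes["GG"]  # heterozygotes carry one G

p_hat, std_err = allele_frequency(copies_of_g, n_chromosomes)
print(p_hat, std_err)
```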

29

Tiling Microarray experiments

Obtain numerical summary for each tile (probe position) along the chromosome

• Visualization: nonparametric smoothing
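One simple form of nonparametric smoothing is a Gaussian kernel (Nadaraya-Watson) smoother, sketched below. The slide does not say which smoother is intended, so this is just one common choice; the positions, intensities and bandwidth are hypothetical.

```python
from math import exp


def kernel_smooth(positions, intensities, bandwidth):
    """Gaussian-kernel weighted average of the intensities at each position."""
    smoothed = []
    for x0 in positions:
        weights = [exp(-0.5 * ((x - x0) / bandwidth) ** 2) for x in positions]
        total = sum(weights)
        smoothed.append(sum(w * y for w, y in zip(weights, intensities)) / total)
    return smoothed


# Hypothetical tile start positions (bases) and log-intensities:
# a noisy baseline with an expressed region in the middle.
positions = list(range(0, 250, 25))
intensities = [1.0, 1.2, 0.9, 1.1, 5.0, 5.3, 4.8, 5.1, 1.0, 0.9]

smoothed = kernel_smooth(positions, intensities, bandwidth=25)
for pos, raw, fit in zip(positions, intensities, smoothed):
    print(pos, raw, round(fit, 2))
```

A larger bandwidth gives a smoother curve but blurs the edges of the expressed region.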

31

p >> n

n=#samples

Usually, we have some type of response on the samples which may be quantitative (e.g. body mass index, HDL) or categorical (cancer type, growth stage ...)

p=#measurements

which may be the intensity per gene, exon, locus, promoter region ...

32

p >> n

n=#samples

Call the response Y, typically an n x 1 vector (e.g. BMI)

p=#measurements

Call the measurements X, an n x p matrix


34

p >> n

n=#samples

p=#measurements

Call the measurements X, an n x p matrix.

Typically we might think that Y is related to a small set of the X measures, e.g. k<<n of the p variables.

e.g. Y = Uβ + ε

where U is the n x k matrix of measurements, β is an unknown vector of constants and ε is random.

If we try to solve Y = Xβ and X has rank n, we will always find an exact solution.

In fact, if we select any submatrix of columns of X with rank n, we will always find an exact solution, even if those columns are completely independent of U.
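The exact-fit claim can be checked numerically. In the sketch below, Y and X are independent random noise, yet any n columns of X reproduce Y exactly. The sizes (n = 6, p = 1000) and the small Gaussian-elimination solver are choices made for this example, written in plain Python so that it has no dependencies.

```python
import random


def solve(a, y):
    """Solve the square system a b = y by Gaussian elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [yi] for row, yi in zip(a, y)]  # augmented matrix [a | y]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    b = [0.0] * n
    for r in reversed(range(n)):  # back substitution
        b[r] = (m[r][n] - sum(m[r][c] * b[c] for c in range(r + 1, n))) / m[r][r]
    return b


random.seed(0)
n, p = 6, 1000
y = [random.gauss(0, 1) for _ in range(n)]                      # response: pure noise
x = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]  # unrelated to y

cols = random.sample(range(p), n)  # any n columns; rank n with probability 1
sub = [[row[j] for j in cols] for row in x]
b = solve(sub, y)

residuals = [yi - sum(s * bj for s, bj in zip(row, b)) for row, yi in zip(sub, y)]
max_abs_residual = max(abs(r) for r in residuals)
print(max_abs_residual)  # zero up to rounding error
```

A perfect fit therefore says nothing about whether the chosen columns are related to Y.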

35

p >> n

Another approach is to try to predict Y using 1 column of X at a time.

If none of the columns are in U (so that the corresponding coefficients are 0), then, if we do any statistical test for β=0 and reject for p-value < α, we will reject about αp of the tests and wrongly conclude that the corresponding spots are associated with Y.

Because usually p>10000, we will make a lot of mistakes unless α is extremely small.
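The arithmetic behind this warning, and two standard responses to it, can be sketched as follows. The Bonferroni threshold and the Benjamini-Hochberg procedure are well-known methods that I am using here for illustration; the slides do not name a specific procedure. The list of p-values is invented.

```python
def expected_false_positives(n_tests: int, alpha: float) -> float:
    """Expected number of rejections when every null hypothesis is true."""
    return alpha * n_tests


def bonferroni_threshold(n_tests: int, alpha: float) -> float:
    """Per-test cutoff that keeps the chance of any false positive below alpha."""
    return alpha / n_tests


def benjamini_hochberg(p_values, q):
    """Return the indices rejected by the Benjamini-Hochberg step-up procedure at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= q * rank / m:
            cutoff = rank  # largest rank whose p-value passes its threshold
    return sorted(order[:cutoff])


print(expected_false_positives(10000, 0.05))  # about 500 false positives
print(bonferroni_threshold(10000, 0.05))      # a far stricter per-test cutoff

p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
rejected = benjamini_hochberg(p_values, q=0.05)
print(rejected)
```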

36

p >> n

All of the special methods for analysis of gene expression data are developed to solve the p >> n problem.

37

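p >> n

e.g. 2-sample problem

2 conditions: e.g. cancer, normal

We have n samples with p genes. For gene (exon, locus ...) i, we observe

Y_ijk, where i = gene id, j = condition id, k = sample id

Usual method: 2-sample t-test:

$$ t_i^* = \frac{\bar{Y}_{i,\text{cancer}} - \bar{Y}_{i,\text{normal}}}{\sqrt{\dfrac{S^2_{i,\text{cancer}}}{n} + \dfrac{S^2_{i,\text{normal}}}{n}}} $$

where \bar{Y} and S^2 are the sample mean and sample variance for gene i within a condition.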
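Applied gene by gene, the statistic can be computed as in the sketch below. The gene names and expression values are invented, and the code uses the unpooled form (each condition's own sample variance divided by its own sample size), which matches the formula when both conditions have the same number of samples.

```python
from math import sqrt
from statistics import mean, variance


def two_sample_t(cancer, normal):
    """Two-sample t statistic for one gene, using each condition's sample variance."""
    std_err = sqrt(variance(cancer) / len(cancer) + variance(normal) / len(normal))
    return (mean(cancer) - mean(normal)) / std_err


# Hypothetical log-expression values: gene -> (cancer samples, normal samples).
expression = {
    "gene_a": ([8.1, 7.9, 8.3, 8.0], [6.0, 6.2, 5.9, 6.1]),
    "gene_b": ([5.0, 5.4, 4.8, 5.2], [5.1, 4.9, 5.3, 5.1]),
}

t_stats = {gene: two_sample_t(c, n) for gene, (c, n) in expression.items()}
for gene, t in t_stats.items():
    print(gene, round(t, 2))
```

With thousands of genes and only a handful of samples each, the variance estimates in the denominator are very noisy, which motivates the ideas for improving gene selection.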

38

p >> n

e.g. 2-sample problem

Some ideas to improve selection of differentially expressed genes:

1) Force all genes to have the same variability (2-way ANOVA) by the normalization step.

2) Assume that there is a distribution of gene means known in advance or estimated from the data (Bayesian or empirical Bayes methods).

3) Use the data to estimate the number of inference errors.

4) Force the data to be normally distributed (within gene) in the normalization step or use bootstrap or permutation methods (suitable for fairly large sample size).

$$ t_i^* = \frac{\bar{Y}_{i,\text{cancer}} - \bar{Y}_{i,\text{normal}}}{\sqrt{\dfrac{S^2_{i,\text{cancer}}}{n} + \dfrac{S^2_{i,\text{normal}}}{n}}} $$
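Idea 4 mentions permutation methods. A minimal sketch for a single gene is given below: the condition labels are shuffled many times and the observed difference in means is compared with the shuffled differences. For simplicity it uses the difference in means rather than the t statistic, and the number of permutations, the seed and the data are arbitrary choices for this example.

```python
import random
from statistics import mean


def permutation_p_value(cancer, normal, n_permutations=2000, seed=0):
    """Two-sided permutation p-value for a difference in means between two conditions."""
    rng = random.Random(seed)
    observed = abs(mean(cancer) - mean(normal))
    pooled = list(cancer) + list(normal)
    k = len(cancer)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # random reassignment of condition labels
        if abs(mean(pooled[:k]) - mean(pooled[k:])) >= observed:
            hits += 1
    # Add one to numerator and denominator so the p-value is never exactly 0.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical log-expression values for two genes, 4 samples per condition.
p_separated = permutation_p_value([8.1, 7.9, 8.3, 8.0], [6.0, 6.2, 5.9, 6.1])
p_similar = permutation_p_value([5.0, 5.4, 4.8, 5.2], [5.1, 4.9, 5.3, 5.1])
print(p_separated, p_similar)
```

With 4 samples per condition there are only 70 distinct ways to split the 8 values, so the p-value cannot get much below 2/70 however strongly the gene is expressed. This is why such methods need a fairly large sample size.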

39

Unsolved Problems

People are still working on normalization, differential expression analysis, clustering and classification for gene expression arrays. There are also problems in combining data from other sources including measurements from other platforms, meta-analysis, and data from the literature.

These problems are not dead, but it will be increasingly difficult to find new problems without a paradigm change.

The new arrays (exon, SNP, tiling) will need more new methodology.

40

Acknowledgements

Francesca Chiaromonte

Floral Genome Project, dePamphilis Lab, Ma Lab, Carlson Lab, McNellis Lab, Pugh Lab, Fedorov Lab, Tony Hua, Xianyun Mao

Bioinformatics Consulting Center, Huck Institute, Diya Zhang, Wenlei Liu, Qing Zhang

Allison Lab (UAB)