1
Microarrays
Naomi Altman
Dept. of Statistics and
PSU
March 7, 2007
2
Microarrays 100: A Statistician’s Simplification
A microarray is a piece of glass or polymer with several thousand spots, each of which contains thousands of copies of a short piece of 1 strand of the "double helix" of DNA or of cDNA (to be explained).
The rungs of the DNA ladder each consist of 2 bound bases, which are designated C, G, A, T. The bound pairs are called base pairs. Each spot on a microarray consists of a piece of one side of the ladder with the attached bases.
C binds only to G. A binds only to T.
A sample containing an unknown number of the complementary strands is labeled and hybridized to the array.
The response is a measure of the quantity of label for each spot, which should be proportional to the number of complementary strands in the sample.
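The pairing rule above determines which strand in the sample can hybridize to a given spot. A minimal Python sketch, assuming strands are written as plain strings of C, G, A, T (the function name `reverse_complement` and the example probe are illustrative, not from the slides):

```python
# Watson-Crick pairing: C binds only to G, A binds only to T.
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(strand: str) -> str:
    """Return the strand that would hybridize to `strand`, read 5'->3'."""
    return "".join(PAIR[base] for base in reversed(strand.upper()))

probe = "CCGTTCACA"                  # hypothetical sequence printed on one spot
target = reverse_complement(probe)   # the labeled strand this spot would capture
print(target)
```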
http://www.bioteach.ubc.ca/MolecularBiology/AMonksFlourishingGarden/
3
Agenda
1. DNA 100: what we need to know to understand what a microarray can measure.
2. What can a microarray measure?
3. Where does the material printed on the microarray come from?
4. What does a microarray experiment "look like" and where do statistical methods fit in?
5. (Time permitting) Gene expression experiments and the p >> n problem.
4
DNA 100: A Statistician’s Simplification
Every cell in an organism has the same genetic material, stored in the double helix of DNA.
In a diploid organism, most cells have 2 copies of each chromosome.
Genes are the parts of the DNA that code for proteins, but there are many other important features that interest biologists.
http://www.accessexcellence.org/RC/VL/GG/chromosome.html
5
Transcription (Making RNA)
www.csu.edu.au/faculty/health/biomed/subjects/molbol/basic.htm
•transcription factors bind to the promoter and bind RNA polymerase
•DNA strands separate and transcription is initiated
•transcription continues along the template strand in the 3'-5' direction (the RNA itself is built 5'-3') until a termination signal is reached
•The completed RNA strand is released for post-processing
6
Introns and Exons
In "higher" organisms, the gene contains noncoding regions, called introns, and coding regions called exons.
The introns are spliced out of the RNA transcript before translation into protein.
"Splicing variants" can be formed by the cell selecting combinations of the exons.
The resulting spliced strand is the mRNA.
We can "predict" exons using statistical algorithms, but the gold standard is that only exons match mRNA sequences.
At each end of the mRNA is an untranslated region (UTR) which is unique to the gene.
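Splicing can be pictured as joining exon intervals and discarding what lies between them. The sketch below is illustrative only: the transcript, the exon coordinates and the function names are made up, and it enumerates every ordered subset of exons, whereas a real cell produces only some of these combinations.

```python
from itertools import combinations

def splice(pre_mrna: str, exons: list[tuple[int, int]]) -> str:
    """Join the exon intervals (0-based, end-exclusive), dropping the introns."""
    return "".join(pre_mrna[start:end] for start, end in exons)

def splice_variants(pre_mrna: str, exons: list[tuple[int, int]]) -> list[str]:
    """Enumerate every non-empty subset of exons, kept in genomic order."""
    variants = []
    for k in range(1, len(exons) + 1):
        for subset in combinations(exons, k):   # combinations preserve input order
            variants.append(splice(pre_mrna, list(subset)))
    return variants

# Hypothetical transcript: upper case = exon, lower case = intron.
pre_mrna = "AUGGCC" + "guaag" + "UUCGA" + "guccag" + "GGAUAA"
exons = [(0, 6), (11, 16), (22, 28)]

print(splice(pre_mrna, exons))                # all three exons joined
print(len(splice_variants(pre_mrna, exons)))  # number of candidate variants
```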
http://biology.unm.edu/ccouncil/Biology_124/Summaries/T&T.html
[Figure labels: chromosome, promoter]
7
cDNA
RNA is much less stable than DNA. To preserve the exon sequence, and for printing microarrays, reverse transcription is used in the lab to convert the RNA into the complementary cDNA.
cDNA can be preserved by inserting it into the genome of a living microbe (cDNA library).
8
DNA 100: A Statistician’s Simplification
DNA is complicated stuff.
Protein-coding regions are called genes. There are also other functional parts of the DNA, some of which code for RNA and some of which are regulatory regions, i.e. they help control how the coding regions are used (e.g. promoters).
The supercoiling of the DNA may also control how the coding regions are used.
As well, there is a lot of DNA which appears to be "junk" - i.e. to date no function is known. But we keep making new discoveries - e.g. some of the "junk" codes for small RNA pieces that are functional.
9
What can be measured on a microarray?
1. Amount of mRNA expressed by a gene.
2. Amount of mRNA expressed by an exon.
3. Amount of RNA expressed by a region of DNA.
4. Which strand of DNA is expressed.
5. Which of several similar DNA sequences is present in the genome.
6. How many copies of a gene are present in the genome.
7. Where a known protein has bound to the DNA. (ChIP on chip)
10
What can be measured on a microarray?
1. Amount of mRNA expressed by a gene.
   gene expression array, exon array, tiling array
2. Amount of mRNA expressed by an exon.
   exon array, tiling array
3. Amount of RNA expressed by a region of DNA.
   tiling array
4. Which strand of DNA is expressed.
   exon array, tiling array
5. Which of several similar DNA sequences is present in the genome.
   SNP array
6. How many copies of a gene are present in the genome.
   gene expression array, exon array, tiling array
7. Where a known protein has bound to the DNA. (ChIP on chip)
   promoter array, tiling array
11
Types of Microarrays
A cDNA microarray can be made from the unsequenced cDNA library. All the other types require that the sequence be available.
[Figure: a gene laid out along the chromosome sequence (promoter, UTR, Exon 1, Exon 2, Exon 3, UTR), with the region covered by each probe type: cDNA (the cDNA sequence), oligo, exon probes, promoter probes, tiles covering the whole chromosome sequence, and SNP probes that differ at a single base, e.g. CCGTTCACA vs. CCGTGCACA and AAGGCCGTT vs. AAGGACGTT.]
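Two of the probe designs in this figure, tiles and SNP probes, are easy to sketch in code. The following is a simplified illustration, assuming non-overlapping fixed-length tiles; the function names and the tile length of 9 are assumptions chosen to match the short sequences shown in the figure, not a real array design.

```python
def tiles(sequence: str, length: int, step: int) -> list[str]:
    """Cut `sequence` into probes of `length` bases starting every `step` bases."""
    return [sequence[i:i + length]
            for i in range(0, len(sequence) - length + 1, step)]

def snp_positions(probe_a: str, probe_b: str) -> list[int]:
    """Return the 0-based positions at which two equal-length probes differ."""
    if len(probe_a) != len(probe_b):
        raise ValueError("probes must have equal length")
    return [i for i, (a, b) in enumerate(zip(probe_a, probe_b)) if a != b]

chromosome = "CCGTTCACATTAGGATACCAGTTCAAGG"   # first 28 bases of the figure's sequence
print(tiles(chromosome, length=9, step=9))     # non-overlapping 9-base tiles
print(snp_positions("CCGTTCACA", "CCGTGCACA")) # where the two SNP probes differ
```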
12
Print Technology
The cDNA or oligo can be:
1. Printed on the slide using an "arraying robot" which deposits a drop of liquid containing the material at each spot. (gene expression only) 40,000+ spots
2. Oligos (all the same length) can be synthesized on the slide using:
i) inkjet technology
ii) photolithography
1,000,000+ spots
3. There are other technologies that give similar types of results (e.g. "beads").
13
Spotted 2-Channel Array
http://www.anst.uu.se/frgra677/bilder/micro_method_large.jpg
Spotted arrays are printed on coated microscope slides.
2 RNA samples are converted to cDNA. Each is labelled with a different dye.
14
Format of an Affymetrix Array
http://cnx.rice.edu/content/m12388/latest/figE.JPG
15
Microarray experiments
Obtain sequence info
select oligos
Print microarray
Print or buy the microarray
16
Microarray experiments
Obtain sequence info
select oligos
Print microarray
Print or buy the microarray
sequencing error, assembly error, contamination
unique, similar hybridization rates
17
Microarray experiments
Obtain sequence info
select oligos
Print microarray
obtain tissue sample
extract RNA
extract mRNA
label
normalize mRNA
Print or buy the microarray Create the labeled samples
18
Microarray experiments
Obtain sequence info
select oligos
Print microarray
obtain tissue sample
extract RNA
extract mRNA
label
normalize mRNA
Print or buy the microarray Create the labeled samples
experimental design: number of biological replicates, technical replicates, blocks, sample pooling
19
Microarray experiments
Obtain sequence info
select oligos
Print microarray
obtain tissue sample
extract RNA
extract mRNA
label
normalize mRNA
Print or buy the microarray Create the labeled samples
hybridize
20
Microarray experiments
Obtain sequence info
select oligos
Print microarray
obtain tissue sample
extract RNA
extract mRNA
label
normalize mRNA
Print or buy the microarray Create the labeled samples
hybridize
hybridization design (multichannel)
21
Microarray experiments
hybridize
scan
detect spots
compute spot summary
detect background
detect bad spots
process image
remove array specific noise
22
Microarray experiments
hybridize
scan
detect spots: spot detection software
compute spot summary: pixel mean, median ...
detect background: background correction
detect bad spots: detection limit, background > foreground, badly printed spots, flaws
process image: using multiple scans
remove array-specific noise: normalization
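The spot summary, background correction and bad-spot steps can be sketched for a single spot as follows. This is a deliberately simple illustration: the pixel values are invented, the median is one of several possible summaries, and subtracting the background is only one of several correction methods.

```python
from statistics import median

def spot_summary(foreground: list[float], background: list[float]) -> dict:
    """Summarize one spot from its foreground and local-background pixel intensities."""
    fg = median(foreground)      # median is robust to a few saturated or dusty pixels
    bg = median(background)
    return {
        "foreground": fg,
        "background": bg,
        "corrected": fg - bg,    # simple subtractive background correction
        "bad_spot": bg > fg,     # flag spots where background exceeds foreground
    }

# Hypothetical pixel intensities for a bright spot and a failed spot.
good = spot_summary([820, 790, 805, 2000, 810], [100, 95, 110, 105])
bad = spot_summary([90, 100, 95], [100, 120, 110])
print(good["corrected"], good["bad_spot"])
print(bad["corrected"], bad["bad_spot"])
```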
23
Rafael A Irizarry, Department of Biostatistics
http://www.biostat.jhsph.edu/~ririzarr
http://www.bioconductor.org
nci 2002
Spot Detection
Adaptive segmentation Fixed circle segmentation
---- GenePix
---- QuantArray
---- ScanAnalyze
Spot uses morphological opening
24
Gene Expression Microarray experiments
obtain numerical summary for each gene or exon on each array
sample classification
clustering genes and samples
differential expression analysis
25
Gene Expression Microarray experiments
obtain numerical summary for each gene or exon on each array: robust methods to downweight outliers, data imputation (if needed)
differential expression analysis: t-tests, ANOVA, Bayesian versions of the above, Fourier analysis of time series, false discovery and nondiscovery rates
sample classification: discriminant analysis, support vector machines, supervised learning
clustering genes and samples: unsupervised learning, hierarchical clustering, k-means clustering, heatmaps
26
A heatmap
samples of different regions of the brain in humans and chimpanzees
sample clusters show that different regions of the brain cluster more closely than different species
gene clusters show that some genes differentiate among brain regions while others differentiate the 2 species ☺
27
SNP Microarray experiments
obtain numerical summary for each SNP
estimate SNP frequency
haplotyping
association with subpopulation
28
SNP Microarray experiments
obtain numerical summary for each SNP
estimate SNP frequency: binomial distribution
haplotyping: determine which sets of SNPs come from each of the 2 chromosomes
association with subpopulation: association with disease, ecotype, etc.; multivariate analysis, mixture models
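For the SNP frequency step, the binomial model gives both an estimate and a standard error. A minimal sketch, assuming each sampled chromosome independently carries the variant with the same probability; the counts and the function name are hypothetical, and the interval uses the usual normal approximation.

```python
from math import sqrt

def allele_frequency(variant_count: int, n_chromosomes: int, z: float = 1.96):
    """Binomial estimate of an allele frequency with a normal-approximation interval."""
    p_hat = variant_count / n_chromosomes
    se = sqrt(p_hat * (1 - p_hat) / n_chromosomes)   # binomial standard error
    lower = max(0.0, p_hat - z * se)
    upper = min(1.0, p_hat + z * se)
    return p_hat, (lower, upper)

# Hypothetical data: 50 diploid individuals = 100 chromosomes, 20 carry the variant.
p_hat, (lower, upper) = allele_frequency(20, 100)
print(round(p_hat, 3), round(lower, 3), round(upper, 3))
```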
29
Tiling Microarray experiments
obtain numerical summary for each codon on the chromosome
visualization
30
Tiling Microarray experiments
obtain numerical summary for each codon on the chromosome
visualization: nonparametric smoothing
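One common nonparametric smoother is a kernel-weighted running average along the chromosome. The sketch below uses a Nadaraya-Watson estimator with a Gaussian kernel as one possible choice; the slides do not say which smoother is used, and the positions, intensities and bandwidth are invented for illustration.

```python
from math import exp

def kernel_smooth(positions, intensities, bandwidth):
    """Nadaraya-Watson smoother with a Gaussian kernel, evaluated at each position."""
    smoothed = []
    for x0 in positions:
        # Nearby tiles get weights close to 1, distant tiles close to 0.
        weights = [exp(-0.5 * ((x - x0) / bandwidth) ** 2) for x in positions]
        total = sum(weights)
        smoothed.append(sum(w * y for w, y in zip(weights, intensities)) / total)
    return smoothed

# Hypothetical tile positions (bp) and log-intensities with one noisy spike.
positions = [0, 25, 50, 75, 100, 125, 150]
intensities = [1.0, 1.2, 0.9, 5.0, 1.1, 1.0, 0.8]
smoothed = kernel_smooth(positions, intensities, bandwidth=25)
print([round(v, 2) for v in smoothed])
```

A larger bandwidth averages over more tiles and gives a smoother but less local curve.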
31
p >> n
n=#samples
Usually, we have some type of response on the samples which may be quantitative (e.g. body mass index, HDL) or categorical (cancer type, growth stage ...)
p=#measurements
which may be the intensity per gene, exon, locus, promoter region ...
32
p >> n
n=#samples
Call the response Y, typically an n x 1 vector (e.g. BMI)
p=#measurements
Call the measurements X, an n x p matrix
33
p >> n
n=#samples
Call the response Y, typically an n x 1 vector (e.g. BMI)
p=#measurements
Call the measurements X, an n x p matrix.
Typically we might think that Y is related to a small set of the X measures, e.g. k<<n of the p variables.
e.g. $Y = U\beta + \varepsilon$
where U is the n x k matrix of measurements, $\beta$ is an unknown vector of constants and $\varepsilon$ is random.
34
p >> n
n=#samples
Call the response Y, typically an n x 1 vector (e.g. BMI)
p=#measurements
Call the measurements X, an n x p matrix.
Typically we might think that Y is related to a small set of the X measures, e.g. k<<n of the p variables.
e.g. $Y = U\beta + \varepsilon$
where U is the n x k matrix of measurements, $\beta$ is an unknown vector of constants and $\varepsilon$ is random.
If we try to solve $Y = X\beta$ and X has rank n, we will always find an exact solution.
In fact, if we select any submatrix of columns of X with rank n, we will always find an exact solution, even if those columns are completely independent of U.
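The exact-fit phenomenon is easy to reproduce. The sketch below is a toy illustration with made-up numbers: it draws a response and n columns of pure noise that have nothing to do with it, then solves the square system with a small hand-written Gaussian elimination (the helper `solve` is written here so that the example needs only the standard library).

```python
import random

def solve(a, b):
    """Solve the square system a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]      # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):                        # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

random.seed(0)
n = 6                                                   # number of samples
y = [random.gauss(25, 4) for _ in range(n)]             # response, e.g. BMI
# n columns of pure noise, generated independently of y.
x_noise = [[random.gauss(0, 1) for _ in range(n)] for _ in range(n)]

beta = solve(x_noise, y)
fitted = [sum(xij * bj for xij, bj in zip(row, beta)) for row in x_noise]
max_residual = max(abs(f - yi) for f, yi in zip(fitted, y))
print(max_residual)   # numerically zero: a "perfect" fit to unrelated columns
```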
35
p >> n
Another approach is to try to predict using 1 column of X at a time.
If none of the columns are in U (so that the corresponding coefficients are 0), then, if we do any statistical test for $\beta = 0$ and reject for p-value < $\alpha$, we will reject about $\alpha p$ of the tests and conclude that the corresponding spots are associated with Y.
Because usually p>10000, we will make a lot of mistakes unless $\alpha$ is extremely small.
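A quick simulation shows the scale of the problem. This sketch assumes the p tests are independent and that every null hypothesis is true, so each p-value is uniform on [0, 1); the function name is illustrative. The last line shows the Bonferroni per-test level, a standard correction that the slide does not name, as one example of how small the threshold must become.

```python
import random

def count_false_positives(p: int, alpha: float, seed: int = 0) -> int:
    """Simulate p independent tests under the null and count rejections at level alpha."""
    rng = random.Random(seed)
    # Under the null hypothesis a p-value is uniformly distributed on [0, 1).
    return sum(rng.random() < alpha for _ in range(p))

p, alpha = 10_000, 0.05          # p = number of measurements (tests), not a p-value
print("expected false positives:", alpha * p)
print("simulated false positives:", count_false_positives(p, alpha))
print("Bonferroni per-test level:", alpha / p)
```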
36
p >> n
All of the special methods for analysis of gene expression data are developed to solve the p >> n problem.
37
p >> n
e.g. 2-sample problem
2 conditions: e.g. cancer, normal
For gene (exon, locus ...) i, we have n samples with p genes and observe
$Y_{ijk}$, where i = gene id, j = condition id, k = sample id
Usual method: 2-sample t-test:
$$t_i^* = \frac{\bar{Y}_{i,\text{cancer}} - \bar{Y}_{i,\text{normal}}}{\sqrt{S^2_{i,\text{cancer}}/n + S^2_{i,\text{normal}}/n}}$$
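The 2-sample t statistic is computed gene by gene. A minimal sketch, assuming the unpooled form of the statistic with a separate sample variance for each condition; the gene names and log-intensities are invented for illustration.

```python
from math import sqrt
from statistics import mean, variance

def two_sample_t(cancer: list[float], normal: list[float]) -> float:
    """Unpooled 2-sample t statistic for one gene."""
    se = sqrt(variance(cancer) / len(cancer) + variance(normal) / len(normal))
    return (mean(cancer) - mean(normal)) / se

# Hypothetical log-intensities: gene id -> (cancer samples, normal samples).
expression = {
    "gene_1": ([8.0, 9.0, 10.0], [5.0, 6.0, 7.0]),
    "gene_2": ([6.0, 7.0, 8.0], [6.0, 7.0, 8.0]),
}
for gene, (cancer, normal) in expression.items():
    print(gene, round(two_sample_t(cancer, normal), 3))
```

With only a few samples per condition the variance estimates are very noisy, which is one reason the statistic behaves poorly when it is applied to thousands of genes at once.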
38
p >> n
e.g. 2-sample problem
Some ideas to improve selection of differentially expressed genes:
1) Force all genes to have the same variability (2-way ANOVA) by the normalization step.
2) Assume that there is a distribution of gene means known in advance or estimated from the data (Bayesian or empirical Bayes methods).
3) Use the data to estimate the number of inference errors.
4) Force the data to be normally distributed (within gene) in the normalization step or use bootstrap or permutation methods (suitable for fairly large sample size).
$$t_i^* = \frac{\bar{Y}_{i,\text{cancer}} - \bar{Y}_{i,\text{normal}}}{\sqrt{S^2_{i,\text{cancer}}/n + S^2_{i,\text{normal}}/n}}$$
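A permutation method for a single gene can be sketched as follows. This version enumerates every relabelling of the samples (an exact test), which is feasible only for small groups; the difference in means is used as the test statistic for simplicity, and the data and function name are hypothetical.

```python
from itertools import combinations
from statistics import mean

def permutation_p_value(group_a: list[float], group_b: list[float]) -> float:
    """Exact two-sided permutation p-value for the difference in group means."""
    pooled = group_a + group_b
    observed = abs(mean(group_a) - mean(group_b))
    extreme = total = 0
    # Enumerate every way of relabelling len(group_a) pooled samples as group A.
    for idx in combinations(range(len(pooled)), len(group_a)):
        chosen = set(idx)
        a = [pooled[i] for i in idx]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        if abs(mean(a) - mean(b)) >= observed - 1e-12:   # tolerance for float ties
            extreme += 1
    return extreme / total

cancer = [8.1, 9.0, 9.4, 10.2]     # hypothetical log-intensities for one gene
normal = [5.0, 5.9, 6.3, 7.1]
print(permutation_p_value(cancer, normal))
```

With 4 samples per condition there are only 70 relabellings, so the smallest attainable two-sided p-value is 2/70, far too large to survive a multiple-testing correction over thousands of genes. This illustrates why such methods need a fairly large sample size.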
39
Unsolved Problems
People are still working on normalization, differential expression analysis, clustering and classification for gene expression arrays. There are also problems in combining data from other sources including measurements from other platforms, meta-analysis, and data from the literature.
These problems are not dead, but it will be increasingly difficult to find new problems without a paradigm change.
The new arrays (exon, SNP, tiling) will need more new methodology.
40
Acknowledgements
Francesca Chiaromonte
Floral Genome Project, dePamphilis Lab, Ma Lab, Carlson Lab, McNellis Lab, Pugh Lab, Fedorov Lab, Tony Hua, Xianyun Mao
Bioinformatics Consulting Center, Huck Institute, Diya Zhang, Wenlei Liu, Qing Zhang
Allison Lab (UAB)