Summarization of Oligonucleotide Expression Arrays
![Page 1: Summarization of Oligonucleotide Expression Arrays](https://reader035.fdocuments.in/reader035/viewer/2022062315/568160dc550346895dd00bcb/html5/thumbnails/1.jpg)
Summarization of Oligonucleotide Expression Arrays
BIOS 691-803 Winter 2010
![Page 2: Summarization of Oligonucleotide Expression Arrays](https://reader035.fdocuments.in/reader035/viewer/2022062315/568160dc550346895dd00bcb/html5/thumbnails/2.jpg)
What is Summarization?
• Some expression arrays (Affymetrix, Nimblegen) use multiple probes to target a single transcript – a ‘probe set’
• Typically probes have different fold changes between any two samples
• How to effectively summarize the information in a probe set?
![Page 3: Summarization of Oligonucleotide Expression Arrays](https://reader035.fdocuments.in/reader035/viewer/2022062315/568160dc550346895dd00bcb/html5/thumbnails/3.jpg)
Many Probes for One Gene
[Figure: gene sequence (5´ to 3´) covered by multiple oligo probes, each as a Perfect Match / Mismatch pair]
How to combine signals from multiple probes into a single gene abundance estimate?
![Page 4: Summarization of Oligonucleotide Expression Arrays](https://reader035.fdocuments.in/reader035/viewer/2022062315/568160dc550346895dd00bcb/html5/thumbnails/4.jpg)
Probe Variation
• Individual probes don’t agree on fold changes
• Probes vary by two orders of magnitude on each chip
– GC content is the most important factor in signal strength
[Figure: signal from 16 probes along one gene on one chip]
![Page 5: Summarization of Oligonucleotide Expression Arrays](https://reader035.fdocuments.in/reader035/viewer/2022062315/568160dc550346895dd00bcb/html5/thumbnails/5.jpg)
Probe Measure Variation
• Typical probes differ by two orders of magnitude!
• GC content is the most important factor
• RNA target folding also affects hybridization
[Figure: probe signal intensities, axis from 0 to 3×10^4]
![Page 6: Summarization of Oligonucleotide Expression Arrays](https://reader035.fdocuments.in/reader035/viewer/2022062315/568160dc550346895dd00bcb/html5/thumbnails/6.jpg)
Bioinformatics Issues
• Probes may not map accurately
• SNPs in probes
• Affymetrix places most probes in the 3’ UTR of genes
– Alternate poly-A sites mean that some probe targets may really be less common than others
![Page 7: Summarization of Oligonucleotide Expression Arrays](https://reader035.fdocuments.in/reader035/viewer/2022062315/568160dc550346895dd00bcb/html5/thumbnails/7.jpg)
Probe Mapping
• Early builds of the genome often confused regions or genes and their complements
• Probe sets at right represent probe sets for rRNA gene and its complement
![Page 8: Summarization of Oligonucleotide Expression Arrays](https://reader035.fdocuments.in/reader035/viewer/2022062315/568160dc550346895dd00bcb/html5/thumbnails/8.jpg)
Alternate Poly-Adenylation Sites
• Poly-A marks mRNA ‘tail’
• Many genes have alternatives
• 3’ UTR may be longer or shorter
![Page 9: Summarization of Oligonucleotide Expression Arrays](https://reader035.fdocuments.in/reader035/viewer/2022062315/568160dc550346895dd00bcb/html5/thumbnails/9.jpg)
Alternate Polyadenylation of MID1
![Page 10: Summarization of Oligonucleotide Expression Arrays](https://reader035.fdocuments.in/reader035/viewer/2022062315/568160dc550346895dd00bcb/html5/thumbnails/10.jpg)
Many Approaches to Summarization
• Affymetrix MicroArray Suite; PLIER
• dChip – Li and Wong, HSPH
• Bioconductor:
– RMA – Bolstad, Irizarry, Speed, et al.
– affyPLM – Bolstad
– gcRMA – Wu
• Physical chemistry models – Zhang et al.
• Factor model
• Probe-weighting
![Page 11: Summarization of Oligonucleotide Expression Arrays](https://reader035.fdocuments.in/reader035/viewer/2022062315/568160dc550346895dd00bcb/html5/thumbnails/11.jpg)
Critique of Averaging (MAS5)
• Not clear what an average of different probes should mean
• Tukey bi-weight can be unstable when data cluster at either end – frequently the conditions here
• No ‘learning’ based on cross-chip performance of individual probes
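The instability noted above can be illustrated with a minimal sketch of a one-step Tukey biweight. The tuning constants `c = 5` and `eps = 1e-4` are the values commonly quoted for MAS5 and should be treated as assumptions here; the probe values are hypothetical, and this is not the full MAS5 pipeline.

```python
import numpy as np

def tukey_biweight(x, c=5.0, eps=1e-4):
    """One-step Tukey biweight average of the values in x."""
    x = np.asarray(x, dtype=float)
    center = np.median(x)
    mad = np.median(np.abs(x - center))        # median absolute deviation
    u = (x - center) / (c * mad + eps)         # scaled distance from the median
    # Values with |u| >= 1 get zero weight; the rest are down-weighted smoothly
    w = np.where(np.abs(u) < 1.0, (1.0 - u**2) ** 2, 0.0)
    return float(np.sum(w * x) / np.sum(w))

# Hypothetical log2 probe values clustered at the two ends of the range
low_majority = [6.0, 6.1, 6.2, 11.0, 11.1]     # three low probes, two high
high_majority = [6.0, 6.1, 11.0, 11.1, 11.2]   # two low probes, three high

print(tukey_biweight(low_majority))
print(tukey_biweight(high_majority))
```

When the values split into two tight clusters, the median sits inside whichever cluster holds the majority and the other cluster receives zero weight, so moving a single probe from one cluster to the other shifts the summary by roughly the distance between the clusters.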
![Page 12: Summarization of Oligonucleotide Expression Arrays](https://reader035.fdocuments.in/reader035/viewer/2022062315/568160dc550346895dd00bcb/html5/thumbnails/12.jpg)
Motivation for multi-chip models:
Probe-level data from a spike-in study (log scale); note the parallel trend of all probes
Courtesy of Terry Speed
![Page 13: Summarization of Oligonucleotide Expression Arrays](https://reader035.fdocuments.in/reader035/viewer/2022062315/568160dc550346895dd00bcb/html5/thumbnails/13.jpg)
Model for Probe Signal
• Each probe signal is proportional to
– i) the amount of target sample – a
– ii) the affinity of the specific probe sequence to the target – f
• NB: High affinity is not the same as specificity
– Probe can give high signal to intended target and also to other transcripts
[Figure: grid of probes 1, 2, 3 (affinities f1, f2, f3) by chip 1 and chip 2 (amounts a1, a2)]
![Page 14: Summarization of Oligonucleotide Expression Arrays](https://reader035.fdocuments.in/reader035/viewer/2022062315/568160dc550346895dd00bcb/html5/thumbnails/14.jpg)
Multiplicative Model
• For each gene, a set of probes $p_1, \ldots, p_k$
• Each probe $p_j$ binds the gene with efficiency $f_j$
• In each sample there is an amount $a_i$
• Probe intensity should be proportional to $f_j \times a_i$
• Always some noise!
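A minimal simulation sketch of the multiplicative model above. The amounts, efficiencies, and the choice of multiplicative log-normal noise are illustrative assumptions, and `simulate_probe_set` is a hypothetical helper name.

```python
import numpy as np

def simulate_probe_set(a, f, sigma, rng):
    """Return a samples x probes intensity matrix with entries a_i * f_j * noise."""
    a = np.asarray(a, dtype=float)
    f = np.asarray(f, dtype=float)
    # Multiplicative noise: exp of a normal draw, so sigma = 0 means no noise
    noise = np.exp(rng.normal(0.0, sigma, size=(a.size, f.size)))
    return np.outer(a, f) * noise

rng = np.random.default_rng(0)
amounts = [100.0, 200.0, 400.0, 800.0]        # a_i, one per sample (hypothetical)
efficiencies = [0.5, 1.0, 4.0, 20.0, 50.0]    # f_j, spanning two orders of magnitude

intensity = simulate_probe_set(amounts, efficiencies, sigma=0.1, rng=rng)

# On the log scale each probe traces the same profile log2(a_i),
# shifted by its own constant log2(f_j): the "parallel trend" of the probes.
log_intensity = np.log2(intensity)
centered = log_intensity - log_intensity.mean(axis=0)
print(np.round(centered, 2))
```

After removing each probe's own mean, the columns of `centered` agree up to noise, which is why an additive fit on the log scale is a natural next step.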
![Page 15: Summarization of Oligonucleotide Expression Arrays](https://reader035.fdocuments.in/reader035/viewer/2022062315/568160dc550346895dd00bcb/html5/thumbnails/15.jpg)
Robust Linear Models
• Criterion of fit
– Least median squares
– Sum of weighted squares
– Least squares and throw out outliers
• Method for finding fit
– High-dimensional search
– Iteratively re-weighted least squares
– Median polish
![Page 16: Summarization of Oligonucleotide Expression Arrays](https://reader035.fdocuments.in/reader035/viewer/2022062315/568160dc550346895dd00bcb/html5/thumbnails/16.jpg)
Bolstad, Irizarry, Speed – (RMA)
• For each probe set, take log of $PM_{ij} = a_i f_j$:
$$\log(PM_{ij}) = \log(a_i) + \log(f_j)$$
• then fit the model:
$$\log(\widehat{PM}_{ij}) = \log(a_i) + \log(f_j) + \varepsilon_{ij}$$
• where the caret represents “after pre-processing”
• Fit this additive model by iteratively re-weighted least squares or median polish
Critique: Model assumes probe noise is constant (homoscedastic) on log scale
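A minimal sketch of median polish applied to one probe set, assuming a chips x probes matrix of already pre-processed log intensities. The function name, iteration limit, and tolerance are illustrative choices, not the settings of any particular package.

```python
import numpy as np

def median_polish(y, max_iter=10, tol=1e-6):
    """Fit y[i, j] ~ overall + row[i] + col[j] by alternately sweeping out medians."""
    resid = np.array(y, dtype=float)
    overall = 0.0
    row = np.zeros(resid.shape[0])   # chip effects
    col = np.zeros(resid.shape[1])   # probe effects
    for _ in range(max_iter):
        # Sweep row medians out of the residuals into the chip effects
        row_med = np.median(resid, axis=1)
        resid -= row_med[:, None]
        row += row_med
        shift = np.median(col)
        col -= shift
        overall += shift
        # Sweep column medians out of the residuals into the probe effects
        col_med = np.median(resid, axis=0)
        resid -= col_med[None, :]
        col += col_med
        shift = np.median(row)
        row -= shift
        overall += shift
        if max(np.max(np.abs(row_med)), np.max(np.abs(col_med))) < tol:
            break
    return overall, row, col, resid

# Hypothetical probe set: 4 chips, 5 probes, exactly additive plus one outlier
log_amount = np.array([7.0, 8.0, 9.0, 10.0])
log_affinity = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = log_amount[:, None] + log_affinity[None, :]
y[0, 0] += 5.0                      # one aberrant probe on one chip

overall, row, col, resid = median_polish(y)
expression = overall + row          # per-chip summary for this probe set
print(expression - expression[0])   # differences between chips
print(resid[0, 0])                  # the outlier ends up in the residual
```

Because medians are used at every sweep, the single aberrant cell is left in the residuals rather than distorting the chip-to-chip differences, which is the sense in which the fit is robust.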
![Page 17: Summarization of Oligonucleotide Expression Arrays](https://reader035.fdocuments.in/reader035/viewer/2022062315/568160dc550346895dd00bcb/html5/thumbnails/17.jpg)
Comparing Measures
20 replicate arrays – variance should be small
Standard deviations of expression estimates on arrays, arranged in four groups of genes by increasing mean expression level
Green: MAS5.0; Black: Li-Wong; Blue, Red: RMA
Courtesy of Terry Speed
![Page 18: Summarization of Oligonucleotide Expression Arrays](https://reader035.fdocuments.in/reader035/viewer/2022062315/568160dc550346895dd00bcb/html5/thumbnails/18.jpg)
Background
• 25-mers are prone to cross-hybridization
• MM > PM for about 1/3 of all probes
• Cross-hybridization varies with GC content
• Signal intensity varies with cross-hybridization
![Page 19: Summarization of Oligonucleotide Expression Arrays](https://reader035.fdocuments.in/reader035/viewer/2022062315/568160dc550346895dd00bcb/html5/thumbnails/19.jpg)
The gcRMA Approach
• Estimate non-specific binding using either:
– True null assay (non-homologous RNA)
– Estimates from MM
• Subtract background before normalization and fitting model
![Page 20: Summarization of Oligonucleotide Expression Arrays](https://reader035.fdocuments.in/reader035/viewer/2022062315/568160dc550346895dd00bcb/html5/thumbnails/20.jpg)
Evaluating gcRMA
• On AffyComp data sets, gcRMA wins
– Replicates with 14 spike-ins done by Affy
• Many investigators get crappy results (and don’t write it up)
• gcRMA does very well on highly expressed genes, not nearly so well on less expressed genes
• Gharaibeh et al. BMC Bioinformatics 2008 9:452
![Page 21: Summarization of Oligonucleotide Expression Arrays](https://reader035.fdocuments.in/reader035/viewer/2022062315/568160dc550346895dd00bcb/html5/thumbnails/21.jpg)
Factor Model
• Assume relation between p observations x and true value z: $x = z + \varepsilon$, where the $\varepsilon_i$ are independent
• Use factor analytic methods for the estimation
– Depends on assuming z ~ Normal
– Differs from RMA in relaxing assumption of IID errors – some probes can have more random error than others
![Page 22: Summarization of Oligonucleotide Expression Arrays](https://reader035.fdocuments.in/reader035/viewer/2022062315/568160dc550346895dd00bcb/html5/thumbnails/22.jpg)
Weighting Probes
• It is clear that some probes are more reliable than others
• How to assess this in a simple fashion?
• If a gene really changes across arrays, then a responsive probe will change more than a noisy probe
• Weight by relative ranges
• Best performance on AffyComp!
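One simplified reading of "weight by relative ranges" can be sketched as below. This is an illustration only, not the published probe-weighting procedure: the function name, the weight definition (each probe's range across arrays divided by the sum of ranges), and the data are all assumptions.

```python
import numpy as np

def range_weighted_summary(log_pm):
    """Summarize an arrays x probes matrix of log intensities for one probe set.

    Each probe is weighted by its range across arrays relative to the
    total range over all probes, so more responsive probes count more.
    """
    log_pm = np.asarray(log_pm, dtype=float)
    ranges = log_pm.max(axis=0) - log_pm.min(axis=0)   # per-probe range across arrays
    total = ranges.sum()
    n_probes = log_pm.shape[1]
    if total == 0:
        # No probe changes at all: fall back to equal weights
        weights = np.full(n_probes, 1.0 / n_probes)
    else:
        weights = ranges / total
    return log_pm @ weights, weights

# Hypothetical probe set: 2 arrays, 3 probes with decreasing responsiveness
log_pm = np.array([[8.0, 8.0, 8.0],
                   [10.0, 9.0, 8.0]])
summary, weights = range_weighted_summary(log_pm)
print(weights)   # probe weights, summing to 1
print(summary)   # one summary value per array
```

A probe that does not move between arrays gets zero weight here, so the summary follows the responsive probes. In practice a weighting scheme would also need to guard against probes whose range is large only because of noise.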
![Page 23: Summarization of Oligonucleotide Expression Arrays](https://reader035.fdocuments.in/reader035/viewer/2022062315/568160dc550346895dd00bcb/html5/thumbnails/23.jpg)
Summary and Evaluation
• No one best solution for all situations
• gcRMA and DFW seem to do very well on AffyComp data
– May need weights for DFW by tissue
• Leading methods seem to rely on probe weighting