Interrogating the transcriptome in all its diversity Joel H Graber.

Transcript of Interrogating the transcriptome in all its diversity Joel H Graber.

  • Slide 1
  • Interrogating the transcriptome in all its diversity Joel H Graber
  • Slide 2
  • Empirical transcript measurements to characterize mRNA processing. Figure labels: stop codon, polyA sites, ESTs, microarray probes, mRNA-Seq, mRNA-RACE, mRNA/cDNA
  • Slide 3
  • The most important thing I can tell you about (especially large-scale) transcriptome measurement: EVERY procedural step can leave its mark on the data. True of both bench and computational steps. What looks like interesting processing may not be. Know your assumptions. Be suspicious. Test and torture your data to be confident.
  • Slide 4
  • Small numbers of genes to test: qPCR
  • Slide 5
  • IGF1 mRNA data indicate at least 15 transcript isoforms
  • Slide 6
  • qPCR Primer Pair Set-up should catch most isoform differences
  • Slide 7
  • Igf1 transcript variants are differentially expressed as a function of strain, tissue, and nutrition
  • Slide 8
  • Many genes simultaneously I: microarrays. The fundamental hypothesis of transcriptome measurement: the state of the cell can be ascertained by which transcripts are expressed at which levels.
  • Slide 9
  • Using gene expression microarrays to assess variation in mRNA processing. Our modified hypothesis of transcript measurement: the activity of a cell is a function of which isoforms of which genes are expressed. Expression arrays, though designed for abundance measurement, can also reveal isoform variation. Summarization to one expression level is problematic.
  • Slide 10
  • Identifying processing changes with expression arrays: a simple example for illustration. One gene with two isoforms, differing only in polyA site. Regulatory sites for post-transcriptional control occur only in the extended isoform. Microarray probes hybridize to common and differential regions.
  • Slide 11
  • Standard microarray analysis can mask changes in mRNA processing (figure panels: Sample 1 vs. Sample 2). Salisbury J et al, PLoS ONE 2009
  • Slide 12
  • Statistical Test I: Modified t-test. Compute the expression ratio for three or more probes on each side of the putative break. Significance is assessed by randomizing probes. Spuriously low variance can cause problems.
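A minimal sketch of the test described on this slide, not the authors' implementation. It assumes per-probe log2 expression ratios (Sample 2 over Sample 1) ordered along the transcript, a Welch-style t statistic, and a hypothetical `var_floor` parameter as one simple guard against spuriously low variance. The function names are illustrative.

```python
import numpy as np

def break_t_statistic(log_ratios, break_idx, var_floor=0.01):
    """Welch-style t statistic comparing probes upstream vs. downstream of a break."""
    log_ratios = np.asarray(log_ratios, dtype=float)
    up, down = log_ratios[:break_idx], log_ratios[break_idx:]
    if len(up) < 3 or len(down) < 3:
        raise ValueError("need at least three probes on each side of the break")
    # Assumption: floor the variances so a few near-identical probes
    # cannot produce an arbitrarily large statistic.
    var_up = max(up.var(ddof=1), var_floor)
    var_down = max(down.var(ddof=1), var_floor)
    se = np.sqrt(var_up / len(up) + var_down / len(down))
    return (up.mean() - down.mean()) / se

def permutation_p_value(log_ratios, break_idx, n_perm=2000, seed=0):
    """Assess significance by randomizing probe order along the transcript."""
    rng = np.random.default_rng(seed)
    log_ratios = np.asarray(log_ratios, dtype=float)
    observed = abs(break_t_statistic(log_ratios, break_idx))
    hits = 0
    for _ in range(n_perm):
        shuffled = rng.permutation(log_ratios)
        # Small tolerance so reorderings of the same split count as ties.
        if abs(break_t_statistic(shuffled, break_idx)) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (hits + 1) / (n_perm + 1)

# Toy example: four probes before the putative polyA break, four after it.
ratios = [0.10, -0.05, 0.00, 0.08, -1.10, -0.90, -1.00, -1.20]
t_obs, p_value = permutation_p_value(ratios, break_idx=4)
print(f"|t| = {t_obs:.2f}, permutation p = {p_value:.3f}")
```

With only a handful of probes per side, the number of distinct splits is small, so the permutation p-value has a coarse lower bound. That is one reason to treat borderline calls with suspicion.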
  • Slide 13
  • Systematic alternative processing of the 3'-UTR can be correlated with functional changes (Science 2008; Cell 2009; Cancer Research 2009)
  • Slide 14
  • The Ube2a long transcript isoform is lost in all three tumor types
  • Slide 15
  • The Pik3ap1 long transcript isoform is preserved in APC and APN, but not LPC tumors
  • Slide 16
  • Summary: tumors have systematic and characteristic changes in RNA processing. Our data support alternative 3' processing as the source of changes, rather than isoform-specific differences in stability. While truncation dominates, genes with elongation are also observed. APC and LPC tumors share an amplified oncogene (Myc), but differ in signature and prognosis. Singh P et al, Cancer Research 2009
  • Slide 17
  • "FIRMA: a method for detection of alternative splicing from exon array data", E. Purdom, K. M. Simpson, M. D. Robinson, J. G. Conboy, A. V. Lapuk and T. P. Speed, Bioinformatics 2008, 24(15):1707-1714. "Differential splicing using whole-transcript microarrays", M. D. Robinson and T. P. Speed, BMC Bioinformatics 2009, 10:156
  • Slide 18
  • FIRMA/FIRMAGene details (equations not captured in this transcript). Full model, with i: array index, j: gene index, k: probe index. Residual. FIRMA score: exon arrays. FIRMAGene score: gene arrays.
  • Slide 19
  • RMA decomposition: total normalized expression is decomposed into estimated probe effect, estimated chip effect, and residual. Figure panels: Heart, Brain, 75:25 B:H
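The decomposition named on this slide can be illustrated with a small additive fit on log2 intensities. This is a sketch under stated assumptions, not the FIRMA code: it uses median polish as a stand-in for the robust fit, and the score (per-exon median residual divided by a MAD-based scale) follows the general idea described in Purdom et al. rather than the exact formulas from the slides. `median_polish` and `firma_like_scores` are illustrative names, and the data are simulated.

```python
import numpy as np

def median_polish(log_expr, n_iter=10):
    """Split a (probes x arrays) log2 matrix into
    overall + probe effect + chip effect + residual."""
    resid = np.array(log_expr, dtype=float)  # working copy
    probe_eff = np.zeros(resid.shape[0])
    chip_eff = np.zeros(resid.shape[1])
    overall = 0.0
    for _ in range(n_iter):
        # Sweep row (probe) medians out of the residuals.
        row_med = np.median(resid, axis=1)
        resid -= row_med[:, None]
        probe_eff += row_med
        shift = np.median(chip_eff)
        chip_eff -= shift
        overall += shift
        # Sweep column (chip) medians out of the residuals.
        col_med = np.median(resid, axis=0)
        resid -= col_med[None, :]
        chip_eff += col_med
        shift = np.median(probe_eff)
        probe_eff -= shift
        overall += shift
    return overall, probe_eff, chip_eff, resid

def firma_like_scores(resid, exon_of_probe):
    """Per-exon, per-array median residual, scaled by a MAD estimate."""
    exon_of_probe = np.asarray(exon_of_probe)
    scale = 1.4826 * np.median(np.abs(resid - np.median(resid)))
    if scale == 0:
        raise ValueError("residual scale is zero; scores are undefined")
    exons = np.unique(exon_of_probe)
    scores = np.vstack([np.median(resid[exon_of_probe == e], axis=0) / scale
                        for e in exons])
    return exons, scores

# Simulated gene: 12 probes in 3 exons, 4 arrays; exon 1 is skipped in array 3.
rng = np.random.default_rng(1)
exon_of_probe = np.repeat([0, 1, 2], 4)
log_expr = (8.0 + rng.normal(0, 1.0, 12)[:, None]
            + rng.normal(0, 0.5, 4)[None, :]
            + rng.normal(0, 0.1, (12, 4)))
log_expr[exon_of_probe == 1, 3] -= 2.0

overall, probe_eff, chip_eff, resid = median_polish(log_expr)
exons, scores = firma_like_scores(resid, exon_of_probe)
print(np.round(scores, 1))  # rows: exons, columns: arrays
```

The skipped exon shows up as a strongly negative score in one array only. Because the fit is unsupervised, the same pattern would look "normal" if most arrays skipped the exon.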
  • Slide 20
  • Validation with known muscle-specific exons. Purdom, E. et al. Bioinformatics 2008 24:1707-1714; doi:10.1093/bioinformatics/btn284
  • Slide 21
  • Open questions, unresolved issues of FIRMA approaches. Genetic variability: probes that change hybridization due to sequence variation. Low-expression probes. Overlapping probes. Unresponsive/hyperresponsive probes. Interpretation: IRLS is unsupervised; the majority can define normal.
  • Slide 22
  • CDFs (chip definition files) MATTER! The CDF maps probes on the array to putative genes/transcripts. CDFs are explicitly dependent on the quality of the annotations used for association. Which CDF is best can literally depend on the specific gene of interest.
  • Slide 23
  • IGF1 annotations (and genomic extent) depend greatly on the data source ~83,000 nt
  • Slide 24
  • Even after you get the CDS right, it's still good to understand the limitations of your array
  • Slide 25
  • Exon-gene array differences. Exon arrays: 4 probes per PSR (probe selection region); a mix of probable and improbable transcribed regions. Gene arrays: mostly probable regions; ~25 probes per targeted transcript.
  • Slide 26
  • Comparison of gene (FIRMAGene) and exon (FIRMA) arrays for MBP. Figure panels: Brain, Muscle
  • Slide 27
  • Measuring isoform variation with mRNAseq
  • Slide 28
  • Slide 29
  • Analysis of large sets of short sequence reads is a rapidly developing field. Alignment first, assembly later: MAQ, Eland, SHRiMP, Bowtie/TopHat, SOAP. Assembly first, alignment later: Trinity, Trans-ABySS, Oases.
  • Slide 30
  • High-throughput sequence data is aligned in conceptually the same way as BLAST. Better heuristics are necessary: the problem is bigger. Program tweaks are different: tuned to small read size, large genome. (Rapid) indexing is still the key: initial attempts were based on standard hashing, later ones on the Burrows-Wheeler Transform.
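To make the indexing idea concrete, here is a toy Burrows-Wheeler Transform with backward search for exact matches. It is a teaching sketch only: real aligners build compressed indexes with sampled occurrence tables and tolerate mismatches, none of which is attempted here. The function names are illustrative and the reference string is made up.

```python
def bwt(text):
    """Burrows-Wheeler Transform via sorted rotations (fine for toy inputs only)."""
    text = text + "$"  # unique end marker that sorts before A, C, G, T
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)

def count_occurrences(bwt_string, pattern):
    """Count exact matches of pattern using backward search on the BWT."""
    # first[c] = index of the first row starting with character c.
    first = {}
    for row, char in enumerate(sorted(bwt_string)):
        first.setdefault(char, row)
    lo, hi = 0, len(bwt_string)  # current range of matching rows [lo, hi)
    for char in reversed(pattern):
        if char not in first:
            return 0
        # Naive rank queries; a real index precomputes these counts.
        lo = first[char] + bwt_string[:lo].count(char)
        hi = first[char] + bwt_string[:hi].count(char)
        if lo >= hi:
            return 0
    return hi - lo

reference = "ACGTACGTTACG"
index = bwt(reference)
print(index, count_occurrences(index, "ACG"))
```

Each read character narrows the row range in one step, so lookup time depends on read length rather than genome length. That property is what makes the approach attractive for short reads against a large genome.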
  • Slide 31
  • Alignment first processing strategy
  • Slide 32
  • Standard reporting of short reads: RPKM
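For reference, RPKM is reads per kilobase of transcript per million mapped reads. A minimal sketch of the calculation follows; the function and parameter names are illustrative.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    if transcript_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total mapped reads must be positive")
    # Equivalent to read_count / (length in kb) / (mapped reads in millions).
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)

# 500 reads on a 2,000 bp transcript in a library of 20 million mapped reads.
print(rpkm(500, 2000, 20_000_000))  # 12.5
```

The result depends on which transcript length is used, which becomes ambiguous when a gene has several isoforms.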
  • Slide 33
  • A principal benefit of mRNAseq: novel exon/isoform discovery (Mortazavi et al)
  • Slide 34
  • Better mapping to splice junctions. Align first to the genome. Remove perfect matches from further consideration (good idea?). The remainder are aligned to a broadened set of possible splice junctions: extract ~25 bases from each exon, join, and use as target. Include both standard, annotated splices and additional possible splices.
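The junction-target construction on this slide can be sketched as follows. The assumptions are mine, not from the slides: exons are given as 0-based, half-open (start, end) intervals on one strand of a single sequence, sorted by position, and every ordered pair of exons is treated as a possible splice. `junction_targets` is a hypothetical helper name.

```python
from itertools import combinations

def junction_targets(genome, exons, flank=25):
    """Build synthetic junction sequences: the last `flank` bases of an upstream
    exon joined to the first `flank` bases of a downstream exon."""
    targets = {}
    for (i, (start_i, end_i)), (j, (start_j, end_j)) in combinations(enumerate(exons), 2):
        donor = genome[max(start_i, end_i - flank):end_i]        # 3' end of upstream exon
        acceptor = genome[start_j:min(end_j, start_j + flank)]   # 5' end of downstream exon
        targets[(i, j)] = donor + acceptor
    return targets

# Toy gene: three 30 bp exons separated by 20 bp introns.
genome = "A" * 30 + "N" * 20 + "C" * 30 + "N" * 20 + "G" * 30
exons = [(0, 30), (50, 80), (100, 130)]
targets = junction_targets(genome, exons)
for (i, j), seq in targets.items():
    kind = "annotated-style" if j == i + 1 else "additional"
    print(i, j, kind, len(seq))
```

Adjacent pairs stand in for the standard, annotated splices, and non-adjacent pairs stand in for the additional possible splices such as exon skipping. The number of targets grows quadratically with exon count, so a real pipeline would likely restrict the pairs considered.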
  • Slide 35
  • RNA-seq analysis: alignment
  • Slide 36
  • RNA-seq analysis II: identifying isoforms
  • Slide 37
  • RNA-seq analysis III: visualization and interpretation
  • Slide 38
  • Assessing the ability to identify alternative isoforms
  • Slide 39
  • Trinity assembles reads to transcripts first (Grabherr et al, Nature Biotech 2011)
  • Slide 40
  • Sequences in a de Bruijn graph
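As a small illustration of the data structure, the sketch below builds a de Bruijn graph from reads, with (k-1)-mers as nodes and k-mers as edges. It is a generic textbook construction, not Trinity's pipeline, and it ignores reverse complements, sequencing errors, and k-mer counts.

```python
from collections import defaultdict

def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

# Two toy reads that share a prefix and then diverge, as two isoforms might.
reads = ["AACGT", "AACGA"]
graph = de_bruijn_graph(reads, k=4)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

The node with two outgoing edges marks the point where the two sequences diverge. Branches like this are what an assembler has to resolve into alternative transcripts.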
  • Slide 41
  • Open questions/problems. Dealing with the length bias: RPKM does not correctly normalize. Optimal alignment: still in development; paralogs and common motifs are a problem. Depth of coverage for isoform characterization: capture or focused chemistry helps.
  • Slide 42
  • Summarization to total counts leads to false positives
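One way this can happen is an isoform switch with no change in the number of transcripts. The worked example below assumes a deliberately simplified model in which expected read count is proportional to transcript length times molecule number. The isoform lengths and molecule counts are invented for illustration.

```python
def expected_reads(molecules, length_bp, reads_per_kb_per_molecule=1.0):
    """Expected reads under uniform fragment sampling (simplifying assumption)."""
    return molecules * (length_bp / 1000.0) * reads_per_kb_per_molecule

def total_gene_reads(molecules_by_isoform, lengths_bp):
    """Sum expected reads over all isoforms of one gene."""
    return sum(expected_reads(count, lengths_bp[name])
               for name, count in molecules_by_isoform.items())

lengths_bp = {"long": 4000, "short": 1000}
condition_a = {"long": 90, "short": 10}  # 100 molecules, mostly long isoform
condition_b = {"long": 10, "short": 90}  # 100 molecules, mostly short isoform

reads_a = total_gene_reads(condition_a, lengths_bp)
reads_b = total_gene_reads(condition_b, lengths_bp)
print(reads_a, reads_b, round(reads_a / reads_b, 2))
```

Both conditions contain the same number of transcript molecules, yet the gene-level read totals differ almost threefold. A count-based test would report differential expression where there is only a change in processing.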
  • Slide 43
  • Recent work I: Dealing with systematic bias in RNAseq data. Sources of bias: fragmentation, random priming. Papers of note: "Biases in Illumina transcriptome sequencing caused by random hexamer priming", Hansen et al, Nucleic Acids Research, Volume 38, Issue 12, p. e131. "Using non-uniform read distribution models to improve isoform expression inference in RNA-Seq", Wu et al, Bioinformatics, doi:10.1093/bioinformatics/btq696
  • Slide 44
  • Recent work II: Focused sequencing of polyA sites. Standard mRNAseq does not adequately sample polyA sites. Recent directed studies: "Formation, regulation and evolution of Caenorhabditis elegans 3'UTRs", Jan et al, Nature, Volume 469, Pages 97-101. "Comprehensive polyadenylation site maps in yeast and human reveal pervasive alternative polyadenylation", Ozsolak et al, Cell 143(6):1018-29 (2010)
  • Slide 45