RNA-seq: Quantifying the Transcriptome
Alisha Holloway, PhD
Gladstone Bioinformatics Core Director
What is RNA-seq?
Use of high-throughput sequencing technologies to assess the RNA content of a sample.
Why do an RNA-seq experiment?
• Detect differential expression
• Assess allele-specific expression
• Quantify alternative transcript usage
• Discover novel genes/transcripts, gene fusions
• Profile transcriptome
• Ribosome profiling to measure translation
Skelly et al. 2011
Pluripotent Stem Cell
Cardiogenic Mesoderm
Cardiac Precursors
Cardiomyocytes
More tomorrow!
Ingolia et al. 2009, Weissman Lab
RNA-seq
• ID novel genes, transcripts, & exons
• Greater dynamic range
• Less bias due to genetic variation
• Repeatable
• No species-specific primer/probe design
• More accurate relative to qPCR
• Many more applications
Microarray
• Well vetted QC and analysis methods
• Well characterized biases
• Quick turnaround from established core facilities
• Currently less expensive
RNA-seq vs. Affy (Marioni et al. 2008)
RNA-seq vs. Taqman (© 2010 NuGen)
                Illumina                             Pac-Bio
Read length     100 bp paired end                    2500 bp avg
Throughput      200 million read pairs/lane          1 million reads/SMRT cell
Error rate      <1%                                  15% total, most are indels, 4% SNP
Cost            $600/sample                          $7-8k/sample
Accessibility   UCSF, UC-Davis, BGI                  No commercially available protocols
Uses            DE, ASE, quant alt. transc. usage    Characterize transcriptome
When to use Pac-Bio
Plan it well.
• Experimental design
  – Biological replicates
  – Reference genome?
  – Good gene annotation?
• Read depth
• Barcoding
• Read length
• Paired vs. single-end
Technical variation
Biological variation
How much data do we need?
• ~15-20K genes expressed in a tissue or cell line.
• Genes are on average 3 kb.
• For 1x coverage using 100 bp reads, we would need 600K sequence reads (20K genes x 3 kb = 60 Mb; 60 Mb / 100 bp = 600K).
• In reality, we need MUCH higher coverage to accurately estimate gene expression levels.
• 50 million reads per sample
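The arithmetic on this slide can be sketched in a few lines of Python. The function names are illustrative, and the calculation assumes every expressed gene is covered evenly, which real expression data do not satisfy: a few highly expressed genes take most of the reads, which is one reason the target depth is so much higher than 1x.

```python
def reads_for_coverage(n_genes, mean_gene_length_bp, read_length_bp, coverage=1.0):
    """Reads needed to cover the expressed transcriptome at the given mean depth."""
    transcriptome_bp = n_genes * mean_gene_length_bp
    return coverage * transcriptome_bp / read_length_bp


def mean_coverage(n_reads, n_genes, mean_gene_length_bp, read_length_bp):
    """Mean depth obtained from n_reads, assuming perfectly even coverage."""
    sequenced_bp = n_reads * read_length_bp
    return sequenced_bp / (n_genes * mean_gene_length_bp)


# Slide values: 20K genes, 3 kb average length, 100 bp reads.
reads_1x = reads_for_coverage(20_000, 3_000, 100)
depth_50m = mean_coverage(50_000_000, 20_000, 3_000, 100)
print(f"reads for 1x: {reads_1x:,.0f}")
print(f"mean depth at 50M reads: {depth_50m:.1f}x")
```

With the slide's numbers, 1x needs 600,000 reads, and 50 million reads give a mean depth of roughly 83x.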
Read depth and barcoding
• 200 million reads / lane
• Run 4 samples / lane
Unique sequences = 4^(read length)
Read length    Unique sequences
25             1.1 x 10^15
50             1.3 x 10^30
100            1.6 x 10^60
~60 million coding bases in vertebrate genome
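The table values follow directly from the four-letter alphabet. A minimal check (the function name is illustrative):

```python
def unique_sequences(read_length):
    """Number of distinct DNA sequences of a given length (alphabet A, C, G, T)."""
    return 4 ** read_length


for length in (25, 50, 100):
    print(length, f"{unique_sequences(length):.1e}")
```

Even at 25 bp the number of possible sequences far exceeds the ~60 million coding bases quoted above. That said, real genomes contain repeats and paralogs, so sequence-space size alone does not guarantee that a read maps uniquely.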
Paired-end!
• Effectively doubles read length – huge impact on read mapping
• Increases number of splice junction spanning reads
• Critical for estimating transcript-level abundance
The wet lab side…briefly
How do you make sense of this pile of data?
• QC
• Alignment
• Expt: Compare two groups
  – Transcript Assignment & Abundance
  – Differential Expression
• Expt: Allele-specific expression
QC
• FastQC - http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
• Proportion of reads that mapped uniquely
  – Remove duplicates; likely due to PCR amp.
• Assess ribosomal RNA content
• Assess content of possible contaminants – human RNA (if not human samples), Mycoplasma (if cell lines)
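As a rough illustration of the two mapping checks above, the sketch below works on a toy list of alignments. The `Alignment` record and its fields are hypothetical, not the output format of any aligner. Treating reads with the same chromosome, start position, and strand as duplicates is a simplifying assumption. Dedicated duplicate-marking tools apply more careful rules, and highly expressed genes can legitimately produce identical fragments.

```python
from collections import namedtuple

# Hypothetical record: one best alignment per mapped read.
# n_hits is the number of genomic locations the read aligned to.
Alignment = namedtuple("Alignment", "read_id chrom pos strand n_hits")


def unique_fraction(alignments, total_reads):
    """Fraction of all sequenced reads that mapped to exactly one location."""
    unique = sum(1 for a in alignments if a.n_hits == 1)
    return unique / total_reads


def remove_duplicates(alignments):
    """Keep the first read seen at each (chrom, pos, strand)."""
    seen = set()
    kept = []
    for a in alignments:
        key = (a.chrom, a.pos, a.strand)
        if key not in seen:
            seen.add(key)
            kept.append(a)
    return kept


# Toy data: five reads sequenced, four mapped, one of them multi-mapping.
toy = [
    Alignment("r1", "chr1", 100, "+", 1),
    Alignment("r2", "chr1", 100, "+", 1),  # same position as r1
    Alignment("r3", "chr1", 250, "-", 1),
    Alignment("r4", "chr2", 40, "+", 3),
]
print(unique_fraction(toy, total_reads=5))
print([a.read_id for a in remove_duplicates(toy)])
```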
Then what?
• Align reads to the genome
  – Easy(ish) for genomic sequence
  – Difficult for transcripts with splice junctions
Alignment Algorithms
• Burrows-Wheeler Transform
  – Bowtie (Langmead et al. 2009)
  – BWA (Li and Durbin 2009)
  – SOAP2 (Li et al. 2009)
• Smith-Waterman
  – BFAST (Homer et al. 2009, based on BLAT) – multiple indexes, finds candidate alignment locations using seed and extend, followed by a gapped Smith-Waterman local alignment for each candidate
http://en.wikipedia.org/wiki/List_of_sequence_alignment_software
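For readers who have not seen the Burrows-Wheeler Transform, here is a naive sketch of the transform and its inverse. It sorts every rotation of the text, which is fine for a toy string but is not how production aligners work: tools such as Bowtie and BWA build a compressed index from suffix arrays instead. The `$` terminator and the function names are conventions chosen for this example.

```python
def bwt(text, terminator="$"):
    """Naive Burrows-Wheeler Transform: last column of the sorted rotations."""
    if terminator in text:
        raise ValueError("text must not contain the terminator")
    s = text + terminator
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed, terminator="$"):
    """Rebuild the original text by repeatedly prepending and re-sorting."""
    n = len(transformed)
    table = [""] * n
    for _ in range(n):
        table = sorted(transformed[i] + table[i] for i in range(n))
    original = next(row for row in table if row.endswith(terminator))
    return original[:-1]


print(bwt("banana"))
print(inverse_bwt(bwt("GATTACA")))
```

The transform groups identical characters that share a following context, which is what makes the index both compressible and searchable.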
Alignment tools for splice junction mapping
• Tophat
• MapSplice
• SpliceMap
• HMMsplicer
Tophat
• Map reads to transcriptome using Bowtie
• Map to genome to discover novel exons
  – or start here if no annotation available
• Split reads to smaller segments; map to genome to discover novel splice junctions
• Report best alignment for each read
Trapnell et al. Bioinformatics 2009; Trapnell et al. Nature Protocols 2012
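The split-read idea can be illustrated with a toy example. This is not Tophat's implementation: it uses exact string matching in place of Bowtie, a made-up genome, and a read whose junction falls exactly on a segment boundary. All names and sequences below are hypothetical.

```python
def map_segments(read, genome, segment_length):
    """Exact-match each fixed-length segment; -1 means the segment was not found."""
    hits = []
    for offset in range(0, len(read) - segment_length + 1, segment_length):
        segment = read[offset:offset + segment_length]
        hits.append(genome.find(segment))
    return hits


def candidate_junctions(hits, segment_length):
    """Adjacent segments that map far apart suggest an intron between them."""
    junctions = []
    for left, right in zip(hits, hits[1:]):
        if left >= 0 and right >= 0 and right - left > segment_length:
            junctions.append((left + segment_length, right))  # intron [start, end)
    return junctions


# Toy genome: exon1 + intron + exon2 (sequences invented for illustration).
exon1, intron, exon2 = "TTGACCGA", "GTAAGTTTTTCAG", "CATGGCTA"
genome = exon1 + intron + exon2
read = exon1[-4:] + exon2[:4]  # spliced read spanning the junction

hits = map_segments(read, genome, segment_length=4)
for start, end in candidate_junctions(hits, segment_length=4):
    print(start, end, genome[start:end])
```

The read as a whole does not occur in the genome, but its two halves do, and the gap between them recovers the intron.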
MapSplice & SpliceMap
Wang et al. NAR 2010, Au et al. NAR 2010
• Tag alignment (user chooses aligner)
  – Break reads into segments
  – Map reads
  – Unmapped segments considered for splice junction mapping based on location of partner segment
  – Merge segments from read for final alignment
• Assess splice junction quality
HMMsplicer
• Remove reads that map contiguously
• Hidden Markov model to detect exon boundary of remaining reads
• Compute intensive
• Reference annotation not used
• Best for compact genomes
• User sets threshold for accepting splice junction.
Dimon et al. PLoS One 2010
HMMsplicer
Martin & Wang, Nature Reviews Genetics 2011
Transcript Assignment/Abundance
Transcript Assignment & Abundance Tools
• For DE:
  – Cufflinks
  – MISO
  – Scripture – not maintained
• De novo assembly
  – Cufflinks
  – Trans-ABySS
  – Trinity
  – Maker
Cufflinks
• Constructs the parsimonious set of transcripts that explain the reads observed. Basically, finds a minimum path cover on the DAG.
• Derives a likelihood for the abundances of a set of transcripts given a set of fragments.
• FPKM – fragments per kb of exon per million fragments mapped.
Trapnell, Pachter
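The FPKM definition above reduces to simple arithmetic once fragment counts are known. The sketch below shows only that arithmetic, with made-up numbers. Cufflinks itself estimates abundances by maximum likelihood and handles fragments that are compatible with several transcripts, which this example does not attempt.

```python
def fpkm(fragments, exon_length_bp, total_mapped_fragments):
    """Fragments per kilobase of exon per million fragments mapped."""
    per_kb = fragments / (exon_length_bp / 1_000)
    return per_kb / (total_mapped_fragments / 1_000_000)


# Illustrative numbers: 1,500 fragments on a 3 kb transcript, 50M mapped fragments.
print(fpkm(1_500, 3_000, 50_000_000))
```

Dividing by length makes long and short transcripts comparable. Dividing by library size makes samples sequenced to different depths comparable.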
MISO
• Mixture of Isoforms
• Bayesian – treats expression level of set of isoforms as random variable and estimates a distribution over the values of this variable.
• Gives confidence intervals for expression estimates and measures of DE as Bayes factors
Burge Lab @ MIT
Bias Correction and Normalization
• Random hexamer bias (Hansen et al. 2010)
  – From PCR or RT primers
  – Reestimate FPKM or read counts based on bias
• Upper quartile normalization (Bullard et al. 2010)
  – excellent resource for comparison to qPCR and microarray as well as methods of normalization of RNA-seq data
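A minimal sketch of upper quartile scaling, assuming each sample is a plain list of per-gene counts. The zero filter here is applied per sample for simplicity, and the quantile interpolation method is a choice made for this example. Consult Bullard et al. 2010 for the exact procedure.

```python
from statistics import quantiles


def upper_quartile(counts):
    """75th percentile of the non-zero counts in one sample."""
    nonzero = sorted(c for c in counts if c > 0)
    return quantiles(nonzero, n=4, method="inclusive")[2]


def upper_quartile_normalize(counts):
    """Scale one sample's counts by its own upper quartile."""
    uq = upper_quartile(counts)
    return [c / uq for c in counts]


# Toy samples: sample_b is sample_a sequenced twice as deeply.
sample_a = [0, 10, 20, 30, 40]
sample_b = [0, 20, 40, 60, 80]
print(upper_quartile_normalize(sample_a))
print(upper_quartile_normalize(sample_b))
```

After scaling, the two toy samples agree, which is the point: the upper quartile is less sensitive than the total count to a handful of very highly expressed genes.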
Differential Expression
• Goal: determine whether observed difference in read counts is greater than would be expected due to random variation.
• If reads were independently sampled from the population, read counts would follow a multinomial distribution, approximated by the Poisson
Differential Expression
• BUT! We know that the count data show more variance than expected
• Overdispersion problem mitigated by using the negative binomial distribution, which is determined by mean and variance
Sample j, gene i
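To make overdispersion concrete: under a Poisson model the variance equals the mean, while a common negative binomial parameterization has variance = mean + dispersion x mean^2. The sketch below estimates that dispersion for one gene by the method of moments. The counts are made up. DESeq and edgeR do not estimate dispersion gene by gene like this: they share information across genes.

```python
from statistics import mean, variance


def dispersion_mom(counts):
    """Method-of-moments dispersion: solve var = mu + alpha * mu**2 for alpha."""
    mu = mean(counts)
    if mu == 0:
        return 0.0
    var = variance(counts)  # sample variance across replicates
    return max(0.0, (var - mu) / mu ** 2)


# Invented replicate counts for one gene.
poisson_like = [98, 102, 100, 101, 99]    # variance below the mean
overdispersed = [80, 100, 120, 60, 140]   # variance well above the mean
print(dispersion_mom(poisson_like))
print(dispersion_mom(overdispersed))
```

A dispersion of zero recovers the Poisson case. Positive values mean replicates vary more than counting noise alone would predict.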
Differential Expression
• Binomial test
  – Old Cuffdiff
• Negative binomial
  – DESeq – estimates variance using all genes with similar expression levels
  – Cuffdiff – similar to DESeq, but incorporates fragment assignment uncertainty simultaneously
  – edgeR – moderates variance over all genes
  – T-test
Differential Expression
Old cuffdiff
Some biology, finally?
• How have gene expression patterns changed during the course of differentiation?
• Which genes are specific to certain cell types?
• What can we learn about what those co-expressed genes do?
Clusters of co-expressed genes
• Use unsupervised clustering to group genes by expression pattern
• Use gene ontology information to determine which kinds of genes are in each group
• Reveal novel associations and gene types
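One way to group genes by expression pattern is k-means on standardized profiles. The sketch below is a bare-bones version using only the standard library. The gene names and values are invented, the number of clusters is fixed by hand, and the centroids are initialized from the first k profiles to keep the example deterministic. It is not necessarily the clustering method used for the analysis in these slides.

```python
from statistics import mean, pstdev


def zscore(profile):
    """Standardize one gene's profile so clustering sees shape, not magnitude."""
    mu, sd = mean(profile), pstdev(profile)
    return [(x - mu) / sd for x in profile]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(profiles, k, n_iter=20):
    """Plain k-means; centroids start at the first k profiles (an assumption)."""
    centroids = [list(p) for p in profiles[:k]]
    labels = [0] * len(profiles)
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in profiles]
        for c in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            if members:
                centroids[c] = [mean(col) for col in zip(*members)]
    return labels


# Invented expression values across four differentiation stages.
toy = {
    "early_1": [100, 50, 10, 5],
    "late_1": [2, 10, 60, 120],
    "early_2": [80, 30, 8, 2],
    "late_2": [1, 5, 40, 90],
}
labels = kmeans([zscore(v) for v in toy.values()], k=2)
print(dict(zip(toy, labels)))
```

Genes that switch off early and genes that switch on late land in separate clusters. Each cluster's gene list can then be passed to a gene ontology enrichment tool.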
Clusters of co-expressed genes
Pluripotency/stem cell: Nanog, Oct4
Mesoderm/cell fate commitment: Mesp1, Eomes
Cardiac precursors: Isl1, Mef2c, Wnt2
Cardiac structure/function: Actc1, Ryr2, Tnni3