RNA-seq: Quantifying the Transcriptome

Alisha Holloway, PhD
Gladstone Bioinformatics Core Director

What is RNA-seq?

Use of high-throughput sequencing technologies to assess the RNA content of a sample.

Why do an RNA-seq experiment?

• Detect differential expression
• Assess allele-specific expression
• Quantify alternative transcript usage
• Discover novel genes/transcripts, gene fusions
• Profile transcriptome
• Ribosome profiling to measure translation





Skelly et al. 2011





[Figure: Pluripotent Stem Cell → Cardiogenic Mesoderm → Cardiac Precursors → Cardiomyocytes]





More tomorrow!

Ingolia et al. 2009, Weissman Lab

RNA-seq
• ID novel genes, transcripts, & exons
• Greater dynamic range
• Less bias due to genetic variation
• Repeatable
• No species-specific primer/probe design
• More accurate relative to qPCR
• Many more applications

Microarray
• Well vetted QC and analysis methods
• Well characterized biases
• Quick turnaround from established core facilities
• Currently less expensive

[Figures: RNA-seq vs. Affy; RNA-seq vs. Taqman. Sources: Marioni et al. 2008; © 2010 NuGen]

                 Illumina                                   Pac-Bio
Read length      100 bp paired end                          2500 bp avg
Throughput       200 million read pairs/lane                1 million reads/SMRT cell
Error rate       <1%                                        15% total, most are indels, 4% SNP
Cost             $600/sample                                $7-8k/sample
Accessibility    UCSF, UC-Davis, BGI                        No commercially available protocols
Uses             DE, ASE, quantify alt. transcript usage    Characterize transcriptome

When to use Pac-Bio

Plan it well.

• Experimental design
– Biological replicates
– Reference genome?
– Good gene annotation?
• Read depth
• Barcoding
• Read length
• Paired vs. single-end

[Figure labels: Technical variation, Biological variation]




How much data do we need?

• ~15-20K genes expressed in a tissue or cell line.
• Genes are on average 3 kb.
• For 1x coverage using 100 bp reads, we would need 600K sequence reads (20K genes x 3 kb / 100 bp).
• In reality, we need MUCH higher coverage to accurately estimate gene expression levels.
• 50 million reads per sample
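The arithmetic on this slide can be written out as a small sketch. It assumes reads are spread evenly over the expressed transcriptome, which real data never is (a few highly expressed genes soak up most reads), so treat the result as a floor rather than a target. The function names are illustrative only.

```python
def reads_for_coverage(n_genes, mean_gene_len_bp, read_len_bp, coverage=1.0):
    """Reads needed to cover the expressed transcriptome `coverage` times,
    assuming perfectly uniform sampling (an idealization)."""
    transcriptome_bp = n_genes * mean_gene_len_bp
    return coverage * transcriptome_bp / read_len_bp


def mean_coverage(n_reads, n_genes, mean_gene_len_bp, read_len_bp):
    """Average fold-coverage obtained from `n_reads` reads."""
    return n_reads * read_len_bp / (n_genes * mean_gene_len_bp)


# Numbers from the slide: 20K genes, 3 kb average, 100 bp reads.
print(reads_for_coverage(20_000, 3_000, 100))          # reads for 1x
print(mean_coverage(50_000_000, 20_000, 3_000, 100))   # average fold-coverage at 50M reads
```

At 50 million reads the average works out to roughly 83x, which is why a figure of that order leaves room for low-abundance transcripts.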

Plan it well.


• Read depth
• Barcoding
• Read length
• Paired vs. single-end

200 million reads / lane

Run 4 samples / lane
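A quick sketch of the barcoding arithmetic, assuming barcoded samples split a lane evenly (in practice the balance is never perfect). Function names are hypothetical.

```python
def reads_per_sample(reads_per_lane, samples_per_lane):
    """Expected reads per sample if barcodes are perfectly balanced."""
    return reads_per_lane // samples_per_lane


def max_samples_per_lane(reads_per_lane, target_reads_per_sample):
    """How many samples fit in one lane at a given target depth."""
    return reads_per_lane // target_reads_per_sample


print(reads_per_sample(200_000_000, 4))
print(max_samples_per_lane(200_000_000, 50_000_000))
```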

Plan it well.



Unique sequences = 4^(read length)

Read length    Unique sequences
25             1.1 x 10^15
50             1.3 x 10^30
100            1.6 x 10^60

~60 million coding bases in vertebrate genome
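The table values follow directly from four possible bases per position. A minimal check, with the 60 million coding bases taken from the slide:

```python
def n_possible_sequences(read_len):
    """Number of distinct DNA sequences of a given length (4 bases per position)."""
    return 4 ** read_len


CODING_BASES = 60_000_000  # figure quoted on the slide

for read_len in (25, 50, 100):
    n = n_possible_sequences(read_len)
    # Ratio of possible sequences to coding bases: how sparse the sequence space is.
    print(read_len, f"{n:.1e}", f"{n / CODING_BASES:.1e}")
```

Even at 25 bp the space of possible sequences dwarfs the coding genome; longer reads help mainly with repeats, paralogs, and splice junctions rather than with raw uniqueness.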

Plan it well.



Paired-end!

• Effectively doubles read length – huge impact on read mapping

• Increases number of splice junction spanning reads

• Critical for estimating transcript-level abundance

The wet lab side…briefly

How do you make sense of this pile of data?

• QC
• Alignment
• Expt: Compare two groups
– Transcript Assignment & Abundance
– Differential Expression
• Expt: Allele-specific expression

QC

• FastQC - http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

• Proportion of reads that mapped uniquely
– Remove duplicates; likely due to PCR amplification
• Assess ribosomal RNA content
• Assess content of possible contaminants: human RNA (if not human samples), Mycoplasma (if cell lines)
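Two of the checks above, sketched on a made-up record format. The `Alignment` tuple and its fields are assumptions for illustration; a real pipeline would read these values from the aligner's output, and would usually use an established duplicate-marking tool rather than this position-only rule.

```python
from collections import namedtuple

# Hypothetical minimal alignment record; n_hits = number of places the read mapped.
Alignment = namedtuple("Alignment", "read_id chrom pos strand n_hits")


def unique_fraction(alignments, total_reads):
    """Proportion of all sequenced reads that mapped to exactly one location."""
    unique = sum(1 for a in alignments if a.n_hits == 1)
    return unique / total_reads


def remove_duplicates(alignments):
    """Keep the first read seen at each (chrom, pos, strand); drop the rest."""
    seen = set()
    kept = []
    for a in alignments:
        key = (a.chrom, a.pos, a.strand)
        if key not in seen:
            seen.add(key)
            kept.append(a)
    return kept


example = [
    Alignment("r1", "chr1", 100, "+", 1),
    Alignment("r2", "chr1", 100, "+", 1),  # same start and strand as r1
    Alignment("r3", "chr2", 50, "-", 1),
    Alignment("r4", "chr3", 7, "+", 3),    # multi-mapped
]
print(unique_fraction(example, total_reads=5))
print([a.read_id for a in remove_duplicates(example)])
```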


Then what?

• Align reads to the genome
– Easy(ish) for genomic sequence
– Difficult for transcripts with splice junctions

Alignment Algorithms

• Burrows-Wheeler Transform
– Bowtie (Langmead et al. 2009)
– BWA (Li and Durbin 2009)
– SOAP2 (Li et al. 2009)
• Smith-Waterman
– BFAST (Homer et al. 2009, based on BLAT): multiple indexes; finds candidate alignment locations using seed and extend, followed by a gapped Smith-Waterman local alignment for each candidate
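To make the two algorithm families above concrete, here is a toy sketch of each. Neither is how the listed tools are implemented: real aligners build a compressed index on top of the Burrows-Wheeler transform and never sort rotations explicitly, and they use heavily optimized Smith-Waterman variants. The scoring values are arbitrary assumptions.

```python
def bwt(text, sentinel="$"):
    """Burrows-Wheeler transform by sorting all rotations (toy inputs only)."""
    text = text + sentinel
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rot[-1] for rot in rotations)


def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(bwt("banana"))
print(smith_waterman_score("TTACGTTT", "GGACGTGG"))
```

The transform groups identical characters with similar context together, which is what makes the compressed index searchable; the local alignment score is what a seed-and-extend aligner computes around each candidate location.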

http://en.wikipedia.org/wiki/List_of_sequence_alignment_software#Short-Read_Sequence_Alignment

Alignment tools for splice junction mapping

• Tophat
• MapSplice
• SpliceMap
• HMMsplicer

Tophat

• Map reads to transcriptome using Bowtie
• Map to genome to discover novel exons
– or start here if no annotation available

• Split reads to smaller segments; map to genome to discover novel splice junctions

• Report best alignment for each read

Trapnell et al. Bioinformatics 2009; Trapnell et al. Nature Protocols 2012
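The segment-splitting idea can be illustrated on a toy genome. This is a sketch of the concept only, not Tophat's algorithm: exact string search stands in for Bowtie, the sequences are invented, and the helper names are hypothetical. A read spanning an exon-exon junction is cut into segments; the segments on either side map, the one straddling the junction does not, and the extra genomic distance between the mapped segments suggests an intron.

```python
def split_read(read, seg_len):
    """Cut a read into consecutive segments of seg_len bases."""
    return [read[i:i + seg_len] for i in range(0, len(read), seg_len)]


def map_segments(read, genome, seg_len):
    """Return (offset_in_read, genome_position_or_None) per segment, exact match only."""
    hits = []
    for k, seg in enumerate(split_read(read, seg_len)):
        pos = genome.find(seg)
        hits.append((k * seg_len, pos if pos != -1 else None))
    return hits


def implied_gap(hits):
    """Genomic distance between first and last mapped segments, beyond their
    distance within the read. A large positive value suggests a spliced-out intron."""
    mapped = [(off, pos) for off, pos in hits if pos is not None]
    if len(mapped) < 2:
        return None
    (off_a, pos_a), (off_b, pos_b) = mapped[0], mapped[-1]
    return (pos_b - pos_a) - (off_b - off_a)


# Invented toy sequences: two 16 bp exons separated by a 20 bp intron.
exon1 = "ATGGCGTACGTTAGCA"
intron = "GTAAGT" + "C" * 12 + "AG"
exon2 = "GATTCAGGCTTACCGA"
genome = exon1 + intron + exon2
read = exon1[4:] + exon2[:12]   # 24 bp read spanning the junction

hits = map_segments(read, genome, seg_len=8)
print(hits)
print(implied_gap(hits))
```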

MapSplice & SpliceMap

Wang et al. NAR 2010, Au et al. NAR 2010

• Tag alignment (user chooses aligner)
– Break reads into segments
– Map reads
– Unmapped segments considered for splice junction mapping based on location of partner segment
– Merge segments from read for final alignment
• Assess splice junction quality

HMMsplicer

• Remove reads that map contiguously
• Hidden Markov model to detect exon boundary of remaining reads
• Compute intensive
• Reference annotation not used
• Best for compact genomes
• User sets threshold for accepting splice junction.

Dimon et al. PLoS One 2010


Martin & Wang, Nature Reviews Genetics 2011

Transcript Assignment/Abundance

Transcript Assignment and/or Abundance Tools

• For DE:
– Cufflinks
– MISO
– Scripture (not maintained)
• De novo assembly
– Cufflinks
– Trans-ABySS
– Trinity
– Maker

Cufflinks

• Constructs the parsimonious set of transcripts that explain the reads observed. Basically, finds a minimum path cover on the DAG.

• Derives a likelihood for the abundances of a set of transcripts given a set of fragments.

• FPKM – fragments per kb of exon per million fragments mapped.

Trapnell, Pachter
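The FPKM unit itself is simple to compute once fragments have been assigned. The sketch below only shows the unit conversion from the definition above; Cufflinks arrives at its fragment assignments through the likelihood model, which is not reproduced here. The example numbers are invented.

```python
def fpkm(fragments, exon_length_bp, total_mapped_fragments):
    """Fragments per kilobase of exon per million fragments mapped."""
    kilobases = exon_length_bp / 1_000
    millions_mapped = total_mapped_fragments / 1_000_000
    return fragments / (kilobases * millions_mapped)


# Invented example: 1,500 fragments on a 3 kb transcript in a 50M-fragment library.
print(fpkm(1_500, 3_000, 50_000_000))
```

Dividing by length keeps long transcripts from looking more abundant just because they yield more fragments; dividing by library size makes samples sequenced to different depths comparable.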

MISO

• Mixture of Isoforms
• Bayesian: treats the expression level of a set of isoforms as a random variable and estimates a distribution over the values of this variable.

• Gives confidence intervals for expression estimates and measures of DE as Bayes factors

Burge Lab @ MIT
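To show what "estimating a distribution" over isoform usage means, here is a deliberately simplified two-isoform sketch. It is not MISO's model or API: it assumes every read can be assigned unambiguously to the inclusion or exclusion isoform, ignores isoform length, and uses a flat prior. Under those assumptions the posterior for the inclusion fraction (called `psi` here) is evaluated on a grid and summarized with an interval.

```python
import math


def psi_posterior(n_incl, n_excl, grid_size=1001):
    """Grid posterior for the inclusion fraction psi under a flat prior
    and a binomial likelihood (simplifying assumptions, see lead-in)."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    log_w = []
    for psi in grid:
        if (psi == 0.0 and n_incl > 0) or (psi == 1.0 and n_excl > 0):
            log_w.append(float("-inf"))  # impossible given the observed reads
            continue
        lw = 0.0
        if n_incl:
            lw += n_incl * math.log(psi)
        if n_excl:
            lw += n_excl * math.log(1.0 - psi)
        log_w.append(lw)
    peak = max(log_w)  # subtract the peak to avoid underflow at high read counts
    weights = [math.exp(lw - peak) for lw in log_w]
    total = sum(weights)
    return grid, [w / total for w in weights]


def credible_interval(grid, probs, mass=0.95):
    """Equal-tailed interval read off the cumulative grid posterior."""
    lo_q = (1 - mass) / 2
    hi_q = 1 - lo_q
    cum, lo, hi, lo_found = 0.0, grid[0], grid[-1], False
    for psi, p in zip(grid, probs):
        cum += p
        if not lo_found and cum >= lo_q:
            lo, lo_found = psi, True
        if cum >= hi_q:
            hi = psi
            break
    return lo, hi


grid, probs = psi_posterior(n_incl=30, n_excl=10)
print(credible_interval(grid, probs))
```

With more reads the interval narrows, which is the practical payoff of reporting a distribution instead of a point estimate.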

Bias Correction and Normalization

• Random hexamer bias (Hansen et al. 2010)
– From PCR or RT primers
– Reestimate FPKM or read counts based on bias
• Upper quartile normalization (Bullard et al. 2010)
– excellent resource for comparison to qPCR and microarray as well as methods of normalization of RNA-seq data
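A minimal sketch of upper quartile normalization, under simplifying assumptions: zero counts are dropped per sample before taking the 75th percentile, and samples are rescaled to the mean upper quartile so the numbers stay on a count-like scale. The published method differs in its details; the data layout and function names here are illustrative.

```python
import statistics


def upper_quartile(counts):
    """75th percentile of the nonzero counts in one sample."""
    nonzero = sorted(c for c in counts if c > 0)
    return statistics.quantiles(nonzero, n=4, method="inclusive")[2]


def upper_quartile_normalize(samples):
    """samples: dict of sample name -> list of gene counts (same gene order).
    Returns counts rescaled so every sample shares the same upper quartile."""
    uq = {name: upper_quartile(counts) for name, counts in samples.items()}
    mean_uq = statistics.mean(uq.values())
    return {
        name: [c * mean_uq / uq[name] for c in counts]
        for name, counts in samples.items()
    }


# Invented example: sample B was sequenced twice as deeply as sample A.
raw = {"A": [0, 10, 20, 30, 40, 50], "B": [0, 20, 40, 60, 80, 100]}
normalized = upper_quartile_normalize(raw)
print(normalized["A"])
print(normalized["B"])
```

Scaling by an upper quantile rather than the total count keeps a handful of extremely abundant transcripts from dominating the normalization factor.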

Differential Expression

• Goal: determine whether observed difference in read counts is greater than would be expected due to random variation.

• If reads were independently sampled from the population, read counts would follow a multinomial distribution, approximated by the Poisson.

• BUT! We know that the count data show more variance than expected under the Poisson.

• The overdispersion problem is mitigated by using the negative binomial distribution, which is determined by its mean and variance.

Sample j, gene i
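Overdispersion is easy to see in replicate counts for a single gene. The sketch below uses a common negative binomial parameterization, variance = mu + alpha * mu^2, where alpha = 0 recovers the Poisson; the symbols `mu` and `alpha` are notation introduced here, not taken from the slide. The estimator is a crude method-of-moments one for illustration, and the counts are invented.

```python
import statistics


def dispersion_estimate(counts):
    """Method-of-moments alpha in Var = mu + alpha * mu**2.
    Returns 0.0 when the variance does not exceed the Poisson expectation."""
    mu = statistics.mean(counts)
    var = statistics.variance(counts)  # sample variance across replicates
    if mu == 0:
        return 0.0
    return max(0.0, (var - mu) / mu ** 2)


biological_reps = [80, 100, 120, 140, 60]   # invented: varies more than Poisson
technical_reps = [98, 100, 102, 101, 99]    # invented: tighter than Poisson

print(dispersion_estimate(biological_reps))
print(dispersion_estimate(technical_reps))
```

With only a few replicates a per-gene estimate like this is very noisy, which is the motivation for sharing information across genes.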


• Binomial test
– Old Cuffdiff
• Negative binomial
– DESeq: estimates variance using all genes with similar expression levels
– Cuffdiff: similar to DESeq, but incorporates fragment assignment uncertainty simultaneously
– edgeR: moderates variance over all genes
– T-test



Some biology, finally?

• How have gene expression patterns changed during the course of differentiation?
• Which genes are specific to certain cell types?
• What can we learn about what those co-expressed genes do?

Clusters of co-expressed genes

• Use unsupervised clustering to group genes by expression pattern

• Use gene ontology information to determine which kinds of genes are in each group

• Reveal novel associations and gene types
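As one possible illustration of unsupervised clustering, here is a small k-means sketch on z-scored expression profiles. The slides do not say which clustering method was used, so k-means is an assumption, as is the deterministic farthest-point initialization. The expression values are invented; only the gene names come from the talk.

```python
import math


def zscore(profile):
    """Center and scale one gene's profile so clustering follows shape, not level."""
    mean = sum(profile) / len(profile)
    sd = math.sqrt(sum((x - mean) ** 2 for x in profile) / len(profile))
    return [(x - mean) / sd if sd else 0.0 for x in profile]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(profiles, k, n_iter=50):
    """profiles: dict gene -> list of expression values. Returns gene -> cluster id."""
    names = list(profiles)
    points = [zscore(profiles[g]) for g in names]
    # Deterministic start: first gene, then repeatedly the point farthest from all centers.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    labels = None
    for _ in range(n_iter):
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return dict(zip(names, labels))


# Invented values at four stages: stem cell, mesoderm, precursor, cardiomyocyte.
expression = {
    "Nanog": [9, 3, 1, 1], "Oct4": [10, 4, 1, 0.5],
    "Mesp1": [1, 9, 2, 1], "Eomes": [1, 8, 3, 1],
    "Isl1": [0.5, 2, 9, 3], "Mef2c": [1, 2, 8, 4],
    "Actc1": [0.5, 1, 3, 10], "Ryr2": [1, 1, 2, 9],
}
clusters = kmeans(expression, k=4)
print(clusters)
```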

Clusters of co-expressed genes

Pluripotency/stem cell: Nanog, Oct4

Mesoderm/cell fate commitment: Mesp1, Eomes

Cardiac precursors: Isl1, Mef2c, Wnt2

Cardiac structure/function: Actc1, Ryr2, Tnni3
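One common way to attach labels like these to a cluster is to test whether a gene ontology term is over-represented among its genes. The sketch below uses a one-sided hypergeometric test; the slides do not state which test was used, so this is an assumption, and the numbers in the example are invented.

```python
from math import comb


def hypergeom_enrichment_p(n_universe, n_annotated, n_cluster, n_overlap):
    """P(overlap >= n_overlap) when n_cluster genes are drawn without replacement
    from n_universe genes, of which n_annotated carry the GO term."""
    upper = min(n_annotated, n_cluster)
    favourable = sum(
        comb(n_annotated, x) * comb(n_universe - n_annotated, n_cluster - x)
        for x in range(n_overlap, upper + 1)
    )
    return favourable / comb(n_universe, n_cluster)


# Invented example: 15,000 expressed genes, 200 annotated with a term,
# a cluster of 300 genes of which 20 carry the term (4 expected by chance).
print(hypergeom_enrichment_p(15_000, 200, 300, 20))
```

Because many terms are tested per cluster, the resulting p-values would normally be corrected for multiple testing before any label is assigned.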

Thanks for listening!

Alisha Holloway
Gladstone Institutes
Bioinformatics Core
