RNA-seq: Quantifying the Transcriptome

RNA-seq: Quantifying the Transcriptome Alisha Holloway, PhD Gladstone Bioinformatics Core Director


What is RNA-seq? Use of high-throughput sequencing technologies to assess the RNA content of a sample.

Transcript of RNA-seq: Quantifying the Transcriptome

Page 1: RNA-seq: Quantifying the Transcriptome

RNA-seq: Quantifying the Transcriptome

Alisha Holloway, PhD
Gladstone Bioinformatics Core Director

Page 2: RNA-seq: Quantifying the Transcriptome

What is RNA-seq?

Use of high-throughput sequencing technologies to assess the RNA content of a sample.

Page 3: RNA-seq: Quantifying the Transcriptome

Why do an RNA-seq experiment?
• Detect differential expression
• Assess allele-specific expression
• Quantify alternative transcript usage
• Discover novel genes/transcripts, gene fusions
• Profile transcriptome
• Ribosome profiling to measure translation

Page 4: RNA-seq: Quantifying the Transcriptome

Why do an RNA-seq experiment?
• Detect differential expression
• Assess allele-specific expression
• Quantify alternative transcript usage
• Discover novel genes/transcripts, gene fusions
• Profile transcriptome
• Ribosome profiling to measure translation

Skelly et al. 2011


Page 7: RNA-seq: Quantifying the Transcriptome

Why do an RNA-seq experiment?
• Detect differential expression
• Assess allele-specific expression
• Quantify alternative transcript usage
• Discover novel genes/transcripts, gene fusions
• Profile transcriptome
• Ribosome profiling to measure translation

[Figure: Pluripotent Stem Cell, Cardiogenic Mesoderm, Cardiac Precursors, Cardiomyocytes]

Page 8: RNA-seq: Quantifying the Transcriptome

Why do an RNA-seq experiment?
• Detect differential expression
• Assess allele-specific expression
• Quantify alternative transcript usage
• Discover novel genes/transcripts, gene fusions
• Profile transcriptome
• Ribosome profiling to measure translation

More tomorrow!

Ingolia et al. 2009, Weissman Lab

Page 9: RNA-seq: Quantifying the Transcriptome

RNA-seq vs. Microarray

RNA-seq
• ID novel genes, transcripts, & exons
• Greater dynamic range
• Less bias due to genetic variation
• Repeatable
• No species-specific primer/probe design
• More accurate relative to qPCR
• Many more applications

Microarray
• Well vetted QC and analysis methods
• Well characterized biases
• Quick turnaround from established core facilities
• Currently less expensive

Page 10: RNA-seq: Quantifying the Transcriptome

[Figures: RNA-seq vs. Affy; RNA-seq vs. Taqman]

Marioni et al. 2008; © 2010 NuGen

Page 11: RNA-seq: Quantifying the Transcriptome

                 Illumina                                Pac-Bio
Read length      100 bp paired end                       2500 bp avg
Throughput       200 million read pairs/lane             1 million reads/SMRT cell
Error rate       <1%                                     15% total, most are indels, 4% SNP
Cost             $600/sample                             $7-8k/sample
Accessibility    UCSF, UC-Davis, BGI                     No commercially available protocols
Uses             DE, ASE, quant. alt. transcript usage   Characterize transcriptome

Page 12: RNA-seq: Quantifying the Transcriptome

When to use Pac-Bio

Page 13: RNA-seq: Quantifying the Transcriptome

Plan it well.

• Experimental design
– Biological replicates
– Reference genome?
– Good gene annotation?
• Read depth
• Barcoding
• Read length
• Paired vs. single-end

[Figure labels: Technical variation, Biological variation]


Page 16: RNA-seq: Quantifying the Transcriptome

How much data do we need?

• ~15-20K genes are expressed in a given tissue or cell line.
• Genes are on average 3 kb long.
• For 1x coverage using 100 bp reads, we would need 600K sequence reads.
• In reality, we need MUCH higher coverage to accurately estimate gene expression levels.
• 50 million reads
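The arithmetic behind these numbers can be reproduced in a few lines. This is only a back-of-the-envelope sketch: it takes the upper end of the gene-count range (20K) and assumes every gene is expressed at the same level, which real transcriptomes are not. The function names are made up for illustration.

```python
# Back-of-the-envelope read-depth arithmetic using the slide's numbers.
GENES_EXPRESSED = 20_000      # upper end of the ~15-20K range
MEAN_GENE_LENGTH_BP = 3_000   # genes are on average 3 kb
READ_LENGTH_BP = 100

# Total bases in the expressed transcriptome (60 million here).
TRANSCRIPTOME_BP = GENES_EXPRESSED * MEAN_GENE_LENGTH_BP


def reads_for_coverage(coverage: float) -> int:
    """Reads needed to cover the expressed transcriptome `coverage` times."""
    return round(coverage * TRANSCRIPTOME_BP / READ_LENGTH_BP)


def mean_coverage(n_reads: int) -> float:
    """Average fold coverage obtained from `n_reads` reads."""
    return n_reads * READ_LENGTH_BP / TRANSCRIPTOME_BP


print(reads_for_coverage(1))                 # 600000
print(round(mean_coverage(50_000_000), 1))   # 83.3
```

Under these assumptions, 50 million reads average roughly 83x coverage. Because expression levels span several orders of magnitude, weakly expressed genes receive far less than that average.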

Page 17: RNA-seq: Quantifying the Transcriptome

Plan it well.

• Experimental design
– Biological replicates
– Reference genome?
– Good gene annotation?
• Read depth
• Barcoding
• Read length
• Paired vs. single-end

200 million reads / lane
Run 4 samples / lane
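The samples-per-lane figure follows from dividing lane throughput by the target depth per sample. A minimal sketch of that calculation follows. The six-sample experiment at the end is a hypothetical example, and the function names are illustrative.

```python
import math

LANE_READS = 200_000_000        # reads per lane, from the slide
READS_PER_SAMPLE = 50_000_000   # target depth per sample


def samples_per_lane(lane_reads: int = LANE_READS,
                     reads_per_sample: int = READS_PER_SAMPLE) -> int:
    """How many barcoded samples fit in one lane at the target depth."""
    return lane_reads // reads_per_sample


def lanes_needed(n_samples: int) -> int:
    """Lanes required for an experiment with `n_samples` barcoded samples."""
    return math.ceil(n_samples / samples_per_lane())


print(samples_per_lane())   # 4
print(lanes_needed(6))      # 2 (hypothetical six-sample experiment)
```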

Page 18: RNA-seq: Quantifying the Transcriptome

Plan it well.

• Experimental design
– Biological replicates
– Reference genome?
– Good gene annotation?
• Read depth
• Barcoding
• Read length
• Paired vs. single-end

Uniq seq = 4^(read length)

Read length    Unique seq
25             1.1 x 10^15
50             1.3 x 10^30
100            1.6 x 10^60

~60 million coding bases in vertebrate genome
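The table values can be checked directly, since the number of distinct sequences of a given length over a four-letter alphabet is 4 raised to that length. The sketch below also compares each value to the 60 million coding bases quoted on the slide. That ratio is only meant to show how sparse the sequence space is; it is not a mappability estimate.

```python
CODING_BASES = 60_000_000  # ~60 million coding bases in a vertebrate genome


def unique_sequences(read_length: int) -> int:
    """Number of distinct DNA sequences of the given length."""
    return 4 ** read_length


for read_length in (25, 50, 100):
    n = unique_sequences(read_length)
    print(f"{read_length:>3} bp: {n:.1e} possible sequences, "
          f"{n / CODING_BASES:.1e} per coding base")
```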

Page 19: RNA-seq: Quantifying the Transcriptome

Plan it well.

• Experimental design
– Biological replicates
– Reference genome?
– Good gene annotation?
• Read depth
• Barcoding
• Read length
• Paired vs. single-end

Paired-end!

• Effectively doubles read length – huge impact on read mapping

• Increases number of splice junction spanning reads

• Critical for estimating transcript-level abundance

Page 20: RNA-seq: Quantifying the Transcriptome

The wet lab side…briefly

Page 21: RNA-seq: Quantifying the Transcriptome

How do you make sense of this pile of data?

• QC
• Alignment
• Expt: Compare two groups
– Transcript Assignment & Abundance
– Differential Expression
• Expt: Allele-specific expression

Page 22: RNA-seq: Quantifying the Transcriptome

QC

• FastQC - http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
• Proportion of reads that mapped uniquely
– Remove duplicates; likely due to PCR amplification
• Assess ribosomal RNA content
• Assess content of possible contaminants – human RNA (if not human samples), Mycoplasma (if cell lines)
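Two of these checks, the uniquely mapped fraction and duplicate removal, can be sketched on a toy set of alignments. The `Alignment` record and its fields are hypothetical stand-ins for what an aligner reports, not the format of any particular tool. Duplicates are defined here simply as reads sharing chromosome, start position, and strand, which is an assumption. Highly expressed genes can also produce identical reads without PCR, so aggressive duplicate removal can undercount them.

```python
from typing import NamedTuple


class Alignment(NamedTuple):
    """Hypothetical, simplified alignment record."""
    read_id: str
    chrom: str
    pos: int      # leftmost mapped position
    strand: str   # "+" or "-"
    n_hits: int   # number of genomic locations reported for this read


def uniquely_mapped(alignments):
    """Keep reads that map to exactly one location."""
    return [a for a in alignments if a.n_hits == 1]


def remove_duplicates(alignments):
    """Keep the first read seen at each (chrom, pos, strand)."""
    seen = set()
    kept = []
    for a in alignments:
        key = (a.chrom, a.pos, a.strand)
        if key not in seen:
            seen.add(key)
            kept.append(a)
    return kept


TOTAL_READS = 5  # toy data: one of five reads did not align at all
alignments = [
    Alignment("r1", "chr1", 100, "+", 1),
    Alignment("r2", "chr1", 100, "+", 1),   # same start as r1: duplicate
    Alignment("r3", "chr2", 500, "-", 1),
    Alignment("r4", "chr1", 900, "+", 3),   # multi-mapping read
]

unique = uniquely_mapped(alignments)
deduplicated = remove_duplicates(unique)
print(f"uniquely mapped: {len(unique) / TOTAL_READS:.0%}")
print([a.read_id for a in deduplicated])
```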

Page 23: RNA-seq: Quantifying the Transcriptome

Then what?

• Align reads to the genome
– Easy(ish) for genomic sequence
– Difficult for transcripts with splice junctions

Page 24: RNA-seq: Quantifying the Transcriptome

Alignment Algorithms

• Burrows-Wheeler Transform
– Bowtie (Langmead et al. 2009)
– BWA (Li and Durbin 2009)
– SOAP2 (Li et al. 2009)
• Smith-Waterman
– BFAST (Homer et al. 2009, based on BLAT) – multiple indexes, finds candidate alignment locations using seed and extend, followed by a gapped Smith-Waterman local alignment for each candidate
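To make the Smith-Waterman step concrete, here is a minimal sketch of the local-alignment scoring recurrence with a linear gap penalty. It is not BFAST's implementation. The scoring values (match 2, mismatch -1, gap -2) are arbitrary assumptions, and only the best score is returned, not the alignment itself.

```python
def smith_waterman(a: str, b: str, match: int = 2,
                   mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)   # previous row of the DP matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman("ACGTACGT", "ACGACGT"))  # 12: seven matches, one gap
print(smith_waterman("AAAA", "TTTT"))         # 0: nothing aligns
```

Running the recurrence only on candidate locations found by the seed step is what keeps the quadratic cost manageable.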

http://en.wikipedia.org/wiki/List_of_sequence_alignment_software
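The Burrows-Wheeler approach can be illustrated with a toy exact-match search. The sketch below builds the transform from a naively sorted suffix array and then runs the standard backward search, one pattern character at a time from right to left. It shows the idea only. Production aligners store compressed occurrence tables instead of recounting characters, and they tolerate mismatches, neither of which is attempted here.

```python
def bwt_index(text: str):
    """Suffix array and Burrows-Wheeler transform of text + '$'."""
    text += "$"   # sentinel; sorts before A, C, G, T
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # Character preceding each sorted suffix (index -1 wraps to '$').
    bwt = "".join(text[i - 1] for i in sa)
    return sa, bwt


def backward_search(bwt: str, pattern: str):
    """Half-open suffix-array interval of suffixes that start with pattern."""
    counts = {}
    for ch in bwt:
        counts[ch] = counts.get(ch, 0) + 1
    # first[c]: number of characters in the text strictly smaller than c.
    first, total = {}, 0
    for ch in sorted(counts):
        first[ch] = total
        total += counts[ch]

    lo, hi = 0, len(bwt)
    for ch in reversed(pattern):
        if ch not in first:
            return 0, 0
        # Naive occurrence counts; real indexes precompute these.
        lo = first[ch] + bwt[:lo].count(ch)
        hi = first[ch] + bwt[:hi].count(ch)
        if lo >= hi:
            return 0, 0
    return lo, hi


def find(text: str, pattern: str):
    """Sorted start positions of every exact occurrence of pattern."""
    sa, bwt = bwt_index(text)
    lo, hi = backward_search(bwt, pattern)
    return sorted(sa[lo:hi])


print(find("GATTACAGATTACA", "TTA"))  # [2, 9]
```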

Page 25: RNA-seq: Quantifying the Transcriptome

Alignment tools for splice junction mapping

• Tophat
• MapSplice
• SpliceMap
• HMMsplicer

Page 26: RNA-seq: Quantifying the Transcriptome

Tophat

• Map reads to transcriptome using Bowtie
• Map to genome to discover novel exons
– or start here if no annotation available

• Split reads to smaller segments; map to genome to discover novel splice junctions

• Report best alignment for each read
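The segment-splitting idea can be sketched on a toy sequence: cut the read into fixed-length segments, map each one, and, when the outer segments land farther apart than the read length allows, search for the split point that makes both halves match. This is a loose illustration and not Tophat's algorithm. The sequences are invented, mapping is exact-match only via `str.find`, and every segment is assumed to occur once in the genome.

```python
def map_segments(read: str, genome: str, seg_len: int):
    """Exact-match each fixed-length segment; -1 means unmapped."""
    segments = [read[i:i + seg_len] for i in range(0, len(read), seg_len)]
    return [genome.find(seg) for seg in segments]


def find_junction(read: str, genome: str, seg_len: int = 8):
    """Return (intron_start, intron_end) as a half-open genome interval,
    or None if the read maps contiguously or cannot be resolved."""
    hits = map_segments(read, genome, seg_len)
    first_pos, last_pos = hits[0], hits[-1]
    if first_pos < 0 or last_pos < 0:
        return None
    # Where the last segment would start if the read were contiguous.
    expected_last = first_pos + (len(hits) - 1) * seg_len
    intron_len = last_pos - expected_last
    if intron_len <= 0:
        return None
    # Try every split point: read[:k] before the intron, read[k:] after it.
    for k in range(1, len(read)):
        right_start = first_pos + k + intron_len
        left_ok = genome[first_pos:first_pos + k] == read[:k]
        right_ok = genome[right_start:right_start + len(read) - k] == read[k:]
        if left_ok and right_ok:
            return first_pos + k, right_start
    return None


# Invented toy gene: two exons separated by a GT...AG intron.
exon1 = "ATGGCCAAGTTCGACCTGAA"
intron = "GTAAGTCCCCCCCCTTTTTTTTTTCAG"
exon2 = "CGGATTCTAGCATCGGTACC"
genome = exon1 + intron + exon2
read = exon1[-12:] + exon2[:12]   # 24 bp read spanning the junction

print(find_junction(read, genome))   # (20, 47)
```

The middle segment of this read spans the junction and therefore fails to map on its own, which is the situation the split-read step is designed to rescue.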

Trapnell et al. Bioinformatics 2009; Trapnell et al. Nature Protocols 2012

Page 27: RNA-seq: Quantifying the Transcriptome

MapSplice & SpliceMap

Wang et al. NAR 2010, Au et al. NAR 2010

• Tag alignment (user chooses aligner)
– Break reads into segments
– Map reads
– Unmapped segments considered for splice junction mapping based on location of partner segment
– Merge segments from read for final alignment
• Assess splice junction quality

Page 28: RNA-seq: Quantifying the Transcriptome

HMMsplicer

• Remove reads that map contiguously
• Hidden Markov model to detect exon boundary of remaining reads
• Compute intensive
• Reference annotation not used
• Best for compact genomes
• User sets threshold for accepting splice junction.

Dimon et al. PLoS One 2010

Page 29: RNA-seq: Quantifying the Transcriptome

HMMsplicer

Page 30: RNA-seq: Quantifying the Transcriptome

Martin & Wang, Nature Reviews Genetics 2011

Transcript Assignment/Abundance

Page 31: RNA-seq: Quantifying the Transcriptome

Transcript Assignment & Abundance Tools

• For DE:
– Cufflinks
– MISO
– Scripture – not maintained
• De novo assembly
– Cufflinks
– Trans-ABySS
– Trinity
– Maker

Page 32: RNA-seq: Quantifying the Transcriptome

Cufflinks

• Constructs the parsimonious set of transcripts that explain the reads observed. Basically, finds a minimum path cover on the DAG.

• Derives a likelihood for the abundances of a set of transcripts given a set of fragments.

• FPKM – fragments per kb of exon per million fragments mapped.
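The FPKM definition translates directly into a formula. The sketch below applies it to raw counts for a single transcript. It illustrates the unit only and is not how Cufflinks arrives at its estimates, which come from the likelihood model described above. The example numbers are hypothetical.

```python
def fpkm(fragments: int, exon_length_bp: int,
         total_mapped_fragments: int) -> float:
    """Fragments per kilobase of exon per million fragments mapped."""
    kilobases = exon_length_bp / 1_000
    millions_mapped = total_mapped_fragments / 1_000_000
    return fragments / (kilobases * millions_mapped)


# Hypothetical transcript: 600 fragments on 3 kb of exon, 50 million mapped.
print(fpkm(600, 3_000, 50_000_000))   # 4.0
```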

Trapnell, Pachter

Page 33: RNA-seq: Quantifying the Transcriptome

MISO

• Mixture of Isoforms
• Bayesian – treats expression level of set of isoforms as random variable and estimates a distribution over the values of this variable.

• Gives confidence intervals for expression estimates and measures of DE as Bayes factors

Burge Lab @ MIT

Page 34: RNA-seq: Quantifying the Transcriptome

Bias Correction and Normalization

• Random hexamer bias (Hansen et al. 2010)
– From PCR or RT primers
– Reestimate FPKM or read counts based on bias
• Upper quartile normalization (Bullard et al. 2010)
– excellent resource for comparison to qPCR and microarray as well as methods of normalization of RNA-seq data
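A simplified sketch of upper quartile normalization: each sample's counts are divided by the 75th percentile of its nonzero counts, then rescaled by the mean of those percentiles so the values stay on a count-like scale. The choices here are assumptions rather than a reproduction of the published procedure, in particular filtering zeros per sample and the quantile interpolation method. The sample names and counts are toy values.

```python
from statistics import quantiles


def upper_quartile(counts):
    """75th percentile of the nonzero counts in one sample."""
    nonzero = [c for c in counts if c > 0]
    return quantiles(nonzero, n=4, method="inclusive")[2]


def upper_quartile_normalize(samples):
    """samples: dict of sample name -> per-gene counts (same gene order)."""
    uq = {name: upper_quartile(counts) for name, counts in samples.items()}
    mean_uq = sum(uq.values()) / len(uq)
    return {
        name: [c / uq[name] * mean_uq for c in counts]
        for name, counts in samples.items()
    }


# Toy data: sample B was sequenced twice as deeply as sample A.
samples = {
    "A": [0, 10, 20, 30, 40],
    "B": [0, 20, 40, 60, 80],
}
normalized = upper_quartile_normalize(samples)
print(normalized["A"])
print(normalized["B"])
```

After scaling, the two toy samples agree gene by gene, because their only difference was sequencing depth.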

Page 35: RNA-seq: Quantifying the Transcriptome

Differential Expression

• Goal: determine whether observed difference in read counts is greater than would be expected due to random variation.

• If reads were independently sampled from the population, read counts would follow a multinomial distribution, which is well approximated by the Poisson.
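The property that matters for what follows is that a Poisson count has variance equal to its mean. The sketch below verifies this numerically from the probability mass function. The rate of 50 reads is an arbitrary example value.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """P(K = k) for a Poisson with rate lam, computed in log space."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def mean_and_variance(lam: float, k_max: int = 1000):
    """Mean and variance summed over counts 0..k_max-1."""
    probs = [poisson_pmf(k, lam) for k in range(k_max)]
    mean = sum(k * p for k, p in enumerate(probs))
    var = sum((k - mean) ** 2 * p for k, p in enumerate(probs))
    return mean, var


mean, var = mean_and_variance(50.0)
print(round(mean, 6), round(var, 6))   # both equal the rate, 50.0
```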

Page 36: RNA-seq: Quantifying the Transcriptome

Differential Expression

• BUT! We know that the count data show more variance than expected

• Overdispersion problem mitigated by using the negative binomial distribution, which is determined by mean and variance

Sample j, gene i
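The equation from the slide is not preserved in this transcript. One common way to write such a model, given here as an assumption rather than a reconstruction of the slide, uses $K_{ij}$ for the read count of gene $i$ in sample $j$ and a per-gene dispersion $\alpha_i$ (a symbol introduced here for illustration):

```latex
K_{ij} \sim \mathrm{NB}\!\left(\mu_{ij},\, \sigma_{ij}^{2}\right),
\qquad
\sigma_{ij}^{2} = \mu_{ij} + \alpha_i\, \mu_{ij}^{2}
```

With $\alpha_i = 0$ the variance equals the mean and the Poisson case is recovered. Any $\alpha_i > 0$ adds variance that grows with the square of the mean, which is how the model absorbs overdispersion.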

Page 37: RNA-seq: Quantifying the Transcriptome

Differential Expression

• Binomial test
– Old Cuffdiff
• Negative binomial
– DESeq – estimates variance using all genes with similar expression levels
– Cuffdiff – similar to DESeq, but incorporates fragment assignment uncertainty simultaneously
– edgeR – moderates variance over all genes
– T-test

Page 38: RNA-seq: Quantifying the Transcriptome

Differential Expression

Old cuffdiff

Page 39: RNA-seq: Quantifying the Transcriptome

Some biology, finally?

• How have gene expression patterns changed during the course of differentiation?
• Which genes are specific to certain cell types?
• What can we learn about what those co-expressed genes do?

Page 40: RNA-seq: Quantifying the Transcriptome

Clusters of co-expressed genes

• Use unsupervised clustering to group genes by expression pattern

• Use gene ontology information to determine which kinds of genes are in each group

• Reveal novel associations and gene types
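As a minimal sketch of grouping genes by expression pattern, the example below z-scores each gene's profile across four stages and runs a small k-means. Everything numeric is invented for illustration: the expression values are not data from the talk, and the initial centroids are deliberately seeded with one peak per stage so the run is deterministic. A real analysis would use random restarts or hierarchical clustering, and would pick the number of clusters from the data.

```python
from statistics import mean, pstdev


def zscore(profile):
    """Standardize one gene's profile so clustering compares shape only."""
    m, s = mean(profile), pstdev(profile)
    return [(x - m) / s for x in profile]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(profiles, initial_centroids, n_iter=20):
    """profiles: dict gene -> vector. Returns dict gene -> cluster index."""
    centroids = [list(c) for c in initial_centroids]
    labels = {}
    for _ in range(n_iter):
        # Assignment step: nearest centroid for every gene.
        labels = {
            gene: min(range(len(centroids)),
                      key=lambda c: sq_dist(vec, centroids[c]))
            for gene, vec in profiles.items()
        }
        # Update step: each centroid becomes the mean of its members.
        for c in range(len(centroids)):
            members = [profiles[g] for g, lab in labels.items() if lab == c]
            if members:
                centroids[c] = [mean(col) for col in zip(*members)]
    return labels


# Invented expression values across four differentiation stages.
raw = {
    "Nanog": [100, 20, 5, 2],  "Oct4":  [90, 30, 4, 1],
    "Mesp1": [2, 80, 10, 3],   "Eomes": [5, 70, 8, 2],
    "Isl1":  [1, 10, 90, 20],  "Mef2c": [2, 8, 70, 30],
    "Actc1": [1, 3, 30, 100],  "Ryr2":  [0, 2, 20, 90],
}
profiles = {gene: zscore(values) for gene, values in raw.items()}

# Hypothetical seeds: one "peaks at stage k" pattern per cluster.
seeds = [zscore([1 if i == k else 0 for i in range(4)]) for k in range(4)]
labels = kmeans(profiles, seeds)

for cluster in range(4):
    print(cluster, [g for g, lab in labels.items() if lab == cluster])
```

Each resulting cluster could then be tested for enriched gene ontology terms to see which kinds of genes it contains.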

Page 41: RNA-seq: Quantifying the Transcriptome

Clusters of co-expressed genes

Pluripotency/stem cell: Nanog, Oct4

Mesoderm/cell fate commitment: Mesp1, Eomes

Cardiac precursors: Isl1, Mef2c, Wnt2

Cardiac structure/function: Actc1, Ryr2, Tnni3

Page 42: RNA-seq: Quantifying the Transcriptome

Thanks for listening!

Alisha Holloway
Gladstone Institutes
Bioinformatics Core
