Cufflinks Matt Paisner, Hua He, Steve Smith and Brian Lovett.


Page 1: Cufflinks Matt Paisner, Hua He, Steve Smith and Brian Lovett.

Cufflinks

Matt Paisner, Hua He, Steve Smith and Brian Lovett

Page 2: Cufflinks Matt Paisner, Hua He, Steve Smith and Brian Lovett.

The Vision

• RNA-Seq can be used for transcript discovery and abundance estimation.

• What’s missing: algorithms that are not restricted by prior gene annotations (which are often incomplete) and that account for alternative transcription and splicing.

• Hence, Cufflinks.

Page 3: Cufflinks Matt Paisner, Hua He, Steve Smith and Brian Lovett.

The Need

• Evidence of ambiguous assignment of isoforms.

o TSS/promoter changes and splice site changes were found previously by the authors

• Longer reads and paired-end reads do not do enough

Page 4: Cufflinks Matt Paisner, Hua He, Steve Smith and Brian Lovett.

The Biology

• General assumption of randomization of reads

• Central Dogma

• Transcription Start Site (TSS)

• Splice site

• Isoform

Page 5: Cufflinks Matt Paisner, Hua He, Steve Smith and Brian Lovett.

Central Dogma and Regulation

Page 6: Cufflinks Matt Paisner, Hua He, Steve Smith and Brian Lovett.

Splicing

• Thisblahblahblahblahblahblahisblahblahimportant

• “This” “is” “important” - Exons

• “blah” - Introns (Intrusions)
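The toy sentence above can be "spliced" in a few lines of Python. This is only a sketch of the analogy on the slide, assuming the intron is the literal string "blah"; the function names `splice` and `exons` are made up for illustration and say nothing about how real splice sites are recognized.

```python
def exons(pre_mrna: str, intron: str = "blah") -> list:
    """Return the pieces left over once every copy of the toy intron is cut out."""
    return [piece for piece in pre_mrna.split(intron) if piece]


def splice(pre_mrna: str, intron: str = "blah") -> str:
    """Join the exons back together, which is what the mature message looks like."""
    return "".join(exons(pre_mrna, intron))


pre_mrna = "Thisblahblahblahblahblahblahisblahblahimportant"
print(exons(pre_mrna))   # ['This', 'is', 'important']
print(splice(pre_mrna))  # Thisisimportant
```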

Page 7: Cufflinks Matt Paisner, Hua He, Steve Smith and Brian Lovett.

Major Change 1

Page 8: Cufflinks Matt Paisner, Hua He, Steve Smith and Brian Lovett.

Major Change 2

Page 9: Cufflinks Matt Paisner, Hua He, Steve Smith and Brian Lovett.

Why it matters: Isoforms

• Not only different sizes, but different shapes

• Shape determines function

• Isoforms would map to the same section of the genome: undetected without Cufflinks

• Separating transcripts into isoforms elucidates a more realistic representation of what is happening

Page 10: Cufflinks Matt Paisner, Hua He, Steve Smith and Brian Lovett.
Page 11: Cufflinks Matt Paisner, Hua He, Steve Smith and Brian Lovett.

TopHat: Mapping short reads

Trapnell et al., Bioinformatics, 2009

Page 12: Cufflinks Matt Paisner, Hua He, Steve Smith and Brian Lovett.

TopHat

• No genome reference annotations are needed

• The output of TopHat is the input of Cufflinks.

• Input: Reads and genome

• Output: Read mappings

• Short reads present computational challenges

o Bowtie

Page 13: Cufflinks Matt Paisner, Hua He, Steve Smith and Brian Lovett.

How does TopHat Work?!

Big Idea: “Exon Inference”!!

Page 14: Cufflinks Matt Paisner, Hua He, Steve Smith and Brian Lovett.

Step 1: Initial Mapping via Bowtie

• Group 1: Mapped Reads (Segments)

• Group 2: Initially Unmapped (IUM) Reads

o possibly intron-spanning reads

• Based on Group 1, we want to get intron-spanning reads from Group 2

[Figure: mapped reads aligned to the reference]

Page 15: Cufflinks Matt Paisner, Hua He, Steve Smith and Brian Lovett.

Step 2: Generate Putative Exons

Page 16: Cufflinks Matt Paisner, Hua He, Steve Smith and Brian Lovett.

Step 3: Look for Potential Splice Signals

Putative Exons

Page 17: Cufflinks Matt Paisner, Hua He, Steve Smith and Brian Lovett.

Step 4: Seed-and-Extend

Page 18: Cufflinks Matt Paisner, Hua He, Steve Smith and Brian Lovett.
Page 19: Cufflinks Matt Paisner, Hua He, Steve Smith and Brian Lovett.

Cufflinks: Isoform/Transcript Detection and Quantification

Trapnell et al., Nature Biotech, 2010

Page 20: Cufflinks Matt Paisner, Hua He, Steve Smith and Brian Lovett.

Step 5: Identify Compatible Reads

Two reads are compatible if, within the region where they overlap, they imply exactly the same introns (or none). Otherwise they are incompatible.
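A minimal sketch of this compatibility test follows. It assumes a read is described by half-open genome coordinates plus the set of introns its alignment implies, and it treats reads that do not overlap at all as trivially compatible; the `Read` type and function names are hypothetical, not Cufflinks code.

```python
from typing import FrozenSet, NamedTuple, Set, Tuple

Interval = Tuple[int, int]  # half-open [start, end) in genome coordinates


class Read(NamedTuple):
    start: int
    end: int
    introns: FrozenSet[Interval] = frozenset()  # introns implied by the alignment


def introns_in(read: Read, lo: int, hi: int) -> Set[Interval]:
    """Introns of `read` that intersect the window [lo, hi)."""
    return {(s, e) for (s, e) in read.introns if s < hi and e > lo}


def compatible(x: Read, y: Read) -> bool:
    """True if x and y imply the same introns inside their overlap."""
    lo, hi = max(x.start, y.start), min(x.end, y.end)
    if lo >= hi:  # assumption: no overlap means nothing can conflict
        return True
    return introns_in(x, lo, hi) == introns_in(y, lo, hi)


spliced = Read(0, 100, frozenset({(40, 60)}))
same_splice = Read(30, 110, frozenset({(40, 60)}))
unspliced = Read(50, 120)

print(compatible(spliced, same_splice))  # True
print(compatible(spliced, unspliced))    # False
```

The second pair is incompatible because the unspliced read covers positions 50 to 60 as exon while the spliced read places them inside an intron.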

Page 21: Cufflinks Matt Paisner, Hua He, Steve Smith and Brian Lovett.

Step 6:

Less BIOLOGY, and NOW it is time for some GRAPH THEORY…

“We emphasize that the definition of a transcription locus is not biological……” - Authors

Page 22: Cufflinks Matt Paisner, Hua He, Steve Smith and Brian Lovett.

Step 6: Create Overlap Graph

Connect compatible reads in order

Create a DAG
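One way to picture this step in code: sort the reads by position and draw an edge from each read to every later read it is compatible with. Because edges only point forward in the sorted order, the result is a DAG. This is a sketch under simplifying assumptions (it keeps every compatible pair rather than a reduced edge set, and the compatibility predicate is passed in by the caller); the names are hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Read = Tuple[str, int, int]  # (name, start, end)


def build_overlap_graph(
    reads: Sequence[Read],
    compatible: Callable[[Read, Read], bool],
) -> Dict[str, List[str]]:
    """Connect compatible reads in genomic order; edges go left to right only."""
    ordered = sorted(reads, key=lambda r: (r[1], r[2]))
    graph: Dict[str, List[str]] = {r[0]: [] for r in ordered}
    for i, x in enumerate(ordered):
        for y in ordered[i + 1:]:
            if compatible(x, y):
                graph[x[0]].append(y[0])
    return graph


# Toy input: r1 and r2 disagree about an intron, both agree with r3.
reads = [("r3", 40, 90), ("r1", 0, 50), ("r2", 10, 60)]
conflicts = {frozenset({"r1", "r2"})}


def toy_compatible(x: Read, y: Read) -> bool:
    return frozenset({x[0], y[0]}) not in conflicts


print(build_overlap_graph(reads, toy_compatible))
# {'r1': ['r3'], 'r2': ['r3'], 'r3': []}
```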

Page 23: Cufflinks Matt Paisner, Hua He, Steve Smith and Brian Lovett.

A path in this graph corresponds to a transcript isoform

Page 24: Cufflinks Matt Paisner, Hua He, Steve Smith and Brian Lovett.

Theory

1. Solving minimum path cover in the overlap graph yields the fewest transcripts (isoforms) necessary to explain the reads.

2. The size of a minimum path cover equals the size of the largest set of individual reads such that no two are compatible.

3. By Dilworth's theorem, the problem reduces to finding a maximum matching in a bipartite graph.

Page 25: Cufflinks Matt Paisner, Hua He, Steve Smith and Brian Lovett.

Step 8: Convert a DAG into a Bipartite Graph

Page 26: Cufflinks Matt Paisner, Hua He, Steve Smith and Brian Lovett.

Step 9: Looking for Maximum Matching inside a bipartite graph via Bipartite Matching Algorithm

BIPARTITE-MATCHING Algorithm: Find an augmenting path via BFS and add it to the matching; repeat until no augmenting path remains.
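The two steps above (split the DAG into a bipartite graph, then grow a matching with BFS augmenting paths) can be sketched as follows. Each node gets a "left" copy for outgoing edges and a "right" copy for incoming edges; every matched edge glues two nodes into the same path, so the number of paths is the number of nodes minus the matching size. This is a plain textbook version, not the Cufflinks implementation, and it assumes the input graph already contains an edge for every compatible ordered pair.

```python
from collections import deque
from typing import Dict, Hashable, List, Optional, Tuple

Graph = Dict[Hashable, List[Hashable]]  # every node is a key, sinks map to []


def max_matching(graph: Graph) -> Tuple[dict, dict]:
    """Maximum matching between left copies and right copies of the DAG nodes."""
    match_left: Dict[Hashable, Optional[Hashable]] = {u: None for u in graph}
    match_right: Dict[Hashable, Optional[Hashable]] = {v: None for v in graph}
    for root in graph:  # each left node is unmatched when its turn comes
        parent = {root: None}  # BFS tree over left nodes
        seen_right = set()
        queue = deque([root])
        found = False
        while queue and not found:
            u = queue.popleft()
            for v in graph[u]:
                if v in seen_right:
                    continue
                seen_right.add(v)
                w = match_right[v]
                if w is None:
                    # Free right node reached: flip edges back to the root.
                    cur_u, cur_v = u, v
                    while cur_u is not None:
                        prev_v = match_left[cur_u]
                        match_left[cur_u] = cur_v
                        match_right[cur_v] = cur_u
                        cur_u, cur_v = parent[cur_u], prev_v
                    found = True
                    break
                parent[w] = u
                queue.append(w)
    return match_left, match_right


def min_path_cover(graph: Graph) -> List[list]:
    """Vertex-disjoint paths covering all nodes, read off the matching."""
    match_left, match_right = max_matching(graph)
    paths = []
    for start in graph:
        if match_right[start] is None:  # nothing precedes this node
            path = [start]
            while match_left[path[-1]] is not None:
                path.append(match_left[path[-1]])
            paths.append(path)
    return paths


# r2 and r3 are incompatible, so at least two paths (isoforms) are needed.
dag = {"r1": ["r2", "r3", "r4"], "r2": ["r4"], "r3": ["r4"], "r4": []}
print(min_path_cover(dag))  # [['r1', 'r2', 'r4'], ['r3']]
```

The paths returned are disjoint chains of reads; a chain such as `['r3']` can still be extended through compatible neighbours when the transcript itself is assembled.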

Page 27: Cufflinks Matt Paisner, Hua He, Steve Smith and Brian Lovett.

A path in this graph corresponds to a transcript isoform

Page 28: Cufflinks Matt Paisner, Hua He, Steve Smith and Brian Lovett.

Projective normalization underestimates expression

[Figure: isoform a and isoform b; all isoforms are projected into genome coordinates]

R reads total, r reads for the gene:

- ra for isoform a

- rb for isoform b

[Equations of the form “… but … so …”; not preserved in this transcript]
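The equations on this slide did not survive, so the following is only one plausible reading of the argument, not a recovery of the original. It assumes ℓ is the length of the projection of all isoforms onto the genome, ℓ_a and ℓ_b are the isoform lengths, and r = r_a + r_b. The projected length is at least as long as either isoform, so normalizing the gene's reads by it can only lower the estimate relative to the sum of per-isoform estimates:

```latex
\ell \ge \ell_a, \quad \ell \ge \ell_b
\quad\Longrightarrow\quad
\frac{r}{\ell R} \;=\; \frac{r_a + r_b}{\ell R}
\;\le\; \frac{r_a}{\ell_a R} + \frac{r_b}{\ell_b R}
```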

Page 29: Cufflinks Matt Paisner, Hua He, Steve Smith and Brian Lovett.

How should expression levels be estimated?

• A-B are distinguished by the presence of splice junction (a) or (b).

• A-C are distinguished by the presence of splice junction (a) and change in UTR

• B-C are distinguished by the presence of splice junction (b) and change in UTR

(a)(b)

Page 30: Cufflinks Matt Paisner, Hua He, Steve Smith and Brian Lovett.

How should expression levels be estimated?

• Longer transcripts contain more reads.

• Reads that could have originated from multiple transcripts are informative.

• Relative abundance estimation requires “discriminatory reads”.

(a)(b)

Page 31: Cufflinks Matt Paisner, Hua He, Steve Smith and Brian Lovett.

A model for RNA-Seq

• ρ = Transcript proportions for assignment of reads to transcripts

• L = Likelihood of this assignment

• R = all reads

Page 32: Cufflinks Matt Paisner, Hua He, Steve Smith and Brian Lovett.

A model for RNA-Seq

• T = All transcripts

Page 33: Cufflinks Matt Paisner, Hua He, Steve Smith and Brian Lovett.

A model for RNA-Seq

Define:

• Expected possible positions for an arbitrary fragment in Transcript t

• F(i) = pr(random fragment has length i)

• l(t) = Full length of transcript t
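The formula for the first definition is missing from the transcript. A common way to write such an "expected number of positions" is to note that a fragment of length i can start at l(t) - i + 1 places in transcript t and to weight that count by F(i). The sketch below assumes that reading; the function name and the dictionary representation of F are hypothetical.

```python
from typing import Dict


def effective_length(transcript_length: int, frag_len_prob: Dict[int, float]) -> float:
    """Sum over fragment lengths i of F(i) * (l(t) - i + 1).

    Fragment lengths longer than the transcript contribute nothing,
    because such a fragment has no valid start position.
    """
    return sum(
        p * (transcript_length - i + 1)
        for i, p in frag_len_prob.items()
        if 1 <= i <= transcript_length
    )


# Toy distribution: fragments are 2 or 3 bases long with equal probability.
F = {2: 0.5, 3: 0.5}
print(effective_length(4, F))  # 0.5 * 3 + 0.5 * 2 = 2.5
```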

Page 34: Cufflinks Matt Paisner, Hua He, Steve Smith and Brian Lovett.

A model for RNA-Seq

Page 35: Cufflinks Matt Paisner, Hua He, Steve Smith and Brian Lovett.

A model for RNA-Seq

• I_t(r) = Implied length of r’s fragment if r is assigned to transcript t

• Recall: F(i) = pr(fragment length = i)
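The expression that combines these two quantities is not in the transcript. One natural form, assumed here rather than taken from the slide, is: the probability of the implied fragment length, F(I_t(r)), divided by the number of positions where a fragment of that length could start in t, l(t) - I_t(r) + 1. The function below is a hypothetical sketch of that per-read term.

```python
from typing import Dict


def read_given_transcript(
    implied_length: int,
    transcript_length: int,
    frag_len_prob: Dict[int, float],
) -> float:
    """P(read | transcript) under the assumed model: F(I_t(r)) / (l(t) - I_t(r) + 1)."""
    positions = transcript_length - implied_length + 1
    if positions <= 0:
        return 0.0  # the fragment cannot fit in this transcript
    return frag_len_prob.get(implied_length, 0.0) / positions


F = {2: 0.5, 3: 0.5}
# A read whose fragment would be 3 long if it came from a transcript of length 4:
print(read_given_transcript(3, 4, F))  # 0.5 / 2 = 0.25
```

The same read can imply different fragment lengths on different isoforms (for example when an exon is skipped), which is what lets the fragment length distribution discriminate between them.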

Page 36: Cufflinks Matt Paisner, Hua He, Steve Smith and Brian Lovett.

Projective normalization underestimates expression

[Figure: isoform a and isoform b; all isoforms are projected into genome coordinates]

R reads total, r reads for the gene:

- ra for isoform a

- rb for isoform b

[Equations of the form “… but … so …”; not preserved in this transcript]

Page 37: Cufflinks Matt Paisner, Hua He, Steve Smith and Brian Lovett.

A model for RNA-Seq

• Now we have a likelihood function to maximize in terms of ρ, the distribution of reads among transcripts.

• Non-negative linear model

Page 38: Cufflinks Matt Paisner, Hua He, Steve Smith and Brian Lovett.

Inference with the sequencing model

• The likelihood function is concave, so it can be maximized with the EM algorithm.

• Asymptotic MLE theory leads to a covariance matrix for the estimator in the form of the inverse of the observed Fisher information matrix.

• Importance sampling from the posterior distribution is used to estimate the abundances (as the posterior expectation) and 95% confidence intervals for the estimates.

• This approach extends the log linear model of H. Jiang and W. Wong, Bioinformatics 2009 to a linear model for paired end reads.

• For more background see Li et al., Bioinformatics, 2010 and Bullard et al., BMC Bioinformatics, 2010.
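To make the EM step concrete, here is a generic EM loop for mixture proportions. It assumes the per-read, per-transcript likelihoods have already been computed and estimates only the share of reads attributed to each transcript; turning those shares into transcript abundances would additionally need a length correction. It is a textbook sketch with hypothetical names, not the Cufflinks estimator, and it omits the Fisher information and importance sampling steps.

```python
from typing import List, Sequence


def em_proportions(lik: Sequence[Sequence[float]], n_iter: int = 200) -> List[float]:
    """EM for mixture proportions.

    lik[r][t] is the likelihood of read r given transcript t.
    Assumes at least one read has nonzero likelihood for some transcript.
    """
    n_tx = len(lik[0])
    theta = [1.0 / n_tx] * n_tx  # start from equal proportions
    for _ in range(n_iter):
        counts = [0.0] * n_tx
        for row in lik:
            # E-step: split this read among transcripts by posterior weight.
            weights = [theta[t] * row[t] for t in range(n_tx)]
            total = sum(weights)
            if total == 0.0:
                continue  # read is explained by no transcript; skip it
            for t in range(n_tx):
                counts[t] += weights[t] / total
        # M-step: proportions are the normalized expected read counts.
        assigned = sum(counts)
        theta = [c / assigned for c in counts]
    return theta


# Two reads unique to isoform A, one unique to B, one ambiguous read.
likelihoods = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
print(em_proportions(likelihoods))  # close to [2/3, 1/3]
```

In this toy case the ambiguous read ends up split in the same 2:1 ratio as the discriminatory reads.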

Page 39: Cufflinks Matt Paisner, Hua He, Steve Smith and Brian Lovett.

Utility of Cufflinks

• mRNA as proxy for gene expression & action

• Control points

o transcriptional vs

o post transcriptional

• Does isoform-level discovery & quantification matter?

o Apparently, yes

o Putatively discovered about 12K new isoforms while recovering about 13K known

o Plus other stuff…

Page 40: Cufflinks Matt Paisner, Hua He, Steve Smith and Brian Lovett.

The skeletal myogenesis transcriptome

RNA-Seq (2x75bp GAIIx) along time course of mouse C2C12 differentiation

Time points: -24 hours, 60 hours, 120 hours, 168 hours (differentiation starting at 0 hours)

Stages: myocyte, fusion, myotube

Illustration based on: Ohtake et al, J. Cell Sci., 2006; 119:3822-3832

Per-sample counts, in slide order:

Reads          Alignments     To junctions   Transfrags
84,369,078     66,541,668     10,754,363     58,008
140,384,062    103,681,081    19,194,697     69,716
82,138,212     47,431,271     9,015,806      55,241
123,575,666    89,162,512     17,449,848     63,664

Slide courtesy of Hector Corrada Bravo

Page 41: Cufflinks Matt Paisner, Hua He, Steve Smith and Brian Lovett.

Validation

Page 42: Cufflinks Matt Paisner, Hua He, Steve Smith and Brian Lovett.

Validation

Page 43: Cufflinks Matt Paisner, Hua He, Steve Smith and Brian Lovett.

Projective normalization underestimates expression

[Figure: isoform a and isoform b; all isoforms are projected into genome coordinates]

R reads total, r reads for the gene:

- ra for isoform a

- rb for isoform b

[Equations of the form “… but … so …”; not preserved in this transcript]

Slide courtesy of Hector Corrada Bravo

Page 44: Cufflinks Matt Paisner, Hua He, Steve Smith and Brian Lovett.

Discovery is necessary for accurate abundance estimates

Slide courtesy of Hector Corrada Bravo

Page 45: Cufflinks Matt Paisner, Hua He, Steve Smith and Brian Lovett.

Some Questions…

• Do isoforms of a given gene have interesting temporal patterns?

o Increasing, decreasing, more complex…

• What does this mean biologically?

• What about transcriptional versus post transcriptional regulation?

o Differential transcription

o Differential splicing

Page 46: Cufflinks Matt Paisner, Hua He, Steve Smith and Brian Lovett.

Dynamics of Myc expression

Slide courtesy of Hector Corrada Bravo

Page 47: Cufflinks Matt Paisner, Hua He, Steve Smith and Brian Lovett.

Overloading Metric using Jensen-Shannon Divergence

Metric: [equation not preserved; its two terms are labeled “Entropy of Average” and “Average Entropy”]

One-sided t-test under the null hypothesis that there is no difference in abundance; Type I errors controlled with Benjamini-Hochberg correction (FDR)
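The two labels on this slide match the usual definition of the Jensen-Shannon divergence: the entropy of the average distribution minus the average of the individual entropies. Its square root is a metric. The sketch below assumes each condition is summarized as a vector of relative isoform abundances that sums to 1; the function names are hypothetical, and the t-test and FDR correction are not shown.

```python
import math
from typing import List, Sequence


def entropy(p: Sequence[float]) -> float:
    """Shannon entropy in bits; zero entries contribute nothing."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def js_divergence(dists: Sequence[Sequence[float]]) -> float:
    """Entropy of the average distribution minus the average entropy."""
    m = len(dists)
    average: List[float] = [sum(col) / m for col in zip(*dists)]
    return entropy(average) - sum(entropy(p) for p in dists) / m


def js_distance(dists: Sequence[Sequence[float]]) -> float:
    """Square root of the divergence; clamped at 0 to absorb rounding error."""
    return math.sqrt(max(js_divergence(dists), 0.0))


# Complete isoform switch between two conditions vs. no change at all.
print(js_distance([[1.0, 0.0], [0.0, 1.0]]))  # 1.0
print(js_distance([[0.7, 0.3], [0.7, 0.3]]))  # 0.0
```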

Page 48: Cufflinks Matt Paisner, Hua He, Steve Smith and Brian Lovett.

Regulatory Overloading

[Venn diagram, # Genes (FDR < 0.05): Differential splicing vs. Differential TSS preference; region counts 231, 101, 17; example genes: Fibronectin, Tropomyosin 1, Mef2d, …; Fhl3, Fhl1, Myl1, …]

Page 49: Cufflinks Matt Paisner, Hua He, Steve Smith and Brian Lovett.

Dynamics of Myc expression

d( , )

Slide courtesy of Hector Corrada Bravo

Page 50: Cufflinks Matt Paisner, Hua He, Steve Smith and Brian Lovett.

New TSS = New Points of Regulation

TSS=Transcription Start Site

What would a “collapsed” RNA-seq alignment look like? Microarray?

Page 51: Cufflinks Matt Paisner, Hua He, Steve Smith and Brian Lovett.

Questions?

Page 52: Cufflinks Matt Paisner, Hua He, Steve Smith and Brian Lovett.

I am the DNA, and I want a protein!

The DNA wants a protein.

Transcription

Translation

mRNA

Protein