TOX680 Unveiling the Transcriptome using RNA-seq


Page 1: TOX680 Unveiling the Transcriptome using RNA-seq

TOX680: Unveiling the Transcriptome using RNA-seq

Jinze Liu

Page 2: TOX680 Unveiling the Transcriptome using RNA-seq

Outline

• What is the transcriptome?
• Measuring the transcriptome
• Sampling the transcriptome using short reads
• Alignment of reads to a reference genome
• Splice graph representation of RNA-seq data
• Reconstructing the transcriptome
• Differential analysis of the transcriptome

Page 3: TOX680 Unveiling the Transcriptome using RNA-seq

Genome, Transcriptome, Proteome

[Figure: schematic illustration of a eukaryotic cell showing the cell nucleus, DNA (genome), RNA (transcriptome), and proteins (proteome)]

The transcriptome is all RNA molecules transcribed from DNA.

Page 4: TOX680 Unveiling the Transcriptome using RNA-seq

Dynamics of the Transcriptome

• Cells with the same genome may produce a different transcriptome … how?
• Two main mechanisms
  (1) differential gene expression
  (2) differential gene transcription

[Figure: two diagrams, one with DNA, mRNA transcripts, and proteins; the other with DNA, pre-mRNA, mRNA, and proteins]

Page 5: TOX680 Unveiling the Transcriptome using RNA-seq

Alternate transcription

• multiple mRNA transcript “isoforms” within one gene
  – proteins with different functions may be produced
  – e.g. skipped exon in CYT-2 isoform of ERBB4 leads to increased cell proliferation

[Figure: CYT-2 deletes 16 amino acids (WW domain binding motif). Source: Muraoka-Cook et al. (2009) Mol Cell Biol]

Page 6: TOX680 Unveiling the Transcriptome using RNA-seq

Forms of alternative splicing

[Figure: forms of alternative splicing. Source: Castle et al. (2008) Nature Genetics]

Gene VEGFA combines multiple alternative splicing forms (not independently!) …

Page 7: TOX680 Unveiling the Transcriptome using RNA-seq

How to measure the transcriptome?

• Ideally, given a sample of RNA
  – which transcripts are present?
  – how much of each?
• Given two samples of RNA
  – which transcripts are differentially expressed?

Page 8: TOX680 Unveiling the Transcriptome using RNA-seq

Microarrays

• Most common technique for measuring the transcriptome
  – hybridized probes detect the presence and abundance of specific known transcripts
    • difficult to observe different transcript isoforms
    • abundance has limited dynamic range

Page 9: TOX680 Unveiling the Transcriptome using RNA-seq

Differential gene expression

• Identify transcriptome differences between two samples

Page 10: TOX680 Unveiling the Transcriptome using RNA-seq

Outline

• What is the transcriptome?
• Measuring the transcriptome
• Sampling the transcriptome using short reads
• Alignment of reads to a reference genome
• Splice graph representation of RNA-seq data
• Reconstructing the transcriptome
• Differential analysis of the transcriptome

Page 11: TOX680 Unveiling the Transcriptome using RNA-seq

The RNA-seq protocol

[Figure source: Nature Reviews | Genetics]

• Protocol
  – mRNA is reverse transcribed to cDNA
  – cDNA is randomly fragmented
  – adapters are added to the fragments
  – fragments are sequenced using high-throughput (HT) sequencing technology
    • e.g. Illumina: up to a billion 100 bp reads sequenced in a single run
• Each sequence is a randomly sampled fragment of the transcriptome
  – identity determined by alignment to a transcript library or to a reference genome
  – the number of alignments to a genomic locus is a measure of abundance
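The last point, that the number of alignments to a locus measures abundance, can be sketched in a few lines. This is only an illustration: the locus names, coordinates, and read positions below are made up, and normalizing by locus length (reads per kilobase) is an assumption added here so that loci of different sizes are comparable.

```python
from bisect import bisect_left

def count_reads_per_locus(read_starts, loci):
    """Count alignments whose start position falls inside each locus.

    read_starts: 0-based alignment start positions (hypothetical input).
    loci: dict mapping locus name -> (start, end), half-open interval.
    """
    starts = sorted(read_starts)
    counts = {}
    for name, (lo, hi) in loci.items():
        # Number of sorted starts in [lo, hi)
        counts[name] = bisect_left(starts, hi) - bisect_left(starts, lo)
    return counts

def reads_per_kilobase(counts, loci):
    """Normalize raw counts by locus length in kilobases."""
    return {
        name: counts[name] / ((loci[name][1] - loci[name][0]) / 1000.0)
        for name in counts
    }

# Toy data: two loci of different length with the same raw count.
loci = {"geneA": (0, 2000), "geneB": (5000, 6000)}
read_starts = [10, 150, 1999, 2000, 5000, 5500, 5999, 7000]

counts = count_reads_per_locus(read_starts, loci)
rpk = reads_per_kilobase(counts, loci)
print(counts, rpk)
```

Both toy loci receive three reads, but the shorter one has twice the length-normalized signal, which is why raw counts alone are not compared across loci.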

Page 12: TOX680 Unveiling the Transcriptome using RNA-seq

RNA-seq view of transcriptome

• Issues
  – non-random fragmentation
  – sequencing bias
  – DNA or pre-mRNA contamination
• Spliced alignments
  – not a problem if aligning to a transcript library
  – challenging if aligning to the genome

Page 13: TOX680 Unveiling the Transcriptome using RNA-seq

Outline

• What is the transcriptome?
• Measuring the transcriptome
• Sampling the transcriptome using short reads
• Alignment of reads to a reference genome
• Splice graph representation of RNA-seq data
• Reconstructing the transcriptome
• Differential analysis of the transcriptome

Page 14: TOX680 Unveiling the Transcriptome using RNA-seq

Spliced alignment strategies

• Annotation based discovery
  – contiguous alignment of reads to existing EST/cDNA sequences with known splice junctions
  – contiguous alignment of reads to paired exons from database of known or suspected junctions (Mortazavi et al. 2008, Wang et al. 2008)
• Ab initio discovery by alignment to reference genome
  – QPalma (Bona et al. 2008)
    • supervised splice site prediction and gapped alignment algorithm for aligning spliced reads
  – TopHat (Trapnell et al. 2009)
    • detect potential junctions based on structural features of introns, e.g. GT-AG dinucleotide sequences flanking the exons
    • test alignment of reads to candidate exon pairs
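The idea of proposing junctions from GT-AG dinucleotides and then testing reads against the candidate exon pairs can be illustrated with a toy example. This is a minimal sketch, not TopHat's implementation: the function names, the length limits, the exact-match test, and the 26 bp genome are all assumptions made for illustration.

```python
def candidate_introns(genome, min_len=4, max_len=50):
    """List (donor, acceptor_end) pairs where the span starts with GT and ends with AG."""
    donors = [i for i in range(len(genome) - 1) if genome[i:i + 2] == "GT"]
    # Store the position just past each AG, so the intron is genome[donor:acceptor_end].
    acceptor_ends = [i + 2 for i in range(len(genome) - 1) if genome[i:i + 2] == "AG"]
    return [
        (d, a)
        for d in donors
        for a in acceptor_ends
        if min_len <= a - d <= max_len
    ]

def spliced_match(read, genome, donor, acceptor_end):
    """True if the read matches exactly when the candidate intron is skipped."""
    for split in range(1, len(read)):
        left, right = read[:split], read[split:]
        if donor - split < 0:
            break
        if (genome[donor - split:donor] == left
                and genome[acceptor_end:acceptor_end + len(right)] == right):
            return True
    return False

# Toy genome: exon 1, a GT...AG intron, exon 2 (exon 2 also contains an AG).
exon1, intron, exon2 = "ACCTGACC", "GTAAATTTAG", "CTTCAGGA"
genome = exon1 + intron + exon2
read = exon1[-4:] + exon2[:4]  # spans the true junction

candidates = candidate_introns(genome)
supported = [c for c in candidates if spliced_match(read, genome, *c)]
print(candidates, supported)
```

The AG inside exon 2 produces a second, spurious candidate; only the true intron is supported by the read, which is the role of the alignment test.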

Page 15: TOX680 Unveiling the Transcriptome using RNA-seq

Improved splice detection

• Issues
  – Cannot easily find non-canonical splices or long-range splices
  – Single long reads may include multiple splice junctions
  – Spurious alignment is a serious problem
• MapSplice: a second generation ab initio method
  – alignment of reads
    • does not depend on any structural features
    • finds multiple candidate alignments
  – splice inference
    • leverages the quality and diversity of read alignments to disambiguate true junctions from spurious junctions
  – efficient and scalable

Page 16: TOX680 Unveiling the Transcriptome using RNA-seq

Finding spliced alignments

[Figure: mRNA tag T split into segments t1, t2, t3, t4 and aligned to the genome across exon 1, exon 2, and exon 3; labels k, k, h, j1, j2]

• Example: a 100 bp tag T is split into 25 bp segments
  – segments are tested for (approximate) alignment to the genome
  – unaligned segments implicate splices
  – find splices by searching from neighboring aligned segments
• Theorem: if no exon is shorter than 2k, then at least one segment must align in every pair of consecutive length-k segments.
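The theorem can be checked on a toy example. The sketch below is an illustration under simplifying assumptions: exact substring search stands in for approximate alignment, the sequences are invented, and every exon has length exactly 2k with k = 4.

```python
def segment_read(read, k):
    """Split a read into consecutive, non-overlapping length-k segments."""
    return [read[i:i + k] for i in range(0, len(read) - k + 1, k)]

def aligned_flags(read, genome, k):
    """For each segment, report whether it occurs contiguously in the genome."""
    return [segment in genome for segment in segment_read(read, k)]

k = 4
exons = ["ACGTACCA", "CCGGATTC", "TGCATGCA"]   # each exon is 2k long
introns = ["GTTTTTAG", "GTAAAAAG"]
genome = exons[0] + introns[0] + exons[1] + introns[1] + exons[2]
mrna = "".join(exons)

# A 16 bp read that crosses both splice junctions.
read = mrna[2:18]
flags = aligned_flags(read, genome, k)

# Segments that straddle a junction fail to align, but never two in a row.
no_two_consecutive_misses = all(a or b for a, b in zip(flags, flags[1:]))
print(segment_read(read, k), flags, no_two_consecutive_misses)
```

The two unaligned segments are exactly the ones that straddle a junction, and each has an aligned neighbor from which the splice search can start.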

Page 17: TOX680 Unveiling the Transcriptome using RNA-seq

MapSplice algorithm (1)

INPUTS: set of RNA-Seq reads (T1, T2, …, Ti); reference genome

(1) Segmentation of reads
(2) Segment exonic alignment
(3) Segment spliced alignment

[Figure: each read Ti is split into segments t1, t2, …, tj, …, tn. Cases illustrated along the 5’ to 3’ genome: contiguous (tj, tj+1); missed alignment, double anchored (tj, ? tj+1, tj+2); missed alignment, single anchored (tj, ? tj+1, s(j+1))]

Page 18: TOX680 Unveiling the Transcriptome using RNA-seq

MapSplice algorithm (2)

(4) Segment assembly
(5) Junction inference: high confidence vs. low confidence
    1. Alignment quality
    2. Anchor significance
    3. Entropy
(6) Identify best alignment for tags

OUTPUTS: splices and splice coverage; read alignments

[Figure: segments t1, t2, …, tj, tj+1, …, tn-1, tn of read Ti assembled along the 5’ to 3’ genome; candidate alignments Ti, Ti2, Ti3, Ti4]
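To make the entropy criterion in junction inference concrete, here is a small sketch. The slides do not define the measure, so this assumes one plausible reading: the Shannon entropy of where the splice falls within each supporting read. A junction supported by reads with many different splice offsets scores high; one supported only by identical, stacked reads scores zero. The function name and the offsets are hypothetical.

```python
import math
from collections import Counter

def junction_entropy(splice_offsets):
    """Shannon entropy (bits) of the splice position within supporting reads."""
    counts = Counter(splice_offsets)
    total = len(splice_offsets)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Eight supporting reads each: diverse offsets vs. identical offsets.
diverse = [10, 25, 40, 55, 70, 85, 30, 60]
stacked = [12] * 8

print(junction_entropy(diverse), junction_entropy(stacked))
```

Both junctions have the same read count, so coverage alone cannot separate them; the diversity of the alignments does.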

Page 19: TOX680 Unveiling the Transcriptome using RNA-seq

Validating the algorithm

• How can we tell if it is working well?
  – comparison against transcriptome library alignment
  – but how do we know that novel alignments are valid?
    • run on synthetic transcriptome for which we know ground truth!

[Figure: MapSplice (MPS) vs. BWA alignments: identically aligned 80.4%; aligned by both 81.4%; BWA aligned only 1.2%; MapSplice aligned only 5.0% / 6.8%; unaligned 10.2%]

Page 20: TOX680 Unveiling the Transcriptome using RNA-seq

Synthetic Transcriptome

1. Sample each gene’s ABUNDANCE from Wang et al. (2008)
2. Choose a DISTRIBUTION across annotated transcript isoforms in RefSeq
3. Randomly pick the START position for each read (& introduce errors)
4. Align reads with MapSplice and analyze performance.
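Steps 1 to 3 can be sketched as a small read simulator. This is an illustrative sketch only: the gene names, sequences, abundances, and isoform fractions are invented rather than drawn from Wang et al. (2008) or RefSeq, and weighting each isoform by its number of possible start positions is an assumption about how reads are drawn.

```python
import random

BASES = "ACGT"

def simulate_reads(genes, n_reads, read_len, error_rate, seed=0):
    """Sample reads from isoforms; returns ((gene, isoform), start, sequence) tuples.

    genes: {gene: {"abundance": float, "isoforms": {name: (sequence, fraction)}}}
    """
    rng = random.Random(seed)
    labels, seqs, weights = [], [], []
    for gene, info in genes.items():
        for iso, (seq, fraction) in info["isoforms"].items():
            n_starts = len(seq) - read_len + 1
            if n_starts <= 0:
                continue  # isoform too short to yield a full-length read
            labels.append((gene, iso))
            seqs.append(seq)
            # Steps 1-2: gene abundance times isoform fraction (times start sites).
            weights.append(info["abundance"] * fraction * n_starts)
    reads = []
    for _ in range(n_reads):
        idx = rng.choices(range(len(seqs)), weights=weights, k=1)[0]
        # Step 3: uniform random start position, then substitution errors.
        start = rng.randrange(len(seqs[idx]) - read_len + 1)
        bases = list(seqs[idx][start:start + read_len])
        for i, base in enumerate(bases):
            if rng.random() < error_rate:
                bases[i] = rng.choice([b for b in BASES if b != base])
        reads.append((labels[idx], start, "".join(bases)))
    return reads

# Hypothetical toy transcriptome.
genes = {
    "geneA": {"abundance": 10.0, "isoforms": {
        "A.1": ("ACGTACGTTAGCCGATAGGCTTAACG", 0.7),
        "A.2": ("ACGTACGTTAGGCTTAACG", 0.3)}},
    "geneB": {"abundance": 2.0, "isoforms": {
        "B.1": ("TTGACCGGTAACCGGTTAAGC", 1.0)}},
}
reads = simulate_reads(genes, n_reads=50, read_len=10, error_rate=0.0, seed=1)
print(reads[:3])
```

Because every read carries its true isoform and start position, an aligner's output can be scored directly against the ground truth, which is the point of step 4.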

Page 21: TOX680 Unveiling the Transcriptome using RNA-seq

MapSplice performance

Page 22: TOX680 Unveiling the Transcriptome using RNA-seq

Improved accuracy from multiple criteria in junction classification

Page 23: TOX680 Unveiling the Transcriptome using RNA-seq

Outline

• What is the transcriptome?
• Measuring the transcriptome
• Sampling the transcriptome using short reads
• Alignment of reads to a reference genome
• Splice graph representation of RNA-seq data
• Reconstructing the transcriptome
• Differential analysis of the transcriptome

Page 24: TOX680 Unveiling the Transcriptome using RNA-seq

• Transcriptome changes in response to time, disease, etc.
• Characteristics of a transcriptome
  • Qualitatively, which transcripts are expressed
  • Quantitatively, what are their expression levels

[Figure: transcript α (exons 1, 2, 3, 4) produces Protein α; transcript β (exons 1, 3, 4) produces Protein β; labels: Splicing Ratio, Transcript Abundance, Protein Expression]

Page 25: TOX680 Unveiling the Transcriptome using RNA-seq

• Transcriptome changes in response to time, disease, etc.
• Differential Splicing: alternative splicing events that exhibit significantly different splicing ratios between different samples

[Figure: Differential Splicing between Normal and Tumor samples; in each sample, transcript α (exons 1, 2, 3, 4) produces Protein α and transcript β (exons 1, 3, 4) produces Protein β; labels: Splicing Ratio, Transcript Abundance, Protein Expression]
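A splicing ratio is simply each transcript's share of the gene's total abundance. The sketch below uses invented abundances for transcripts α and β in a normal and a tumor sample, and it only computes the change in ratio; deciding whether a change is statistically significant is not attempted here.

```python
def splicing_ratios(abundances):
    """Share of the gene's total abundance carried by each transcript."""
    total = sum(abundances.values())
    return {name: value / total for name, value in abundances.items()}

# Hypothetical transcript abundances for one gene.
normal = {"alpha": 30.0, "beta": 10.0}
tumor = {"alpha": 10.0, "beta": 30.0}

ratio_normal = splicing_ratios(normal)
ratio_tumor = splicing_ratios(tumor)
change = {name: ratio_tumor[name] - ratio_normal[name] for name in normal}
print(ratio_normal, ratio_tumor, change)
```

In this toy case the gene's total abundance is 40 in both samples, so a gene-level expression comparison would see no difference even though the isoform mix has flipped.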

Page 26: TOX680 Unveiling the Transcriptome using RNA-seq

• Differential Splicing: why important?
  • Understanding of cell differentiation and development
  • Identification of disease biomarkers

[Figure: Differential Splicing between Normal and Tumor samples; in each sample, transcript α (exons 1, 2, 3, 4) produces Protein α and transcript β (exons 1, 3, 4) produces Protein β; labels: Splicing Ratio, Transcript Abundance, Protein Expression]

Page 27: TOX680 Unveiling the Transcriptome using RNA-seq

DiffSplice – Unified Graph Representation

Unify structural information (exons and junctions) from all samples

[Figure: RNA-seq read alignment to the reference genome (5’ to 3’); observed read coverage for samples A1, A2 (Group A) and B1, B2 (Group B); splice structure with exons E1 to E5 and junctions J1 to J5]

Page 28: TOX680 Unveiling the Transcriptome using RNA-seq

DiffSplice – Unified Graph Representation

Unified Expression-weighted Splice Graph (ESG)

Weighted DAG (Directed Acyclic Graph)
• Vertex – Exonic segment
• Edge – Splice junction
• Weight – Expression level

Differentiate samples by the weights

[Figure: splice structure (E1 to E5, J1 to J5) and the corresponding ESG from TS through E1 to E5 to TE, for samples A1, A2 (Group A) and B1, B2 (Group B); example weights across the four samples: E1: 94.9, 56.1, 83.7, 62.2; J1: 91, 57, 84, 64; E2: 95.2, 55.7, 88.1, 65.6]

Page 29: TOX680 Unveiling the Transcriptome using RNA-seq

DiffSplice – Alternative Splicing Modules (ASMs)

[Figure: the ESG (TS, E1 to E5, TE; J1 to J5) decomposed into ASMs. ASM1: E1, E2, E3 with J1, J2, J3; source E1 (immed. pre-dominator), sink E3 (immed. post-dominator). ASM2: E3, E4, E5, TE with J4, J5; source E3 (immed. pre-dominator), sink TE (immed. post-dominator)]
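The dominator labels on this slide suggest how a module can be located in the splice graph: a vertex where paths diverge is the source, and its immediate post-dominator is the sink where all of those paths rejoin. The sketch below is a minimal illustration of that idea, not the DiffSplice implementation. The edge list is an assumption reconstructed from the figure labels (in particular, which vertices J4 and J5 connect is a guess), and the function names are hypothetical.

```python
from graphlib import TopologicalSorter

def reverse(succ):
    """Reverse every edge of a graph given as {vertex: [successors]}."""
    rev = {v: [] for v in succ}
    for u, targets in succ.items():
        for v in targets:
            rev[v].append(u)
    return rev

def dominator_sets(succ, root):
    """Dominator set of every vertex of a DAG in which all vertices are reachable from root."""
    preds = reverse(succ)
    dom = {}
    # static_order() yields each vertex after all of its predecessors.
    for v in TopologicalSorter(preds).static_order():
        if v == root:
            dom[v] = {v}
        else:
            dom[v] = set.intersection(*(dom[p] for p in preds[v])) | {v}
    return dom

def immediate(dom, v):
    """Closest strict dominator of v: the one with the largest dominator set."""
    strict = dom[v] - {v}
    return max(strict, key=lambda d: len(dom[d])) if strict else None

def find_asms(succ, sink):
    """(source, sink) pairs: each branching vertex with its immediate post-dominator."""
    post_dom = dominator_sets(reverse(succ), sink)
    return [(u, immediate(post_dom, u)) for u in succ if len(succ[u]) > 1]

# Assumed ESG: E2 is a skippable exon; E4 and E5 are alternative ends.
esg = {
    "TS": ["E1"],
    "E1": ["E2", "E3"],
    "E2": ["E3"],
    "E3": ["E4", "E5"],
    "E4": ["TE"],
    "E5": ["TE"],
    "TE": [],
}
asms = find_asms(esg, "TE")
pre_dom = dominator_sets(esg, "TS")
print(asms, immediate(pre_dom, "E3"), immediate(pre_dom, "TE"))
```

On this assumed graph the two modules come out as (E1, E3) and (E3, TE), matching the ASM1 and ASM2 boundaries named in the figure, and the immediate pre-dominators of the two sinks are the corresponding sources.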

Page 30: TOX680 Unveiling the Transcriptome using RNA-seq

DiffSplice – Alternative Splicing Modules (ASMs)

[Figure: the ESG (TS, E1 to E5, TE; J1 to J5) and its ASMs, with hierarchy labels Level 0 and Level 1. ASM1 (source to sink over E1, E2, E3 with J1, J2, J3) contains path 1 and path 2; ASM2 (source to sink over E3, E4, E5, TE with J4, J5) contains path 1 and path 2]

Page 31: TOX680 Unveiling the Transcriptome using RNA-seq

DiffSplice – Isoform Abundance Estimation

[Figure: ASM1 in sample A1 (E1, E2, E3; J1, J2, J3) with path 1 and path 2. Observed expression: 91, 92.1, 93, 94.93, 95.2. Estimated expression of path 1 and path 2: ? (?%) and ? (?%)]

[Figure: model linking N, q to path expressions T1, T2 (Poisson dist’n) and to observed weights w(E1), w(E2), w(E3), w(J1), w(J2), w(J3) (Normal dist’n)]

Page 32: TOX680 Unveiling the Transcriptome using RNA-seq

DiffSplice – Isoform Abundance Estimation

\hat{q} = \arg\max_{q} \prod_{s \in E \cup J} P(w(s) \mid q)
        = \arg\max_{q} \prod_{t \in T} \prod_{s \in E \cup J} f_{Normal}(w(s) \mid T_t) \, f_{Poisson}(T_t \mid N, q_t)

[Figure: ASM1 in sample A1 (E1, E2, E3; J1, J2, J3). Observed expression: 91, 92.1, 93, 94.93, 95.2. Estimated expression: path 1 = 92.0 (96.7%), path 2 = 3.1 (3.3%); alternative path proportion 96.7% / 3.3%; estimated expression of ASM1 = 95.1]
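The estimation step can be illustrated with a simplified stand-in. Instead of the Poisson and Normal model named on the slide, the sketch below uses ordinary least squares: each exon or junction weight is modeled as the sum of the expression of the paths that pass through it, and the two path expressions are solved from the 2×2 normal equations. The path definitions (path 1 includes E2, path 2 skips it) and the noise-free weights are assumptions chosen for illustration, not the values from sample A1.

```python
def estimate_path_expression(weights, path1, path2):
    """Least-squares expression of two paths from exon/junction weights.

    weights: {feature: observed expression}
    path1, path2: sets of features (exons and junctions) each path traverses.
    """
    # Normal equations (A^T A) t = A^T w for a 0/1 feature-by-path matrix A.
    a11, a22, a12 = len(path1), len(path2), len(path1 & path2)
    b1 = sum(weights[s] for s in path1)
    b2 = sum(weights[s] for s in path2)
    det = a11 * a22 - a12 * a12
    t1 = (b1 * a22 - a12 * b2) / det
    t2 = (a11 * b2 - a12 * b1) / det
    return t1, t2

# Assumed structure of a skipped-exon module.
path1 = {"E1", "J1", "E2", "J2", "E3"}   # includes E2
path2 = {"E1", "J3", "E3"}               # skips E2

# Hypothetical noise-free weights generated from path expressions 92 and 3.
weights = {"E1": 95.0, "E2": 92.0, "E3": 95.0, "J1": 92.0, "J2": 92.0, "J3": 3.0}

t1, t2 = estimate_path_expression(weights, path1, path2)
proportions = (t1 / (t1 + t2), t2 / (t1 + t2))
print(t1, t2, proportions)
```

With noise-free input the solver recovers the generating values exactly; with real coverage the shared exons E1 and E3 and the path-specific features pull the estimate toward a compromise, which is the situation the likelihood model on the slide is designed to handle.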