RNA-Seq Francesco Favero 27626: Next Generation Sequencing Analysis CBS - DTU

What is RNA-seq?

• RNA-seq is the high-throughput sequencing of cDNA

• It is used to measure RNA expression

• It is the NGS equivalent of gene-expression microarrays

RNA-seq applications

• Discovery

• new transcripts

• transcript boundaries

• splice junctions

• Comparison (between different samples)

• evaluate gene expression

• evaluate differences in splice patterns and isoform abundance

RNA families

RNA

• Coding: polyA mRNA, non-polyA mRNA

• Non-coding, structural, DNA associated: replisome, DNA repair, telomeric, DNA methylation (piRNA)

• Non-coding, structural, RNA associated: ribosome associated (rRNA)

• Non-coding, regulatory: micro RNA, TSS associated, anti-sense, enhancer RNA

RNA-seq and poly-A

• RNA-seq preparation protocols usually include poly-A selection.

• this is an attempt to remove rRNA

• mRNA is not the only RNA that is poly-adenylated

• Always check which library preparation protocol was used (other approaches are possible, e.g. an rRNA depletion kit)

RNA-seq vs microarray

• Microarray:

• Pro:

• Cost, well-established methods, small data

• Cons:

• Hybridization bias, sequence must be known

• RNA-seq:

• Pro:

• Reproducible (technical replicates are rarely needed; biological replicates still are), real transcriptome

• Information rich (not limited to expression)

• Cons:

• Complexity (many steps are needed to get actual results)

• Data size and computational power

RNA-seq vs microarray

Marioni J C et al. Genome Res. 2008;18:1509-1517

Differentially expressed genes called by microarray and RNA-seq

Alignment methods

• Two different approaches are possible:

• Align against the transcriptome

• faster, easier

• Align against the whole genome

• gives the complete information

Alignment tools

• Common NGS alignment programs:

• BWA

• Bowtie (Bowtie2)

• Novoalign

• Tools that take splice junctions into account:

• TopHat/Cufflinks

Transcriptome assembly

Alternative splicing

Alternative splicing is a normal biological phenomenon. One gene can encode different proteins by changing the combination of exons retained in the mature transcript.

De novo Assembly

• Transcriptomic content is more variable than genomic DNA content

• isoforms, alternative splicing

• gene fusions

• Mapping reads to a reference genome cannot always cope with such structural alterations.

• Solution: de novo transcriptome assembly

De novo Assembly

• Underlying assumptions related to RNA expression

• sequence coverage is similar across reads of the same transcript

• strand specific (sense and antisense transcripts)

• Assemblers:

• Velvet (genomic and transcriptomic)

• Trinity (transcriptomic)

• Cufflinks (transcriptomic; reassembles pre-aligned reads to find alternative splicing based on differential expression)

RNA-seq and “reads”

• Reads, counts, call them as you wish. The number of reads in a region reflects its expression level.

• Different ways to count the reads:

• each read = 1 count

• FPKM (fragments per kilobase of exon per million mapped fragments)
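The FPKM definition can be written as a short calculation. This is a minimal Python sketch, assuming every fragment has already been assigned to exactly one transcript (tools such as Cufflinks estimate that assignment with a statistical model, which this sketch does not attempt). The function name and the example numbers are hypothetical.

```python
def fpkm(fragments, length_bp, total_mapped_fragments):
    """Fragments per kilobase of exon per million mapped fragments."""
    if length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and library size must be positive")
    kilobases = length_bp / 1_000
    millions = total_mapped_fragments / 1_000_000
    return fragments / (kilobases * millions)


# Hypothetical numbers: 500 fragments on a 2,000 bp transcript,
# in a library of 10 million mapped fragments.
example = fpkm(500, 2_000, 10_000_000)
print(example)  # 25.0
```

Dividing by length makes long and short transcripts comparable within a sample; dividing by library size makes samples of different depth comparable.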

FPKM

• FPKM normalizes for transcript length and library size. As estimated by Cufflinks, it also deals with the fact that many reads map to several transcripts: each such read influences the FPKM values of all these transcripts, but does not increase each of their counts by one.

• FPKM is calculated by software like Cufflinks (http://cufflinks.cbcb.umd.edu/)

• It is not possible to convert FPKM back to read counts

• transcript length times FPKM != reads, because a read can match more than one transcript

• Useful to compare the abundance of different transcripts within the same sample

• Might be able to detect alternative splicing

reads-count

• When counting reads, we need to be sure that the alignment is unambiguous.

• Software like HTSeq counts reads, discarding ambiguous or non-unique matches.
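The idea of counting only unambiguous reads can be illustrated with a small Python sketch. It is not HTSeq's implementation or API: the gene table, the alignment tuples and the category labels ("ambiguous", "no_feature", "not_unique") are all hypothetical and chosen only to show the logic.

```python
from collections import Counter

# Hypothetical annotation: gene -> (chromosome, start, end), 1-based inclusive.
GENES = {
    "geneA": ("chr1", 100, 500),
    "geneB": ("chr1", 450, 900),
    "geneC": ("chr2", 100, 300),
}


def overlapping_genes(chrom, start, end, genes=GENES):
    """Return the names of all genes that the aligned read overlaps."""
    return [
        name
        for name, (g_chrom, g_start, g_end) in genes.items()
        if g_chrom == chrom and start <= g_end and end >= g_start
    ]


def count_reads(alignments, genes=GENES):
    """Count reads per gene, keeping only unique, unambiguous alignments.

    Each alignment is (chrom, start, end, n_hits), where n_hits is the
    number of genomic locations the read aligned to.
    """
    counts = Counter()
    for chrom, start, end, n_hits in alignments:
        if n_hits > 1:
            counts["not_unique"] += 1  # read maps to several places
            continue
        hits = overlapping_genes(chrom, start, end, genes)
        if len(hits) == 1:
            counts[hits[0]] += 1       # exactly one gene: count it
        elif not hits:
            counts["no_feature"] += 1  # outside every annotated gene
        else:
            counts["ambiguous"] += 1   # overlaps more than one gene
    return counts


reads = [
    ("chr1", 120, 170, 1),  # inside geneA only
    ("chr1", 460, 510, 1),  # overlaps geneA and geneB
    ("chr2", 150, 200, 3),  # aligned to three locations
    ("chr2", 150, 200, 1),  # inside geneC only
    ("chr2", 900, 950, 1),  # no annotated gene here
]
read_counts = count_reads(reads)
```

Only two of the five example reads end up in a gene count; the others are tallied separately so that the amount of discarded data stays visible.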

HTSeq

Differential expression

• Reads obtained from different samples

• Compare reads for the same transcript/gene in the various samples

• Challenges:

• Annotation

• Statistics

Challenges

• Annotation

• Alignment to the transcriptome (transcript_id).

• Alignment to the reference genome

• use a GTF file to map the desired feature types onto the genome (HTSeq does this)

• Statistics

• R/Bioconductor (edgeR, DESeq, and more)
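For the annotation step, a GTF file is a tab-separated text file with nine columns (sequence name, source, feature type, start, end, score, strand, frame, attributes). The Python sketch below shows one way to group features by gene; the function names and the example lines are made up for illustration, and the attribute parsing assumes no semicolon appears inside a quoted value.

```python
def parse_gtf_line(line):
    """Parse one GTF record into a dict, or return None for comments and blanks."""
    line = line.rstrip("\n")
    if not line.strip() or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated fields, got {len(fields)}")
    seqname, _source, feature, start, end, _score, strand, _frame, attr_text = fields
    attributes = {}
    # Attributes look like: gene_id "X"; transcript_id "Y";
    for item in attr_text.split(";"):
        key, _, value = item.strip().partition(" ")
        if key:
            attributes[key] = value.strip().strip('"')
    return {
        "seqname": seqname,
        "feature": feature,
        "start": int(start),  # GTF coordinates are 1-based, inclusive
        "end": int(end),
        "strand": strand,
        "attributes": attributes,
    }


def features_by_gene(lines, feature_type="exon"):
    """Group features of one type by their gene_id attribute."""
    by_gene = {}
    for line in lines:
        record = parse_gtf_line(line)
        if record is None or record["feature"] != feature_type:
            continue
        gene_id = record["attributes"].get("gene_id")
        if gene_id is None:
            continue
        by_gene.setdefault(gene_id, []).append(
            (record["seqname"], record["start"], record["end"], record["strand"])
        )
    return by_gene


# Made-up annotation lines, for illustration only.
example_gtf = [
    "# hypothetical annotation",
    'chr1\texample\texon\t100\t200\t.\t+\t.\tgene_id "geneA"; transcript_id "geneA.1";',
    'chr1\texample\texon\t300\t500\t.\t+\t.\tgene_id "geneA"; transcript_id "geneA.1";',
    'chr1\texample\tCDS\t120\t200\t.\t+\t0\tgene_id "geneA"; transcript_id "geneA.1";',
    'chr2\texample\texon\t100\t300\t.\t-\t.\tgene_id "geneC"; transcript_id "geneC.1";',
]
exons = features_by_gene(example_gtf)
```

Choosing the feature type (here "exon") and the grouping attribute (here gene_id) decides what a "count" will refer to downstream.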

Statistics

• A series of observations can be associated with a distribution function.

• For count data, the most appropriate model is generally a binomial or a Poisson distribution.

• The advantage of using a distribution is that, given a few parameters (e.g. size and mean), we can describe the whole data set

Statistics

• When RNA-seq data are fitted with a pure Poisson distribution, the observed variance is higher than expected (overdispersion)

• A negative binomial distribution is similar to a Poisson distribution, but with a higher variance.

• The negative binomial is implemented in several R packages; it is a better fit for count models such as RNA-seq
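Overdispersion can be made concrete with a few lines of Python. Under a Poisson model the variance equals the mean; in the common negative binomial parameterization the variance is mean + dispersion × mean², so a dispersion of 0 gives back the Poisson case. The sketch below uses a simple method-of-moments estimate, which is an assumption made for illustration and not the estimator used by edgeR or DESeq; the replicate counts are hypothetical.

```python
from statistics import mean, variance


def nb_variance(mu, dispersion):
    """Variance of a negative binomial with mean mu and the given dispersion."""
    return mu + dispersion * mu ** 2


def moment_dispersion(counts):
    """Method-of-moments dispersion: (s^2 - mean) / mean^2, floored at 0."""
    m = mean(counts)
    if m == 0:
        return 0.0
    s2 = variance(counts)  # sample variance
    return max(0.0, (s2 - m) / m ** 2)


# Hypothetical counts for one gene across five replicates.
replicate_counts = [80, 95, 120, 150, 55]

observed_mean = mean(replicate_counts)
observed_var = variance(replicate_counts)
dispersion = moment_dispersion(replicate_counts)

print("mean:", observed_mean)
print("variance expected under Poisson:", nb_variance(observed_mean, 0.0))
print("observed variance:", observed_var)
print("estimated dispersion:", dispersion)
```

In this example the observed variance is far above the mean, which is the pattern a pure Poisson model cannot represent.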

Statistics

http://www.ats.ucla.edu/stat/stata/seminars/count_presentation/count.htm

Statistics

Poisson and negative binomial parameters

http://www.ats.ucla.edu/stat/stata/seminars/count_presentation/count.htm

Additional negative binomial parameter: overdispersion. When overdispersion = 0, negative binomial = Poisson.

Statistics

• Normalization:

• Different samples have different total numbers of reads (library size)

• Normalization for library size (each package implements a different method)
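The simplest form of library-size normalization is to rescale each sample to counts per million (CPM). The Python sketch below shows only this plain total-count scaling; packages such as edgeR additionally compute per-sample normalization factors, which are not reproduced here. The function name and the counts are hypothetical.

```python
def counts_per_million(counts_by_sample):
    """Scale raw counts to counts per million within each sample.

    counts_by_sample: {sample: {gene: raw_count}}
    returns:          {sample: {gene: cpm}}
    """
    cpm = {}
    for sample, gene_counts in counts_by_sample.items():
        library_size = sum(gene_counts.values())
        if library_size == 0:
            raise ValueError(f"sample {sample!r} has no reads")
        cpm[sample] = {
            gene: count * 1_000_000 / library_size
            for gene, count in gene_counts.items()
        }
    return cpm


# Hypothetical raw counts: the second sample was sequenced four times deeper.
raw = {
    "control": {"geneA": 100, "geneB": 900},
    "treated": {"geneA": 400, "geneB": 3600},
}
normalized = counts_per_million(raw)
```

After scaling, geneA has the same value in both samples: the fourfold difference in raw counts was due to sequencing depth only.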

edgeR

• From the authors of limma (linear models for microarray data)

• Negative binomial distribution

• Normalizes for library size (normalization factors)

edgeR

edgeR

Ensembl gene ID

edgeR

Log fold change for gene $i$: $\log(r_{i,\mathrm{DHT}}) - \log(r_{i,\mathrm{Control}})$

edgeR

Log counts per million (logCPM), convertible to RPKM/FPKM
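The two quantities in these columns can be sketched in Python. This is an illustration of the arithmetic only, not edgeR's computation: the base-2 logarithm, the pseudocount used to avoid log(0), the function names and the numbers are all assumptions made here.

```python
import math


def log2_fold_change(treated, control, pseudocount=0.5):
    """log2(treated) - log2(control), with a pseudocount to avoid log(0)."""
    return math.log2(treated + pseudocount) - math.log2(control + pseudocount)


def cpm_to_fpkm(cpm, length_bp):
    """Convert counts per million to FPKM by dividing by length in kilobases."""
    if length_bp <= 0:
        raise ValueError("length must be positive")
    return cpm / (length_bp / 1_000)


# Hypothetical normalized expression of one gene in DHT-treated and control samples.
lfc = log2_fold_change(40.0, 10.0, pseudocount=0.0)
print(round(lfc, 6))  # 2.0, i.e. a fourfold increase

# Hypothetical gene: 50 counts per million on a 2,000 bp transcript.
print(cpm_to_fpkm(50.0, 2_000))  # 25.0
```

A log fold change of 1 means a doubling and -1 a halving, which is why the log scale is convenient for reading such tables.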

edgeR

Statistical scores

edgeR

RNA-seq

Thanks!