RNA-seq from a bioinformatics perspective · 2017-11-07 · RNA-Seq - Stranded . Differential...
RNA-seq from a bioinformatics perspective
Harmen van de Werken, Erasmus MC
Cancer Computational Biology Center (CCBC)
Outlook
RNA-seq software + RNA-seq courses
Alternative splicing & Promoters
Introduction RNAseq data
Differential expression
Read-Through & Fusion Transcripts
SNV / InDels
Novel Transcripts
Table 6-1 Molecular Biology of the Cell (© Garland Science 2008)
Which type of RNA is most abundant? How many different genes do we have per type?
THE HUMAN GENOME
▪ Consensus
~ 22,500 protein-coding genes
~ 9,000 long non-coding RNAs
~ 2,500 – 3,000 small RNAs
▪ miTranscriptome [1]
~ 91,013 genes
~ 58,648 lncRNA genes
[1] Iyer MK et al. Nature Genetics 47, 199–208 (2015)
Transcriptomics of (Cancer) Tissue
Common mRNA-seq Workflow
Dry lab Bioinformatics
Wetlab
(c)DNA Next Generation Sequencing (NGS)
ThermoFisher Ion Torrent
Personal Genome Machine (PGM)
PacBio RS II
Illumina HiSeq 2000
Illumina Sequencing
Ion Torrent Platform
RNAseq Data Analysis
Alternative splicing & Promoters RNAseq data
Differential expression
Read-Through & Fusion Transcripts
SNV / InDels
Novel Transcripts
Detecting Single Nucleotide Variants and small indels
Errors occur at each stage
Primary Analysis
- Incorrect base calling
- Homopolymer errors
- Phasing
Secondary Analysis: Read mapping
- Incorrect reference sequence
- Pseudogenes
- Indels
- Complex variants
Secondary Analysis: Variant calling
Variant calling filters are heuristics. They therefore produce both false negatives and false positives, and are best applied as soft filters that flag a call instead of discarding it.
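The idea of a soft filter can be sketched as follows: each variant keeps its record and only receives flags, so a borderline call can still be reviewed later. This is a minimal illustration, not the behaviour of any particular variant caller. The `Variant` class, the flag names and the thresholds (`min_depth`, `min_vaf`) are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Variant:
    """Hypothetical, minimal variant record used only for this sketch."""
    chrom: str
    pos: int
    ref: str
    alt: str
    depth: int       # total reads covering the position
    alt_reads: int   # reads supporting the alternative allele
    filters: list = field(default_factory=list)


def apply_soft_filters(variant, min_depth=10, min_vaf=0.05):
    """Annotate a variant with filter flags instead of removing it.

    The thresholds are illustrative assumptions, not recommended values.
    """
    flags = []
    if variant.depth < min_depth:
        flags.append("LowDepth")
    # Variant allele frequency; guard against zero coverage.
    vaf = variant.alt_reads / variant.depth if variant.depth else 0.0
    if vaf < min_vaf:
        flags.append("LowVAF")
    variant.filters = flags or ["PASS"]
    return variant


# Made-up calls: every call is kept, only the flags differ.
calls = [
    Variant("chr1", 1000, "A", "T", depth=8, alt_reads=4),
    Variant("chr1", 2000, "G", "C", depth=100, alt_reads=2),
    Variant("chr1", 3000, "C", "T", depth=100, alt_reads=30),
]
for call in calls:
    apply_soft_filters(call)
    print(call.chrom, call.pos, call.filters)
```

Because nothing is deleted, a true variant that fails a heuristic (a false negative of the filter) remains visible in the output.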
Tertiary Analysis
- Incorrect gene annotation
- Contamination in reference databases
False negative: EGFR c.2237_2259delinsCCAACAAGGAA
False negative: BRAF p.V600R
Differential Gene Expression mRNA-seq Workflow
Fig 1. FastQC report: base quality per read position and overrepresented sequences
@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345
GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC
+SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345
IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC
Fig 2. FASTQ format of one read
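As a small sketch of how the four lines of the read above can be interpreted: line 1 is the identifier, line 2 the sequence, line 3 a separator and line 4 one quality character per base. The code assumes the common Phred+33 encoding, which the slide does not state. The function names are illustrative.

```python
def parse_fastq_record(lines):
    """Split one four-line FASTQ record into its parts."""
    header, sequence, separator, quality = (line.strip() for line in lines)
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a FASTQ record")
    return {
        "id": header[1:].split()[0],
        "sequence": sequence,
        "quality": quality,
    }


def phred_scores(quality, offset=33):
    """Convert quality characters to Phred scores (assumes Phred+33)."""
    return [ord(char) - offset for char in quality]


def error_probability(phred):
    """Probability that a base call is wrong: P = 10^(-Q/10)."""
    return 10 ** (-phred / 10)


record = parse_fastq_record([
    "@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345",
    "GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC",
    "+SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345",
    "IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC",
])
scores = phred_scores(record["quality"])
print(record["id"], "lowest quality:", min(scores))
```

Under this encoding the character `I` corresponds to Q40 (an error probability of 1 in 10,000) and `9` to Q24, so the quality of this read drops towards its 3' end.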
mRNA-seq alignment
Courtesy: Wikipedia
mRNA derived cDNA fragments alignment to transcriptome
Alignment to transcriptome
Alignment to reference genome
RNA-Seq - Alignment
Alignment algorithms need:
• Reference sequence
• Transcriptome database (optional)
Algorithms commonly used for RNA-Seq alignment:
• TopHat
• STAR
• HISAT2
Visualization of NGS Transcriptomics and Genomics data
RNA-Seq - Alignment/QC
RNA-Seq - Stranded
Differential expression
Rakesh Kaundal et al.
Normalization of RNA-seq
Total count (TC): Gene counts are divided by the total number of mapped reads.
Upper Quartile (UQ): Very similar in principle to TC; the total counts are replaced by the upper quartile of counts.
Median (Med): Also similar to TC; the total counts are replaced by the median counts.
Trimmed Mean of M-values (TMM): This normalization method is implemented in the edgeR Bioconductor package (Robinson et al., 2010).
Quantile (Q): First proposed in the context of microarray data, this normalization method consists of matching the distributions of gene counts across lanes.
Reads Per Kilobase per Million mapped reads (RPKM): This approach quantifies gene expression from RNA-Seq data by normalizing for the total transcript length and the number of sequencing reads.
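Three of the simpler methods can be written out for a single sample as a rough sketch. The scaling to counts per million in `total_count`, the use of non-zero counts only for the upper quartile, and the example counts and transcript lengths are assumptions made for illustration. TMM and quantile normalization are not shown.

```python
import statistics


def total_count(counts):
    """TC: divide by the total mapped reads (scaled here to counts per million)."""
    total = sum(counts)
    return [c / total * 1e6 for c in counts]


def upper_quartile(counts):
    """UQ: divide by the 75th percentile of the non-zero counts."""
    nonzero = sorted(c for c in counts if c > 0)
    uq = statistics.quantiles(nonzero, n=4, method="inclusive")[2]
    return [c / uq for c in counts]


def rpkm(counts, lengths_bp):
    """RPKM: reads per kilobase of transcript per million mapped reads."""
    per_million = sum(counts) / 1e6
    return [c / (length / 1e3) / per_million
            for c, length in zip(counts, lengths_bp)]


# Made-up sample: three genes with their transcript lengths in base pairs.
counts = [10, 30, 60]
lengths_bp = [1000, 2000, 500]
print(total_count(counts))
print(upper_quartile(counts))
print(rpkm(counts, lengths_bp))
```

TC and UQ correct only for sequencing depth, so they serve comparisons of one gene between samples. RPKM also corrects for transcript length, which matters when comparing different genes within a sample.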
Reduce Dimensions PCA /QC
Principal component analysis (PCA) of a multivariate Gaussian distribution. PCA is a linear method, so it cannot capture complex non-linear (for example polynomial) relationships between features.
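To make the "linear" part concrete, the sketch below finds the first principal component as the dominant eigenvector of the covariance matrix, using power iteration in plain Python. It is an illustration only. In practice a library routine based on SVD would be used, and power iteration can fail if the start vector happens to be orthogonal to the leading component.

```python
def first_principal_component(rows, iterations=200):
    """Return (direction, explained variance fraction, scores) for PC1."""
    n, p = len(rows), len(rows[0])
    # Center every feature (column) on its mean.
    means = [sum(row[j] for row in rows) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in rows]
    # Sample covariance matrix of the features.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
           for i in range(p)]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    v = [1.0] * p
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(p))
                     for i in range(p))
    total_variance = sum(cov[i][i] for i in range(p))
    # Each sample's score is a linear combination of its centered features.
    scores = [sum(c[j] * v[j] for j in range(p)) for c in centered]
    return v, eigenvalue / total_variance, scores


# Made-up data lying exactly on the line y = 2x.
direction, explained, scores = first_principal_component(
    [(1, 2), (2, 4), (3, 6), (4, 8)])
print(direction, explained)
```

For these points one component explains all of the variance. Points on a curve such as y = x² would need more than one linear component, which is the limitation noted above.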
Reduce Dimensions t-SNE
t-distributed Stochastic Neighbor Embedding (t-SNE) is a non-linear algorithm.
Clustering Analysis/ QC
Fig 1: Example of hierarchical clustering. Clusters are successively merged with their nearest neighbouring cluster. The length of the vertical dendrogram lines reflects the distance between the merged clusters. (Jansen et al.)
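The merging procedure described in the caption can be sketched in a few lines. This version uses single linkage (the distance between two clusters is the distance between their closest members) and Euclidean distance. Both are assumptions, since the slide does not say which linkage or metric the figure uses. The points are made up.

```python
from math import dist


def single_linkage(points):
    """Agglomerative clustering; returns the merges as (id_a, id_b, distance).

    Original points get ids 0..n-1 and each merged cluster gets the next
    free id. The merge distance is the height at which the two branches
    join in the dendrogram.
    """
    clusters = {i: [i] for i in range(len(points))}
    merges = []
    next_id = len(points)
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        # Find the pair of clusters with the smallest linkage distance.
        for position, a in enumerate(ids):
            for b in ids[position + 1:]:
                d = min(dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, d))
        next_id += 1
    return merges


# Two tight pairs of samples that lie far apart from each other.
samples = [(0, 0), (0, 1), (5, 5), (5, 6.5)]
for a, b, height in single_linkage(samples):
    print(f"merge {a} + {b} at distance {height:.2f}")
```

The two close pairs merge first at small distances, and the final merge happens at a much larger distance. A dendrogram shows this as two short branches joined by a long one.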
Differential expression DESeq2
A common difficulty in the analysis of read count data is the high variance of log fold change (LFC) estimates for genes with low read counts.
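A short numerical sketch shows why low counts are a problem: a single extra read changes the LFC of a weakly expressed gene a lot, while it barely affects a highly expressed one. The pseudocount is a crude stand-in used here only for illustration. It is not DESeq2's shrinkage estimator.

```python
from math import log2


def log2_fold_change(count_a, count_b, pseudocount=0.0):
    """log2(B / A), optionally stabilised by adding a pseudocount to both."""
    return log2((count_b + pseudocount) / (count_a + pseudocount))


# Both genes show a two-fold increase, so the LFC is 1 in both cases.
low = log2_fold_change(2, 4)
high = log2_fold_change(2000, 4000)

# One additional read in condition A:
low_shifted = log2_fold_change(3, 4)         # drops to about 0.42
high_shifted = log2_fold_change(2001, 4000)  # stays close to 1

# A pseudocount pulls the low-count estimate towards zero.
low_damped = log2_fold_change(2, 4, pseudocount=1.0)

print(low, low_shifted, high, high_shifted, low_damped)
```

Sampling noise of a few reads is expected at every expression level, so estimates for low-count genes are far less reliable even when the point estimate looks identical.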
Differential expression of genes
Test for differential gene expression, with correction for multiple testing
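One widely used correction is the Benjamini-Hochberg procedure, which controls the false discovery rate. The slide does not name the method used, so the choice here is an assumption, and the p-values are made up.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]
print(benjamini_hochberg(raw))
```

With thousands of genes tested at once, a raw threshold of p < 0.05 would by chance alone flag about 5% of the genes that are not differentially expressed, which is why the adjusted values are the ones to threshold.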
Gene Set Enrichment Analysis: GO and KEGG database
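A simple way to test whether a GO term or KEGG pathway is enriched among the differentially expressed genes is an over-representation test based on the hypergeometric distribution. This is a simpler technique than rank-based Gene Set Enrichment Analysis and is shown only as an illustration. The function name and all numbers are hypothetical.

```python
from math import comb


def overrepresentation_p(population, in_set, drawn, overlap):
    """P(X >= overlap) for a hypergeometric X.

    population: number of genes tested
    in_set:     genes annotated to the GO term or pathway
    drawn:      differentially expressed genes
    overlap:    differentially expressed genes annotated to the set
    """
    upper = min(in_set, drawn)
    favourable = sum(
        comb(in_set, i) * comb(population - in_set, drawn - i)
        for i in range(overlap, upper + 1)
    )
    return favourable / comb(population, drawn)


# Made-up example: 40 of 200 differentially expressed genes belong to a
# pathway of 300 genes, out of 20,000 genes tested (3 expected by chance).
print(overrepresentation_p(20000, 300, 200, 40))
```

Each gene set is tested separately, so the resulting p-values also need a multiple testing correction.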
Fusion Gene Detection
Fig 1. RNA-seq mapping of short reads across exon-exon junctions. Depending on where each end of the read maps, the junction is classified as a trans or a cis event. (Wikipedia)
FusionCatcher Tool
FusionCatcher outperforms other tools by using multiple aligners:
❖ Bowtie
❖ Bowtie2
❖ BLAT
❖ STAR
RNA-seq de novo Assembly
❖ Define the whole transcriptome without a reference.
❖ Trinity
RNA-seq Analysis Software
RNA-seq Molmed Courses
❖ Basic Course on 'R'
❖ Galaxy for NGS
❖ Workshop Ingenuity Pathway Analysis (IPA) + CLC Workbench / Ingenuity Variant Analysis
❖ Gene expression data analysis using R: How to make sense out of your RNA-Seq/microarray data
Take Home Message
Think before you start
Thank you for your attention
Hematology Mathijs A. Sanders Remco Hoogenboezem
CCBC Job van Riet Wesley van de Geer
https://ccbc.erasmusmc.nl
@ErasmusMC_CCBC