![Page 1: Introductin to RNAseq & Differential Gene Expression](https://reader034.fdocuments.in/reader034/viewer/2022052321/554e8d4db4c90573338b4bb4/html5/thumbnails/1.jpg)
Introduction to RNA-seq &
Differential Gene Expression
By : Amit Kumar Singh
![Page 2: Introductin to RNAseq & Differential Gene Expression](https://reader034.fdocuments.in/reader034/viewer/2022052321/554e8d4db4c90573338b4bb4/html5/thumbnails/2.jpg)
Next Generation Sequencing data
Next Generation Sequencing: Possibilities
Massive growth in the amount of data generated since 2006
• Traditional Sanger vs. next-generation sequencing methods:
  - Reduced cost per base
  - Reduced sequencing time
  - Coverage of a wide range of applications
Comparative costs: sequencing a human genome
Sequencing has become cheaper and faster, but analysis of the data remains complex and time-consuming.
![Page 3: Introductin to RNAseq & Differential Gene Expression](https://reader034.fdocuments.in/reader034/viewer/2022052321/554e8d4db4c90573338b4bb4/html5/thumbnails/3.jpg)
[Figure: genome, transcriptome, mRNA, and others]
What is a Transcriptome?
• The complete set of all RNA molecules in a cell. It includes mRNA, rRNA, tRNA, and other non-coding RNA.
• In a narrower sense, the array of mRNA transcripts produced in a particular cell or tissue type.
• Transcriptomics, also referred to as expression profiling, examines the expression level of mRNAs in a given cell population.
GENOME vs. TRANSCRIPTOME
• Genome: content is fixed
• Transcriptome: content is time- and cell-specific and is much more complex than the genome
![Page 4: Introductin to RNAseq & Differential Gene Expression](https://reader034.fdocuments.in/reader034/viewer/2022052321/554e8d4db4c90573338b4bb4/html5/thumbnails/4.jpg)
NGS Advantages
• Next-generation sequencing (NGS) of cDNA (RNA-Seq) is becoming more widely adopted for transcriptome profiling.
• Dropping prices and maturing technology are making NGS the technology of choice.
• RNA-Seq does not depend on an existing genome annotation:
  - Transcript reconstruction in non-model organisms
  - Transcript verification in model organisms
• RNA-Seq is the method of choice in projects using non-model organisms and for novel transcript discovery and genome annotation.
• Accurate determination of expression levels
Cons
• Current wet-lab RNA-Seq strategies require lengthy library preparation procedures.
![Page 5: Introductin to RNAseq & Differential Gene Expression](https://reader034.fdocuments.in/reader034/viewer/2022052321/554e8d4db4c90573338b4bb4/html5/thumbnails/5.jpg)
Different types of RNA
![Page 6: Introductin to RNAseq & Differential Gene Expression](https://reader034.fdocuments.in/reader034/viewer/2022052321/554e8d4db4c90573338b4bb4/html5/thumbnails/6.jpg)
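Transcripts and alternative splicing
• An RNA transcript is the code copied from one strand of DNA (known as the template strand).
• The primary transcript (pre-mRNA) is processed in the nucleus: introns are spliced out and exons are joined.
• In alternative splicing, different combinations of exons are joined, so one gene can give rise to several transcripts.
• The mature mRNA is the strand that carries the code out of the nucleus and into the cytoplasm.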
![Page 7: Introductin to RNAseq & Differential Gene Expression](https://reader034.fdocuments.in/reader034/viewer/2022052321/554e8d4db4c90573338b4bb4/html5/thumbnails/7.jpg)
Transcripts sharing same TSS or CDS
![Page 8: Introductin to RNAseq & Differential Gene Expression](https://reader034.fdocuments.in/reader034/viewer/2022052321/554e8d4db4c90573338b4bb4/html5/thumbnails/8.jpg)
What is RNA-seq?
• Sequencing-based method to study the transcriptome
• Use of next-generation sequencing (NGS) technology to measure RNA levels
• Generating and sequencing ‘reads’ from cDNA
• Mapping reads to a reference genome
• Quantification of assembled reads
![Page 9: Introductin to RNAseq & Differential Gene Expression](https://reader034.fdocuments.in/reader034/viewer/2022052321/554e8d4db4c90573338b4bb4/html5/thumbnails/9.jpg)
Experiment design: Replicates
• Technical replicates: repeated measurements of a quantity from one source.
  E.g.: 5 samples from a single patient suffering from lung cancer
• Biological replicates: measurements of a quantity from different sources under the same conditions.
  E.g.: 5 samples, one from each of 5 different patients suffering from lung cancer
Use of replicates
  - Minimizes the effect of experimental variation or artifacts
  - Improves results by averaging out noise
  - The more data, the more robust the statistical test and the greater its power to detect real differences
![Page 10: Introductin to RNAseq & Differential Gene Expression](https://reader034.fdocuments.in/reader034/viewer/2022052321/554e8d4db4c90573338b4bb4/html5/thumbnails/10.jpg)
Analysis tools
• Read alignment
• Transcript assembly or genome annotation
• Transcript and gene quantification
![Page 11: Introductin to RNAseq & Differential Gene Expression](https://reader034.fdocuments.in/reader034/viewer/2022052321/554e8d4db4c90573338b4bb4/html5/thumbnails/11.jpg)
RNA-Seq analysis pipeline for detecting differential expression
![Page 12: Introductin to RNAseq & Differential Gene Expression](https://reader034.fdocuments.in/reader034/viewer/2022052321/554e8d4db4c90573338b4bb4/html5/thumbnails/12.jpg)
An overview of RNAseq for Differential gene expression
![Page 13: Introductin to RNAseq & Differential Gene Expression](https://reader034.fdocuments.in/reader034/viewer/2022052321/554e8d4db4c90573338b4bb4/html5/thumbnails/13.jpg)
Tuxedo Pipeline for RNAseq analysis
![Page 14: Introductin to RNAseq & Differential Gene Expression](https://reader034.fdocuments.in/reader034/viewer/2022052321/554e8d4db4c90573338b4bb4/html5/thumbnails/14.jpg)
Mapping
• Objective: to find the unique location where a short read is identical to the reference.
• Reality: the reference is never a perfect representation of the actual biological source of the RNA being sequenced.
  - The sample has its own attributes, such as SNPs and indels.
  - Short reads can align perfectly to multiple locations and can contain sequencing errors.
• The real task is to find the location where each short read best matches the reference, allowing for errors and structural variation.
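The "best match, allowing for errors" idea can be illustrated with a minimal sketch. It is a brute-force toy that counts mismatches at every position, not how Bowtie or TopHat actually search; the sequences and the function name `best_match` are made up for illustration.

```python
def best_match(read, reference):
    """Return (position, mismatches) for the reference window with the fewest mismatches.

    Toy brute-force scan: substitutions only, no indels, no index structure.
    """
    best_pos, best_mm = -1, len(read) + 1
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = sum(1 for a, b in zip(read, window) if a != b)
        if mismatches < best_mm:  # strict '<' keeps the left-most best window
            best_pos, best_mm = pos, mismatches
    return best_pos, best_mm


# Hypothetical data: the read carries one base that differs from the reference,
# standing in for a SNP or a sequencing error.
reference = "ACGTTGCAAGGCTTACGATCG"
read = "GGCTAACG"

position, mismatches = best_match(read, reference)
print(f"best position (0-based): {position}, mismatches: {mismatches}")
```

Real aligners avoid this exhaustive scan by searching a compressed index of the reference.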
![Page 15: Introductin to RNAseq & Differential Gene Expression](https://reader034.fdocuments.in/reader034/viewer/2022052321/554e8d4db4c90573338b4bb4/html5/thumbnails/15.jpg)
Problem in mapping reads that span splice junctions
These reads are aligned on the reference genome, where the two exons they cover are separated by an intron.
![Page 16: Introductin to RNAseq & Differential Gene Expression](https://reader034.fdocuments.in/reader034/viewer/2022052321/554e8d4db4c90573338b4bb4/html5/thumbnails/16.jpg)
![Page 17: Introductin to RNAseq & Differential Gene Expression](https://reader034.fdocuments.in/reader034/viewer/2022052321/554e8d4db4c90573338b4bb4/html5/thumbnails/17.jpg)
Splice junction aligners break junction-spanning reads into segments and index the information
![Page 18: Introductin to RNAseq & Differential Gene Expression](https://reader034.fdocuments.in/reader034/viewer/2022052321/554e8d4db4c90573338b4bb4/html5/thumbnails/18.jpg)
Mapping: Challenges
• Multimaps: reads that map equally well to several locations
• Treatment of multimaps: discard them
• Paired-end reads reduce the problem of multi-mapping
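A small sketch of the "discard multimaps" treatment, assuming exact matching only. The read names, sequences, and helper names (`all_hits`, `keep_unique`) are hypothetical.

```python
def all_hits(read, reference):
    """Return every 0-based position where the read matches the reference exactly."""
    hits = []
    start = reference.find(read)
    while start != -1:
        hits.append(start)
        start = reference.find(read, start + 1)  # allow overlapping hits
    return hits


def keep_unique(reads, reference):
    """Keep only reads with exactly one hit; multimaps and unmapped reads are dropped."""
    unique = {}
    for name, seq in reads.items():
        hits = all_hits(seq, reference)
        if len(hits) == 1:
            unique[name] = hits[0]
    return unique


# Hypothetical data: read_a occurs twice in the reference, read_b once, read_c never.
reference = "ACGTACGTTTGACCA"
reads = {"read_a": "ACGT", "read_b": "TTGACC", "read_c": "GGGG"}

print(all_hits(reads["read_a"], reference))
print(keep_unique(reads, reference))
```

Discarding is the simplest policy; it loses signal from repeated regions, which is one reason paired-end reads help.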
![Page 19: Introductin to RNAseq & Differential Gene Expression](https://reader034.fdocuments.in/reader034/viewer/2022052321/554e8d4db4c90573338b4bb4/html5/thumbnails/19.jpg)
TopHat algorithm
• A splice junction mapper
• Initial mapping onto the genome (exons) by Bowtie, an ultrafast short read aligner
• Builds a database of possible splice junctions
• Maps the unmapped reads against the junctions
• Also splits the unmapped reads into smaller fragments to map onto exons
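The idea of splitting an unmapped read so that its pieces land on two exons can be sketched with a toy example. This is not TopHat's implementation: it uses exact string search, takes only the first occurrence of each piece, and requires a canonical GT...AG intron as a simplifying assumption. All sequences and the name `find_junction` are invented for illustration.

```python
def find_junction(read, genome, min_anchor=3, min_intron=5):
    """Try each split point; return (intron_start, intron_end) as a 0-based half-open span.

    Toy sketch: exact matches only, first occurrence only, GT...AG introns only.
    """
    for split in range(min_anchor, len(read) - min_anchor + 1):
        left, right = read[:split], read[split:]
        left_pos = genome.find(left)
        if left_pos == -1:
            continue
        donor = left_pos + len(left)          # first base after the left piece
        if genome[donor:donor + 2] != "GT":   # intron must start with GT
            continue
        right_pos = genome.find(right, donor + min_intron)
        if right_pos == -1:
            continue
        if genome[right_pos - 2:right_pos] != "AG":  # intron must end with AG
            continue
        return donor, right_pos
    return None


# Hypothetical gene: two exons separated by one intron.
exon1 = "ATGGCCATTG"
intron = "GTAAGTCCCTTTTCCAG"
exon2 = "TACCGGATCA"
genome = exon1 + intron + exon2

# A read from the spliced transcript covering the exon1/exon2 junction.
read = exon1[-5:] + exon2[:5]

print(genome.find(read))            # the whole read is not found in the genome
print(find_junction(read, genome))  # but its two pieces are
```

The boundary check matters here: without it, the shared G at the end of exon 1 and of the intron would let the split slide by one base.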
![Page 20: Introductin to RNAseq & Differential Gene Expression](https://reader034.fdocuments.in/reader034/viewer/2022052321/554e8d4db4c90573338b4bb4/html5/thumbnails/20.jpg)
Input to know: GTF file
• GTF: Gene Transfer Format
• A reference GTF file is a collection of every transcript (genes and their isoforms + non-coding RNA transcripts)
• Available from the genome databases ENSEMBL, UCSC, and RefSeq
Sample reference GTF file format
[Figure: sample GTF rows, with the columns chr, source, start, end, strand, and attributes of transcripts labeled]
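To make the column layout concrete, here is a minimal sketch that parses one GTF line into its nine tab-separated fields. The example line, gene names, and the function name `parse_gtf_line` are hypothetical; splitting attributes on ";" is a simplification that assumes no semicolons inside quoted values.

```python
GTF_COLUMNS = ["seqname", "source", "feature", "start", "end",
               "score", "strand", "frame", "attributes"]


def parse_gtf_line(line):
    """Parse one GTF line into a dict; the attributes column becomes a nested dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated fields, got {len(fields)}")
    record = dict(zip(GTF_COLUMNS, fields))
    record["start"] = int(record["start"])  # GTF coordinates are 1-based, inclusive
    record["end"] = int(record["end"])

    attributes = {}
    for item in record["attributes"].split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        attributes[key] = value.strip().strip('"')
    record["attributes"] = attributes
    return record


# Hypothetical exon record on chr19.
line = ('chr19\tprotein_coding\texon\t1000\t1200\t.\t+\t.\t'
        'gene_id "GENE1"; transcript_id "GENE1.1"; exon_number "1";')

exon = parse_gtf_line(line)
print(exon["seqname"], exon["start"], exon["end"], exon["strand"])
print(exon["attributes"]["gene_id"], exon["attributes"]["transcript_id"])
```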
![Page 21: Introductin to RNAseq & Differential Gene Expression](https://reader034.fdocuments.in/reader034/viewer/2022052321/554e8d4db4c90573338b4bb4/html5/thumbnails/21.jpg)
Mapping with TopHat: how to use it
TopHat is a splice junction aligner. At the back end it uses Bowtie to map short reads onto the genome.
Bowtie uses an extremely economical data structure called the FM index to store the reference genome sequence, which allows it to be searched rapidly.
Indexing of the reference genome:
E.g.: the reference genome is chr19.fa. The reference genome is indexed with the Bowtie 2 utility bowtie2-build.
bowtie2-build <Ref genome fasta> <prefix>
[user]$ bowtie2-build chr19.fa chr19
![Page 22: Introductin to RNAseq & Differential Gene Expression](https://reader034.fdocuments.in/reader034/viewer/2022052321/554e8d4db4c90573338b4bb4/html5/thumbnails/22.jpg)
TopHat commands
(i) Mapping without a reference annotation
[user]$ tophat chr19 reads1.fastq reads2.fastq
(ii) Mapping with a reference annotation
TopHat uses the reference annotation (GTF) for the locations of known splice junctions, which gives better mapping.
[user]$ tophat -G chr19.gtf chr19 reads1.fastq reads2.fastq
(iii) Mapping only to the reference annotation
[user]$ tophat -G chr19.gtf --no-novel-juncs chr19 reads1.fastq reads2.fastq
Note: The Gene Transfer Format (GTF) is a file format used to hold information about gene structure.
![Page 23: Introductin to RNAseq & Differential Gene Expression](https://reader034.fdocuments.in/reader034/viewer/2022052321/554e8d4db4c90573338b4bb4/html5/thumbnails/23.jpg)
New feature: mapping on the transcriptome
With this new feature of TopHat you can map your reads directly onto the transcriptome.
• When TopHat is given a known transcript file (the -G/--GTF option above), a transcriptome sequence file is built.
• Bowtie then creates the index for these new transcriptome sequences.
• Reads are then aligned to these known transcripts.
First time:
[user]$ tophat -o output_sample1 -G chr19.gtf --transcriptome-index=transcriptome/known chr19 sample1_1.fastq sample1_2.fastq
Once the transcriptome index has been built, there is no need to specify the -G option the next time you run TopHat for other samples.
Next time (mapping on the existing transcriptome index):
[user]$ tophat -o output_sample2 --transcriptome-index=transcriptome/known chr19 sample2_1.fastq sample2_2.fastq
![Page 24: Introductin to RNAseq & Differential Gene Expression](https://reader034.fdocuments.in/reader034/viewer/2022052321/554e8d4db4c90573338b4bb4/html5/thumbnails/24.jpg)
Output of Tophat
1. accepted_hits.bam. A list of read alignments in BAM format.
2. junctions.bed. A UCSC BED track of junctions reported by TopHat. The score is the number of alignments spanning the junction.
![Page 25: Introductin to RNAseq & Differential Gene Expression](https://reader034.fdocuments.in/reader034/viewer/2022052321/554e8d4db4c90573338b4bb4/html5/thumbnails/25.jpg)
Alignments are reported in BAM files
BAM is the compressed, binary version of SAM, a flexible and general purpose read alignment format.
Many downstream analysis tools accept SAM and BAM as input.
There are also numerous utilities for viewing and manipulating SAM and BAM files. Perhaps the most popular among these is SAMtools.
![Page 26: Introductin to RNAseq & Differential Gene Expression](https://reader034.fdocuments.in/reader034/viewer/2022052321/554e8d4db4c90573338b4bb4/html5/thumbnails/26.jpg)
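SAM Format
[Figure: an example SAM alignment line with its fields labeled]
• Read name
• Flag
• Chromosome
• Position
• Mapping quality
• CIGAR string: describes the position of insertions/deletions/matches in the alignment and, for example, encodes splice junctions
• Name of the mate's reference ("=" means the mate maps to the same reference, as is typical for paired-end (PE) reads)
• Insert length
• Sequence
• Quality values
For more information: http://samtools.sourceforge.net/samtools.shtml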
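As an illustration of how these fields and the CIGAR string fit together, the sketch below parses one SAM line and recovers the splice junction encoded by an `N` operation. The alignment record is invented, and the helper names (`parse_sam_line`, `parse_cigar`, `splice_junctions`) are hypothetical; only the 11 mandatory SAM columns are handled.

```python
import re

SAM_FIELDS = ["qname", "flag", "rname", "pos", "mapq", "cigar",
              "rnext", "pnext", "tlen", "seq", "qual"]
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_sam_line(line):
    """Split one alignment line into the 11 mandatory SAM fields."""
    cols = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, cols[:11]))
    for key in ("flag", "pos", "mapq", "pnext", "tlen"):
        record[key] = int(record[key])
    return record


def parse_cigar(cigar):
    """Turn '20M500N30M' into [(20, 'M'), (500, 'N'), (30, 'M')]."""
    return [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]


def splice_junctions(pos, cigar):
    """Return introns as 1-based inclusive (start, end) pairs, one per N operation."""
    junctions = []
    ref = pos
    for length, op in parse_cigar(cigar):
        if op == "N":
            junctions.append((ref, ref + length - 1))
        if op in "MDN=X":  # these operations consume reference bases
            ref += length
    return junctions


# Hypothetical paired-end read spanning a 500 bp intron on chr19.
line = "\t".join(["read1", "99", "chr19", "1000", "50", "20M500N30M",
                  "=", "1700", "750", "A" * 50, "I" * 50])

rec = parse_sam_line(line)
print(rec["qname"], rec["rname"], rec["pos"], rec["cigar"])
print("paired:", bool(rec["flag"] & 0x1), "mate on same reference:", rec["rnext"] == "=")
print("junctions:", splice_junctions(rec["pos"], rec["cigar"]))
```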
![Page 27: Introductin to RNAseq & Differential Gene Expression](https://reader034.fdocuments.in/reader034/viewer/2022052321/554e8d4db4c90573338b4bb4/html5/thumbnails/27.jpg)
BED format
[Figure: junctions.bed rows with the columns chr, start & end, junction ID, and optional fields labeled]
Junctions viewed in IGV
![Page 28: Introductin to RNAseq & Differential Gene Expression](https://reader034.fdocuments.in/reader034/viewer/2022052321/554e8d4db4c90573338b4bb4/html5/thumbnails/28.jpg)
Analysis with samtools
(i) View the BAM file
[user]$ samtools view accepted_hits.bam
(ii) Convert the BAM file into a non-binary SAM file
[user]$ samtools view accepted_hits.bam > accepted_hits.sam
(iii) Count the number of lines in the SAM file
[user]$ wc -l accepted_hits.sam
(iv) Sort the BAM file
[user]$ samtools sort accepted_hits.bam outprefix
Note: this is the samtools 0.1.x syntax. With samtools 1.x the equivalent is: samtools sort -o outprefix.bam accepted_hits.bam
(v) Index the BAM file
[user]$ samtools index accepted_hits.bam
(vi) Get the statistics of the BAM file
[user]$ samtools flagstat accepted_hits.bam
![Page 29: Introductin to RNAseq & Differential Gene Expression](https://reader034.fdocuments.in/reader034/viewer/2022052321/554e8d4db4c90573338b4bb4/html5/thumbnails/29.jpg)
Cufflinks
Cufflinks is used to generate a transcriptome assembly for each sample. It assembles individual transcripts from RNA-seq reads that have been aligned to the genome.
![Page 30: Introductin to RNAseq & Differential Gene Expression](https://reader034.fdocuments.in/reader034/viewer/2022052321/554e8d4db4c90573338b4bb4/html5/thumbnails/30.jpg)
Normalization
• More reads are mapped to a transcript if it is
  - long
  - at a higher depth of coverage (high expression)
• Normalize so that features of different lengths and from different conditions can be compared
• Need for normalization: to reduce bias within a sample or between different sample conditions
• FPKM is one such normalization strategy, adopted by Cufflinks.
![Page 31: Introductin to RNAseq & Differential Gene Expression](https://reader034.fdocuments.in/reader034/viewer/2022052321/554e8d4db4c90573338b4bb4/html5/thumbnails/31.jpg)
FPKM
$$\mathrm{FPKM} = \frac{10^{9} \times C}{N \times L}$$
where
C = the number of reads mapped onto the gene's exons
N = the total number of reads in the experiment
L = the sum of the lengths of the exons in base pairs
• Cufflinks estimates abundance values in FPKM (fragments per kilobase of transcript per million mapped fragments).
• FPKM values let Cufflinks compare expression levels for different genes and transcripts across runs.
• FPKM is a measure of how many reads have been recorded for each transcript, normalized by transcript length and the total number of reads.
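A worked example of the FPKM formula (10^9 × C divided by N × L), with made-up counts. The function name `fpkm` and all numbers are illustrative assumptions; Cufflinks itself estimates abundances with a statistical model rather than this direct count.

```python
def fpkm(c, n, l):
    """FPKM = 1e9 * C / (N * L).

    c: fragments mapped to the gene's exons
    n: total mapped fragments in the experiment
    l: summed exon length in base pairs
    """
    return 1e9 * c / (n * l)


# Hypothetical experiment with 20 million mapped fragments.
total_fragments = 20_000_000

short_gene = fpkm(c=500, n=total_fragments, l=2_000)
long_gene = fpkm(c=500, n=total_fragments, l=4_000)

print(f"500 fragments on a 2 kb gene: {short_gene} FPKM")
print(f"500 fragments on a 4 kb gene: {long_gene} FPKM")
```

The same fragment count on a gene twice as long gives half the FPKM, which is the length correction the normalization is meant to provide.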
![Page 32: Introductin to RNAseq & Differential Gene Expression](https://reader034.fdocuments.in/reader034/viewer/2022052321/554e8d4db4c90573338b4bb4/html5/thumbnails/32.jpg)
Visualizing data on IGV
[Figure: IGV view with intronic regions, exonic regions, and a non-coding exonic region labeled]
![Page 33: Introductin to RNAseq & Differential Gene Expression](https://reader034.fdocuments.in/reader034/viewer/2022052321/554e8d4db4c90573338b4bb4/html5/thumbnails/33.jpg)
Read mapping on a gene in the BAM file
![Page 34: Introductin to RNAseq & Differential Gene Expression](https://reader034.fdocuments.in/reader034/viewer/2022052321/554e8d4db4c90573338b4bb4/html5/thumbnails/34.jpg)
Cuffmerge
The assemblies generated by Cufflinks are then merged together using the Cuffmerge utility (a part of the Cufflinks package).
This merged assembly provides a uniform basis for calculating gene and transcript expression in each condition.
Command:
[user]$ cuffmerge -s genome/chr19.fa -g chromosome10.gtf assembly_GTF.list
where assembly_GTF.list is a text file that contains the paths of all the Cufflinks assemblies you want to merge.
![Page 35: Introductin to RNAseq & Differential Gene Expression](https://reader034.fdocuments.in/reader034/viewer/2022052321/554e8d4db4c90573338b4bb4/html5/thumbnails/35.jpg)
Cuffdiff: protocol to estimate differential gene expression
• Calculates expression levels and tests the statistical significance of observed changes.
• Fisher’s test
• Estimates the log2 fold change, log2(FPKM_B / FPKM_A)
• Cuffdiff reports numerous output files containing the results of its differential analysis of the samples.
• These files contain statistical values such as fold change and P values, gene and transcript features such as common name and location in the genome, and the FPKM values for each feature.
Command:
[user]$ cuffdiff merged_asm/merged.gtf sample1/tophat_out/accepted_hits.bam sample2/tophat_out/accepted_hits.bam
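The log2 fold change log2(FPKM_B / FPKM_A) can be worked through with a few invented values. The function name and the handling of zero FPKM (returning infinities or NaN) are assumptions of this sketch, not a description of Cuffdiff's internals.

```python
import math


def log2_fold_change(fpkm_a, fpkm_b):
    """Return log2(FPKM_B / FPKM_A); zero values are handled by convention (an assumption)."""
    if fpkm_a == 0 and fpkm_b == 0:
        return math.nan       # not expressed in either condition
    if fpkm_a == 0:
        return math.inf       # expressed only in condition B
    if fpkm_b == 0:
        return -math.inf      # expressed only in condition A
    return math.log2(fpkm_b / fpkm_a)


# Hypothetical genes with FPKM in condition A and condition B.
genes = {"geneX": (5.0, 20.0), "geneY": (8.0, 2.0), "geneZ": (3.0, 3.0)}

for name, (a, b) in genes.items():
    print(name, log2_fold_change(a, b))
```

A value of 2 means a four-fold increase in condition B, -2 a four-fold decrease, and 0 no change. The fold change alone says nothing about significance, which is why the P values in the output files matter.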
![Page 36: Introductin to RNAseq & Differential Gene Expression](https://reader034.fdocuments.in/reader034/viewer/2022052321/554e8d4db4c90573338b4bb4/html5/thumbnails/36.jpg)
gene_exp.diff
FPKM values Significant gene
![Page 37: Introductin to RNAseq & Differential Gene Expression](https://reader034.fdocuments.in/reader034/viewer/2022052321/554e8d4db4c90573338b4bb4/html5/thumbnails/37.jpg)
[Figure: workflow of the pipeline]
• Inputs: reads from healthy tissue and diseased tissue, reference genome A (.fa file), and the reference annotation Ref A.gtf (ENSEMBL is shown as the source of the reference data)
• tophat: maps each sample to the reference genome, producing Healthy_hits.bam and Diseased_hits.bam
• cufflinks: assembles each sample, producing Transcripts_healthy.gtf and Transcripts_diseased.gtf
• cuffmerge: merges the assemblies into Healthy_disease.merged.gtf
• cuffdiff: compares the two conditions, producing Gene.diff
![Page 38: Introductin to RNAseq & Differential Gene Expression](https://reader034.fdocuments.in/reader034/viewer/2022052321/554e8d4db4c90573338b4bb4/html5/thumbnails/38.jpg)
Tools used in the RNA-seq exercise
[Figure: flowchart with the labels Samples, Tophat, Cufflinks, Cuffdiff, HTSeq, DESeq (R package), Fold change, and Expression Analysis]
![Page 39: Introductin to RNAseq & Differential Gene Expression](https://reader034.fdocuments.in/reader034/viewer/2022052321/554e8d4db4c90573338b4bb4/html5/thumbnails/39.jpg)
Gene set enrichment analysis by DAVID
• Identifies the GO terms that are significantly over-represented in a given gene list.
• A hypergeometric statistical test is performed to identify such terms.
Simple example:
• Your statistically significant gene list = 694 genes (each gene associated with GO terms)
• Total genes in the organism = 10,738
• Total genes in the organism with the GO term "cell division" (biological process) = 634
• Out of the 694 genes, 107 have the "cell division" GO term. By chance alone, about 41 would be expected (694 × 634 / 10,738).
• The hypergeometric test gives the statistical confidence that 107 is more than expected by chance, that is, that the term is over-represented.
You can conclude that cell division is altered between the normal and the treatment sample.
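The numbers in this example can be checked with a short sketch of a one-sided hypergeometric test, using only the Python standard library. This shows the textbook test, not DAVID's exact procedure (which may apply its own adjustments); the function name `hypergeom_upper_tail` is hypothetical.

```python
from math import comb


def hypergeom_upper_tail(population, annotated, sample, observed):
    """P(X >= observed) when drawing `sample` genes without replacement from
    `population` genes, of which `annotated` carry the GO term."""
    total = comb(population, sample)
    upper = min(annotated, sample)
    favourable = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(observed, upper + 1)
    )
    return favourable / total  # exact big-integer ratio, rounded once to float


population = 10_738   # total genes in the organism
annotated = 634       # genes with the "cell division" GO term
sample = 694          # statistically significant gene list
observed = 107        # genes in the list with the GO term

expected = sample * annotated / population
p_value = hypergeom_upper_tail(population, annotated, sample, observed)

print(f"expected by chance: {expected:.1f} genes, observed: {observed}")
print(f"P(X >= {observed}) = {p_value:.3g}")
```

Observing 107 genes where roughly 41 are expected gives a very small p-value, which is what "over-represented" means here. In practice many GO terms are tested at once, so a multiple-testing correction would also be applied.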
![Page 40: Introductin to RNAseq & Differential Gene Expression](https://reader034.fdocuments.in/reader034/viewer/2022052321/554e8d4db4c90573338b4bb4/html5/thumbnails/40.jpg)
Submit your genelist
![Page 41: Introductin to RNAseq & Differential Gene Expression](https://reader034.fdocuments.in/reader034/viewer/2022052321/554e8d4db4c90573338b4bb4/html5/thumbnails/41.jpg)
Annotation summary results
![Page 42: Introductin to RNAseq & Differential Gene Expression](https://reader034.fdocuments.in/reader034/viewer/2022052321/554e8d4db4c90573338b4bb4/html5/thumbnails/42.jpg)
![Page 43: Introductin to RNAseq & Differential Gene Expression](https://reader034.fdocuments.in/reader034/viewer/2022052321/554e8d4db4c90573338b4bb4/html5/thumbnails/43.jpg)
Questions ??
THANK YOU