Differential Gene Expression Analysis using RNA-Seq Data
RNA-Seq Data
1. Biologists collect mRNA from many cells
2. Cells come from two or more biological samples (different tissues)
RNA-Seq Data Generation
3. Collected mRNA is fragmented, size-selected, and sequenced
4. Sequenced reads are mapped back to the reference genome
Mapping RNA-Seq Reads
• Mapping a read to a reference genome means finding the position within the genome where the read comes from
• Reads containing splice junctions cannot be mapped to a reference genome directly
• Ways to map reads with splice junctions:
– Use special algorithms/methods/techniques
– Map the reads to annotated transcriptome
Mapping RNA-Seq Reads
• Example: the read TCAAG occurs at position 10 in the given reference genome (if position count starts with 1)
• Since mRNA are collected from many cells:
– reads cover the entire lengths of exons
– overlapping reads come from different mRNA molecules
Mapping RNA-Seq Reads
Mapping concerns:
1. A read that is mapped to two or more locations in a reference genome is called ambiguous and is discarded from the analysis
2. Two reads are called copy-duplicates if they are mapped to the same start position in the genome (these might be the product of the polymerase chain reaction, PCR, which is used to make copies of mRNA segments to make sequencing possible). Only one read of each set of copy-duplicates is used in the analysis
• The number of reads mapped to a single gene/transcript/exon (the read count) is used to estimate differential gene expression
• Given two (or more) samples, find the read count for one sample and for the other sample, and use statistics to infer whether these counts are significantly different
• Estimating a read count for a transcript of a gene is not trivial:
– Alternative splicing (if a read is mapped to an exon shared by two or more transcripts, we cannot be certain which transcript the read comes from)
– Overlapping genes (uncertainty in counting a read that mapped to a region belonging to two or more overlapping genes)
Mapping RNA-Seq Reads
• Estimating a read count for a transcript of a gene is not trivial
• To remedy:
– Estimate read count for each gene or exon instead
– Use reads containing splice junctions
– In some cases, discard the read from the analysis
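The gene-level remedy above can be sketched as follows. This is a minimal illustration, not the method of any particular tool: the function name `count_reads_per_gene`, the gene coordinates, and the read intervals are all hypothetical, and reads are assumed to be already mapped to inclusive (start, end) intervals on a single chromosome. A read overlapping exactly one gene is counted for that gene; a read overlapping two or more genes is discarded.

```python
def count_reads_per_gene(read_intervals, genes):
    """Count reads per gene; discard reads that overlap more than one gene.

    read_intervals: list of (start, end) tuples, inclusive coordinates.
    genes: dict mapping gene name -> (start, end), inclusive coordinates.
    Returns (counts per gene, number of discarded reads).
    """
    counts = {name: 0 for name in genes}
    discarded = 0
    for r_start, r_end in read_intervals:
        # Two inclusive intervals overlap if each starts before the other ends.
        overlapping = [
            name
            for name, (g_start, g_end) in genes.items()
            if r_start <= g_end and r_end >= g_start
        ]
        if len(overlapping) == 1:
            counts[overlapping[0]] += 1
        elif len(overlapping) > 1:
            discarded += 1  # read falls in a region shared by overlapping genes
        # reads overlapping no gene are simply not counted
    return counts, discarded


# Hypothetical annotation: geneA and geneB overlap in the region 450..500.
genes = {"geneA": (100, 500), "geneB": (450, 900)}
reads = [(120, 169), (460, 509), (600, 649), (950, 999)]

counts, discarded = count_reads_per_gene(reads, genes)
print(counts, discarded)
```

Discarding is the most conservative of the remedies listed; counting per exon or using splice-junction reads would need transcript-level annotation that this sketch does not model.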
Workflow of RNA-Seq Differential Gene Expression Analysis
Adapted from “RNA-seq Data Analysis: A Practical Approach” by Eija Korpelainen, Jarno Tuimala, Panu Somervuo, Mikael Huss, and Garry Wong (Chapman & Hall/CRC Mathematical and Computational Biology)
Preprocessing
• Adapter trimming
• Trimming of low-quality read ends (3’ end)
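Both preprocessing steps can be illustrated with a small sketch. It is deliberately simplified and is not how tools such as Cutadapt work internally: the adapter is assumed to appear as an exact, complete match, base qualities are assumed to be given as a list of Phred scores, and the function names, the example read, and the threshold of 20 are hypothetical choices.

```python
def trim_adapter(read, quals, adapter):
    """Cut the read at the first exact occurrence of the adapter, if any."""
    idx = read.find(adapter)
    if idx == -1:
        return read, quals
    return read[:idx], quals[:idx]


def trim_low_quality_3prime(read, quals, min_quality=20):
    """Drop bases from the 3' end while their Phred score is below min_quality."""
    end = len(read)
    while end > 0 and quals[end - 1] < min_quality:
        end -= 1
    return read[:end], quals[:end]


# Hypothetical 20-base read: 10 bases of insert followed by adapter sequence.
read = "ACGTACGTTTAGATCGGAAG"
quals = [38, 38, 37, 36, 35, 34, 30, 28, 12, 8] + [30] * 10
adapter = "AGATCGGAAG"  # illustrative adapter sequence (assumption)

read, quals = trim_adapter(read, quals, adapter)      # removes the adapter
read, quals = trim_low_quality_3prime(read, quals)    # removes the low-quality tail
print(read)  # ACGTACGT
```

Trimming the adapter first matters here: the adapter bases have high quality scores, so quality trimming alone would not remove them.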
Adapter Trimming
More on adapter trimming:
http://www.ark-genomics.org/events-online-training-eu-training-course/adapter-and-quality-trimming-illumina-data
http://training.bioinformatics.ucdavis.edu/docs/2013/02/bootcamp/galaxy/_downloads/qa-and-i.pdf
Tools: FastQC, Cutadapt, qrqc, Scythe
Quality Control: Nucleotide Profile
Tools: FastQC, qrqc
Quality Control: Base Quality Profile
Quality Control: k-mer Enrichment
Quality Control: Read Length Distribution after Trimming
Quality Control: Statistics
Mapping RNA-Seq Reads
• Mapping to a reference genome
• Mapping to transcriptome
• Gene annotation information (start/end of exons in known genes)
1. Genes are located on both strands of DNA
2. Reads are always sequenced from 5’ to 3’
3. Mapping is performed only to the (+) strand of DNA
4. Map the reverse-complement of a read: ATTGC, rc: GCAAT
Example (+) strand sequence: GCAATCTGGC
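The reverse-complement step can be checked with a short sketch. It assumes the ten-base sequence GCAATCTGGC from the slide is the (+) strand reference and uses exact matching only; the helper names `reverse_complement` and `map_exact` are hypothetical.

```python
# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement every base, then reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def map_exact(read, reference):
    """Return the 1-based start positions of exact matches on the (+) strand."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start + 1)  # convert 0-based index to 1-based position
        start = reference.find(read, start + 1)
    return positions


reference = "GCAATCTGGC"  # assumed (+) strand reference from the slide
read = "ATTGC"

print(reverse_complement(read))                        # GCAAT
print(map_exact(read, reference))                      # [] : no match as sequenced
print(map_exact(reverse_complement(read), reference))  # [1]
```

The read as sequenced does not match the (+) strand, but its reverse complement does, which indicates that the read originated from the (–) strand.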
Mapping RNA-Seq Reads
Ambiguous Reads (identify and discard)
A read that is mapped with the same (smallest) number of mismatches to two or more locations in the genome
A read that is mapped to both + (positive) and – (negative) strands with the same smallest number of mismatches
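The two definitions above can be combined into one classification rule. The sketch below is a brute-force illustration under stated assumptions: only mismatches are considered (no indels), the mismatch limit of 1 is arbitrary, and the reference sequence, reads, and function names are hypothetical. Real aligners use indexed search rather than scanning every position.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def best_hits(read, reference, max_mismatches=1):
    """Return all (strand, 1-based position) hits sharing the smallest mismatch count."""
    queries = {"+": read, "-": reverse_complement(read)}
    hits = []
    for strand, query in queries.items():
        for start in range(len(reference) - len(query) + 1):
            mm = hamming(query, reference[start:start + len(query)])
            if mm <= max_mismatches:
                hits.append((mm, strand, start + 1))
    if not hits:
        return []
    best = min(mm for mm, _, _ in hits)
    return [(strand, pos) for mm, strand, pos in hits if mm == best]


def classify(read, reference, max_mismatches=1):
    """Label a read as 'unique', 'ambiguous', or 'unmapped'."""
    hits = best_hits(read, reference, max_mismatches)
    if not hits:
        return "unmapped"
    return "unique" if len(hits) == 1 else "ambiguous"


reference = "TTGACCATTGACCGG"  # hypothetical reference
print(classify("GACCA", reference))  # unique: one perfect hit beats a 1-mismatch hit
print(classify("TTGAC", reference))  # ambiguous: perfect hits at positions 1 and 8
print(classify("AAAAA", reference))  # unmapped
```

Note that "GACCA" also matches position 10 with one mismatch, but it is still unique because only hits with the smallest number of mismatches are compared.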
Mapping RNA-Seq Reads
Ambiguous:
Unique:
Mapping RNA-Seq Reads
• Sequencing instruments require a certain quantity of mRNA
• The polymerase chain reaction (PCR) produces multiple copies of mRNA segments
• Copies of the same segment are sequenced, producing copy duplicates (a product of PCR that is not related to the mRNA abundance in the biological sample)
Mapping RNA-Seq Reads
• Two reads are called copy-duplicates if they are mapped to the same start position in the genome (identify and count only one read)
• Copy duplicates can be generated only from the same sample
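A minimal sketch of this rule keeps the first read seen at each start position and drops the rest, treating each sample separately. The tuple layout (sample, chromosome, start, read ID) and all values are hypothetical; the definition above mentions only the start position, so grouping by chromosome as well is an added assumption.

```python
def remove_copy_duplicates(alignments):
    """Keep one read per (sample, chromosome, start position).

    alignments: list of (sample, chrom, start, read_id) tuples.
    Returns the kept alignments in their original order.
    """
    seen = set()
    kept = []
    for sample, chrom, start, read_id in alignments:
        key = (sample, chrom, start)
        if key not in seen:
            seen.add(key)
            kept.append((sample, chrom, start, read_id))
    return kept


# Hypothetical mapped reads from two samples.
alignments = [
    ("sample1", "chr1", 100, "r1"),
    ("sample1", "chr1", 100, "r2"),  # copy duplicate of r1
    ("sample1", "chr1", 105, "r3"),
    ("sample2", "chr1", 100, "r4"),  # same position, different sample: kept
]

kept = remove_copy_duplicates(alignments)
print([read_id for _, _, _, read_id in kept])  # ['r1', 'r3', 'r4']
```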
Mapping RNA-Seq Reads
• Collect mapping statistics:
– Total reads that were attempted for mapping
– Total unique reads mapped
– Total ambiguous reads mapped
– Total copy duplicates
– Distribution of reads by mismatches/indels
– Total reads mapped to splice-junctions
– GC bias in mapped reads
– Depth of coverage
– 3’ end gene bias (more reads mapped to 3’ end)
Counting Reads: HTSeq
Normalization
• Raw read count has to be normalized to enable comparison between samples
• RPKM: Reads Per Kilobase per Million mapped reads
• RPKM is the total number of raw reads mapped to a gene, divided by the length of the gene in kilobases and by the total number of mapped reads in millions
• Sometimes mappable length is used (since ambiguous reads are discarded, repeated regions within genes are not covered by reads)
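The RPKM definition translates directly into a one-line calculation. The sketch below uses hypothetical numbers for a single gene in two samples; the function name `rpkm` and its parameter names are illustrative. To use mappable length instead, the same function could be called with that length in place of the full gene length.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads Per Kilobase of gene per Million mapped reads."""
    length_kb = gene_length_bp / 1_000
    mapped_millions = total_mapped_reads / 1_000_000
    return read_count / length_kb / mapped_millions


# Hypothetical 2,000 bp gene in two samples sequenced to different depths.
sample1 = rpkm(read_count=500, gene_length_bp=2_000, total_mapped_reads=10_000_000)
sample2 = rpkm(read_count=1_000, gene_length_bp=2_000, total_mapped_reads=20_000_000)

print(sample1)  # 25.0
print(sample2)  # 25.0
```

The raw counts differ by a factor of two, but so does the sequencing depth, so the normalized values agree. This is why raw read counts cannot be compared between samples directly.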