Differential Gene Expression Analysis using RNA-Seq Data

RNA-Seq Data

1. Biologists collect mRNA from many cells

2. Cells come from two or more biological samples (different tissues)

RNA-Seq Data Generation

3. Collected mRNA is fragmented, size-selected, and sequenced

4. Sequenced reads are mapped back to the reference genome

Mapping RNA-Seq Reads

• Mapping a read to a reference genome means finding the position within the genome where the read comes from

• Reads containing splice junctions cannot be mapped to a reference genome directly, because the exons they span are separated by introns in the genome

• Ways to map reads with splice junctions:

– Use splice-aware algorithms/methods/techniques

– Map the reads to the annotated transcriptome


• Example: the read TCAAG occurs at position 10 in the given reference genome (if position count starts with 1)

• Since mRNA are collected from many cells:

– reads cover the entire lengths of exons

– overlapping reads come from different mRNA molecules
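The TCAAG example above can be sketched in a few lines of Python. This is a minimal illustration of exact-match mapping with 1-based positions, not a real aligner. The reference string is hypothetical (the slide's reference sequence is not reproduced in this transcript) and was chosen so that TCAAG starts at position 10; `find_read_position` is an illustrative name.

```python
from typing import Optional


def find_read_position(reference: str, read: str) -> Optional[int]:
    """Return the 1-based position of the first exact match, or None."""
    index = reference.find(read)  # str.find is 0-based and returns -1 if absent
    return index + 1 if index >= 0 else None


# Hypothetical reference, constructed so that TCAAG starts at position 10.
reference = "ACGTTGCAATCAAGGCTA"
print(find_read_position(reference, "TCAAG"))  # prints 10
```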


Mapping concerns:

1. A read that is mapped to two or more locations in a reference genome is called ambiguous and is discarded from the analysis

2. Two reads are called copy-duplicates if they are mapped to the same start position in the genome (these might be the product of the polymerase chain reaction, PCR, which is used to make copies of mRNA segments so that sequencing is possible). Only one read from each set of copy-duplicates is used in the analysis

• The number of reads mapped to a single gene/transcript/exon, read count, is used to estimate differential gene expression

• Given two (or more) samples, find the read count for one sample and for the other sample, and use statistics to infer whether these counts are significantly different

• Estimating a read count for a transcript of a gene is not trivial:

– Alternative splicing (if a read is mapped to an exon shared by two or more transcripts, we cannot be certain which transcript the read comes from)

– Overlapping genes (uncertainty in counting a read that maps to a region belonging to two or more overlapping genes)
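As a rough illustration of comparing the read counts of one gene between two samples, the sketch below applies a Pearson chi-square test to a 2x2 table (reads on the gene vs. all other mapped reads, in each sample). The choice of test is an assumption made for illustration, since the slides do not name one; it ignores biological replicates, and the function name and example numbers are hypothetical.

```python
import math


def chi_square_2x2(gene_a, total_a, gene_b, total_b):
    """Chi-square test (1 degree of freedom) on gene vs. other reads in two samples.

    Assumes both totals are positive and at least one read maps to the gene.
    Returns (chi-square statistic, p-value).
    """
    table = [[gene_a, total_a - gene_a], [gene_b, total_b - gene_b]]
    grand_total = total_a + total_b
    row_sums = [total_a, total_b]
    col_sums = [gene_a + gene_b, grand_total - gene_a - gene_b]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / grand_total
            chi2 += (table[i][j] - expected) ** 2 / expected
    # For 1 degree of freedom the upper tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts: 100 vs. 300 reads on the gene, 1,000,000 mapped reads per sample.
chi2, p_value = chi_square_2x2(100, 1_000_000, 300, 1_000_000)
print(chi2, p_value)
```

Dividing by the total number of mapped reads inside the table is what makes the two samples comparable when their sequencing depths differ.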


• To remedy:

– Estimate read count for each gene or exon instead

– Use reads containing splice junctions

– In some cases, discard the read from the analysis
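The gene-level remedy above can be sketched as follows: each read is assigned to a gene only if it overlaps exactly one gene, and reads overlapping several genes are discarded. This is a simplified illustration, not the algorithm of any particular tool; the gene names, coordinates, and function name are hypothetical, and strand and exon structure are ignored.

```python
def count_reads_per_gene(genes, reads):
    """Count reads per gene, discarding reads that overlap more than one gene.

    genes: dict mapping gene name -> (start, end), inclusive coordinates
    reads: list of (start, end) intervals, inclusive coordinates
    Returns (counts per gene, number of discarded reads).
    """
    counts = {name: 0 for name in genes}
    discarded = 0
    for read_start, read_end in reads:
        overlapping = [
            name
            for name, (gene_start, gene_end) in genes.items()
            if read_start <= gene_end and read_end >= gene_start
        ]
        if len(overlapping) == 1:
            counts[overlapping[0]] += 1
        elif len(overlapping) > 1:
            discarded += 1  # overlapping genes: cannot decide which one
        # reads overlapping no gene are simply not counted
    return counts, discarded


# Hypothetical annotation with two overlapping genes.
genes = {"geneA": (100, 500), "geneB": (450, 900)}
reads = [(120, 170), (300, 350), (460, 510), (600, 650), (1000, 1050)]
print(count_reads_per_gene(genes, reads))
```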


Workflow of RNA-Seq Differential Gene Expression Analysis

Adapted from “RNA-seq Data Analysis: A Practical Approach” by Eija Korpelainen, Jarno Tuimala, Panu Somervuo, Mikael Huss, and Garry Wong (Chapman & Hall/CRC Mathematical and Computational Biology)

Preprocessing

• Adapter trimming

• Trimming of low-quality read ends (3’ end)

Adapter Trimming

More on adapter trimming: http://www.ark-genomics.org/events-online-training-eu-training-course/adapter-and-quality-trimming-illumina-data http://training.bioinformatics.ucdavis.edu/docs/2013/02/bootcamp/galaxy/_downloads/qa-and-i.pdf

TOOLS: FastQC, Cutadapt, qrqc, Scythe
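To make the two preprocessing steps concrete, here is a deliberately simple sketch: it removes an adapter by exact string match and then drops 3’ bases whose Phred quality is below a threshold. It does not reproduce the algorithms of the tools listed above, which tolerate mismatches and partial adapters. The function names, the quality threshold of 20, the Phred+33 encoding, and the example read are assumptions.

```python
def trim_adapter(seq, qual, adapter):
    """Cut the read at the first exact occurrence of the adapter, if any."""
    index = seq.find(adapter)
    if index >= 0:
        return seq[:index], qual[:index]
    return seq, qual


def trim_low_quality_3prime(seq, qual, min_quality=20, offset=33):
    """Remove bases from the 3' end while their Phred score is below min_quality."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_quality:
        end -= 1
    return seq[:end], qual[:end]


# Hypothetical read: 8 bases of insert followed by an adapter sequence.
adapter = "AGATCGGAAGAGC"
seq = "ACGTACGT" + adapter
qual = "IIIIII##" + "I" * len(adapter)  # 'I' = Q40, '#' = Q2 in Phred+33

seq, qual = trim_adapter(seq, qual, adapter)
seq, qual = trim_low_quality_3prime(seq, qual)
print(seq, qual)  # prints ACGTAC IIIIII
```

Trimming the adapter first matters in this sketch: the high-quality adapter bases would otherwise shield the low-quality insert bases from the 3’ quality trim.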

Quality Control: Nucleotide Profile

TOOLS: FastQC, qrqc

Quality Control: Base Quality Profile

Quality Control: k-mer Enrichment

Quality Control: Read Length Distribution after Trimming

Quality Control: Statistics


• Mapping to a reference genome

• Mapping to transcriptome

• Gene annotation information (start/end of exons in known genes)

1. Genes are located on both strands of DNA

2. Reads are always sequenced from 5’ to 3’

3. Mapping is performed only against the (+) strand of DNA

4. The reverse complement (rc) of each read is also mapped, e.g. read ATTGC, rc GCAAT

G C A A T C T G G C
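The strand handling described above can be sketched as follows, assuming the sequence shown above (GCAATCTGGC) is the (+) strand reference for the ATTGC example. The read is searched as is, and if that fails its reverse complement is searched against the same (+) strand. Exact matching and the function names are simplifications for illustration.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement each base, then reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def map_read_both_strands(reference, read):
    """Return (1-based position, strand) of the first exact hit, or None.

    Only the (+) strand is stored; a hit of the reverse complement
    means the read originates from the (-) strand.
    """
    index = reference.find(read)
    if index >= 0:
        return index + 1, "+"
    index = reference.find(reverse_complement(read))
    if index >= 0:
        return index + 1, "-"
    return None


reference = "GCAATCTGGC"  # assumed to be the (+) strand from the slide
print(reverse_complement("ATTGC"))                # prints GCAAT
print(map_read_both_strands(reference, "ATTGC"))  # prints (1, '-')
```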



Ambiguous Reads (identify and discard)

A read that is mapped with the same (smallest) number of mismatches to two or more locations in the genome

A read that is mapped to both + (positive) and – (negative) strands with the same smallest number of mismatches
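The two definitions above can be turned into a small classifier: collect every position on either strand where the read matches within a mismatch limit, keep only the hits with the fewest mismatches, and call the read ambiguous if more than one such hit remains. This is a brute-force sketch for short sequences; the reference string, the mismatch limit of 1, and the function names are assumptions.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def count_mismatches(a, b):
    return sum(x != y for x, y in zip(a, b))


def best_hits(reference, read, max_mismatches=1):
    """Return all (mismatches, 1-based position, strand) hits with the fewest mismatches."""
    hits = []
    for strand, query in (("+", read), ("-", reverse_complement(read))):
        for start in range(len(reference) - len(query) + 1):
            window = reference[start:start + len(query)]
            mm = count_mismatches(window, query)
            if mm <= max_mismatches:
                hits.append((mm, start + 1, strand))
    if not hits:
        return []
    fewest = min(mm for mm, _, _ in hits)
    return [hit for hit in hits if hit[0] == fewest]


def classify_read(reference, read, max_mismatches=1):
    """Label a read as 'unmapped', 'unique', or 'ambiguous'."""
    hits = best_hits(reference, read, max_mismatches)
    if not hits:
        return "unmapped"
    return "unique" if len(hits) == 1 else "ambiguous"


reference = "TTGACCATTGTTGACGAT"  # hypothetical (+) strand
for read in ("CCATT", "TTGAC", "GGGGG"):
    print(read, classify_read(reference, read))
```

In this hypothetical reference, TTGAC matches exactly at two positions, so it would be discarded as ambiguous, while CCATT has a single best hit.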


[Figure: examples of an ambiguous read and a unique read]


• Sequencing instruments require a certain quantity of mRNA

• Polymerase chain reaction (PCR) produces multiple copies of mRNA segments

• Copies of the same segment are sequenced, producing copy duplicates (a product of PCR, not related to the mRNA abundance in the biological sample)


• Two reads are called copy-duplicates if they are mapped to the same start position in the genome (identify and count only one read)

• Copy duplicates can be generated only from the same sample
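A minimal sketch of the copy-duplicate rule above: within each sample, only the first read seen at a given start position is kept. The `MappedRead` record and its fields are hypothetical; a real pipeline would likely also take strand (and mate position for paired-end data) into account, which is an assumption beyond what the slides state.

```python
from collections import namedtuple

# Hypothetical record for a mapped read.
MappedRead = namedtuple("MappedRead", "name sample chrom start")


def remove_copy_duplicates(reads):
    """Keep one read per (sample, chromosome, start position)."""
    seen = set()
    kept = []
    for read in reads:
        key = (read.sample, read.chrom, read.start)
        if key not in seen:
            seen.add(key)
            kept.append(read)
    return kept


reads = [
    MappedRead("r1", "sample1", "chr1", 100),
    MappedRead("r2", "sample1", "chr1", 100),  # copy duplicate of r1
    MappedRead("r3", "sample2", "chr1", 100),  # different sample: not a duplicate
    MappedRead("r4", "sample1", "chr1", 250),
]
print([read.name for read in remove_copy_duplicates(reads)])  # prints ['r1', 'r3', 'r4']
```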


• Collect mapping statistics:

– Total reads that were attempted for mapping

– Total unique reads mapped

– Total ambiguous reads mapped

– Total copy duplicates

– Distribution of reads by mismatches/indels

– Total reads mapped to splice junctions

– GC-bias in mapped reads

– Depth of coverage

– 3’ end gene bias (more reads mapped to 3’ end)
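Two of the statistics listed above, depth of coverage and GC content, are easy to illustrate. The sketch below computes per-position depth from read start positions and lengths, and the GC fraction of a read sequence. The input layout (1-based start, read length) and the function names are assumptions for illustration.

```python
def coverage_depth(reference_length, reads):
    """Return the per-position depth for reads given as (1-based start, length)."""
    depth = [0] * reference_length
    for start, length in reads:
        first = start - 1  # convert to 0-based index
        last = min(first + length, reference_length)  # clip at the reference end
        for position in range(first, last):
            depth[position] += 1
    return depth


def gc_fraction(seq):
    """Fraction of G and C bases in a sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)


# Hypothetical reads on a reference of length 10.
depth = coverage_depth(10, [(1, 5), (3, 5), (8, 5)])
print(depth)                    # prints [1, 1, 2, 2, 2, 1, 1, 1, 1, 1]
print(sum(depth) / len(depth))  # mean depth of coverage
print(gc_fraction("GCAAT"))
```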

Counting Reads: HTSeq

Normalization

• Raw read count has to be normalized to enable comparison between samples

• RPKM: Reads Per Kilobase per Million mapped reads

• RPKM = the total number of raw reads mapped to a gene, divided by the length of the gene in kilobases and by the total number of mapped reads in millions

• Sometimes mappable length is used (since ambiguous reads are discarded, repeated regions within genes are not covered by reads)
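The RPKM definition above translates directly into code. In this sketch the function name and the example numbers are hypothetical; `gene_length_bp` can be either the full gene length or the mappable length, depending on which convention is used.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads Per Kilobase of gene per Million mapped reads."""
    length_kb = gene_length_bp / 1_000
    mapped_millions = total_mapped_reads / 1_000_000
    return read_count / length_kb / mapped_millions


# Hypothetical gene: 500 reads, 2,000 bp long, 10 million mapped reads in the sample.
print(rpkm(500, 2_000, 10_000_000))  # prints 25.0
```

Because both the gene length and the sequencing depth are divided out, the same gene with the same relative expression gives the same RPKM in a sample sequenced twice as deeply.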
