RNAseq Introduction - biology.umd.edu
Bioinformatics Core
RNAseq Introduction
Ian Misner, Ph.D.
Bioinformatics Crash Course
Many types of RNA
• rRNA, tRNA, mRNA, miRNA, ncRNA, etc.
• Only ~2% of total RNA is mRNA
Why sequence RNA
• Functional studies
  – Drug-treated vs. untreated cell line
  – Wild type vs. knockout
• SNP finding
• Transcriptome assembly
• Novel gene finding
• Splice variant analysis
Challenges
• Sampling
  – Purity? Quantity? Quality?
• Exons can be problematic
  – Mapping reads across exon boundaries can become difficult
• RNA abundances vary by orders of magnitude
  – Highly expressed genes can overpower genes of interest
  – Organellar RNA can block overall signal
• RNA is fragile and must be properly handled
• The RNA population turns over quickly within a cell
General workflows
• Obtain raw data
• Align/assemble reads
• Process alignment with a tool specific to the goal
  – e.g. ‘cufflinks’, ‘sailfish’
• Post process
  – Import into downstream software (R, Matlab, Cytoscape, etc.)
• Summarize and visualize
  – Create gene lists, prioritize candidates for validation, etc.
Experimental Design Questions
• What is my biological question?
• How much sequencing do I need?
• What type of sequencing should I do?
  – Read length?
  – Which platform?
  – Single-end (SE) or paired-end (PE)?
• How much multiplexing can I do?
• Should I pool samples?
• How many replicates do I need?
• What about duplicates?
What are you working with?
• Novel – little or no data
• Some data – ESTs or Unigenes
• Basic Draft Genome
  – Few thousand contigs
  – Some annotation, mostly ab initio
• Good Draft Genome
  – Few thousand scaffolds to chromosome arms
  – Better annotations with human verification
• Model Organism
  – Fully sequenced genome
  – High confidence annotations
  – Genetic maps and markers
  – Mutant data available
Number of Reads/Replicates
[Figure, panel (a): Increase in biological replication significantly increases the number of DE genes identified. Source: Liu Y et al. Bioinformatics 2014;30:301-304]
Read Type and Platform
Read Type | Platform | Uses
50 SE | Illumina | Gene expression quantification; SNP finding (good reference)
50 PE | Illumina | Above, plus splice variants
100+ PE | Illumina | Above, plus transcriptome assembly; DE within gene families
200+ | Ion Torrent, Sanger, 454, Nanopore | Splice variants; transcriptome assembly; haplotypes; too large for DE
Read Platform
Purdue University Discovery Park
Multiplexing
• 6-8 nt barcodes added to samples during library prep.
• Allows for pooling of samples into the same lane.
  – Mitigate lane effects
  – Maximize sequencing efficiency
• Dual barcoding allows for up to 96 samples per lane.
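To illustrate how barcodes let pooled reads be separated again, here is a minimal Python sketch. The barcode sequences, sample names, and the one-mismatch tolerance are hypothetical choices for illustration, and the sketch assumes each read arrives with its index (barcode) sequence already split off. In practice this step is usually handled by the sequencing facility's own software.

```python
from collections import defaultdict


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def assign_sample(index_seq, barcodes, max_mismatches=1):
    """Return the sample name if exactly one barcode matches within tolerance, else None."""
    hits = [
        sample
        for sample, barcode in barcodes.items()
        if len(barcode) == len(index_seq)
        and hamming(index_seq, barcode) <= max_mismatches
    ]
    return hits[0] if len(hits) == 1 else None


def demultiplex(reads, barcodes, max_mismatches=1):
    """Group (index_seq, read_seq) pairs by sample; unassigned reads are stored under None."""
    bins = defaultdict(list)
    for index_seq, read_seq in reads:
        bins[assign_sample(index_seq, barcodes, max_mismatches)].append(read_seq)
    return dict(bins)


# Hypothetical 6 nt barcodes for two samples pooled into one lane.
BARCODES = {"treated": "ACGTAC", "untreated": "TGCATG"}
READS = [
    ("ACGTAC", "GGGAAATTT"),  # exact match to "treated"
    ("ACGTAA", "CCCTTTAAA"),  # one mismatch from "treated"
    ("TGCATG", "AAACCCGGG"),  # exact match to "untreated"
    ("NNNNNN", "TTTGGGCCC"),  # matches nothing, left unassigned
]
BINS = demultiplex(READS, BARCODES)
```

Requiring a unique match within the tolerance is one conservative choice; barcodes that differ from each other at several positions make ambiguous assignments less likely.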
Replicates
• Biological
  – Measurement of variation between samples
  – More are better
  – Can detect genetic variation between samples
  – Pooling with barcodes: each sample is a replicate
  – Pooling without barcodes: each pool is a replicate
Replicates
• Technical
  – Can determine variation within sample preparation.
  – Can be cost prohibitive.
  – More biological replicates are better.
  – Useful across lanes to mitigate lane effects.
Should I remove duplicates?
• Maybe…
  – Duplicates may correspond to biased PCR amplification of particular fragments
  – For highly expressed, short genes, duplicates are expected even if there is no amplification bias
  – Removing them may reduce the dynamic range of expression estimates
• Assess library complexity and decide…
• If you do remove them, assess duplicates at the level of paired-end reads, not single-end reads
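The difference between assessing duplicates at the single-end and paired-end level can be sketched in a few lines of Python. The alignment tuples below are made-up data in a hypothetical (chromosome, strand, mate 1 start, mate 2 start) layout, and the key definitions are a simplification of what dedicated duplicate-marking tools do.

```python
from collections import Counter


def duplicate_fraction(keys):
    """Fraction of records whose key was already seen at least once."""
    keys = list(keys)
    if not keys:
        return 0.0
    unique = len(Counter(keys))
    return (len(keys) - unique) / len(keys)


def single_end_key(pair):
    """Key that looks only at mate 1: chromosome, strand, start position."""
    chrom, strand, start1, _start2 = pair
    return (chrom, strand, start1)


def paired_end_key(pair):
    """Key that uses both mates, so the whole fragment must coincide."""
    return pair


# Hypothetical aligned read pairs: (chrom, strand, mate1_start, mate2_start).
PAIRS = [
    ("chr1", "+", 100, 300),
    ("chr1", "+", 100, 300),  # same fragment as above: a likely true duplicate
    ("chr1", "+", 100, 350),  # same mate 1 start, different fragment
    ("chr1", "+", 100, 420),  # same mate 1 start, different fragment
    ("chr2", "-", 500, 250),
]

SE_RATE = duplicate_fraction(single_end_key(p) for p in PAIRS)
PE_RATE = duplicate_fraction(paired_end_key(p) for p in PAIRS)
```

In this toy data the single-end key flags three of five pairs as duplicates, while the paired-end key flags only one, which is why the paired-end level is the safer basis for removal.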
Processing RNA for Sequencing
• Depends upon what you’re looking to achieve.
• mRNA is the main target
• PolyA Selection
  – Oligo-dT beads
  – Highly efficient at getting mRNA and depleting the rRNA
  – Can’t be used with non-polyA RNA
• miRNA kits as well…
Strand Specific Sequencing
• Illumina prep that ligates adaptors to 5’ and 3’ ends of RNA prior to cDNA reverse transcription
• Having strand information makes mapping more straightforward.
• Can identify antisense transcripts
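As a rough illustration of how strand information exposes antisense transcripts, the Python sketch below classifies a read relative to an annotated gene. The function names are hypothetical, and the `read_matches_transcript` flag is an assumption: whether the reported read strand equals the strand of the original RNA depends on the library protocol, so check your kit before relying on either setting.

```python
STRANDS = {"+", "-"}


def transcript_strand(read_strand, read_matches_transcript=True):
    """Infer the strand of the original RNA from the strand of the aligned read."""
    if read_strand not in STRANDS:
        raise ValueError("strand must be '+' or '-'")
    if read_matches_transcript:
        return read_strand
    # Protocols that sequence the complementary strand need the flip.
    return "-" if read_strand == "+" else "+"


def classify_read(read_strand, gene_strand, read_matches_transcript=True):
    """Label a read as 'sense' or 'antisense' relative to an annotated gene."""
    if gene_strand not in STRANDS:
        raise ValueError("strand must be '+' or '-'")
    origin = transcript_strand(read_strand, read_matches_transcript)
    return "sense" if origin == gene_strand else "antisense"
```

For example, under the default assumption `classify_read("-", "+")` labels a minus-strand read over a plus-strand gene as antisense; without strand-specific prep that read would simply be counted toward the gene.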
Insert Sizes
Alignment Options
• No Genome?! No Problem!
  – Transcriptome assembly
  – There will be redundancy
• NCBI Unigene Set
  – Not necessarily complete
  – Good to identify highly expressed genes
• Valid Transcripts from your organism
  – Easy to use but may miss novel genes
• Fully Sequenced and Annotated Genome
  – No excuses, this better be a Nature paper!
Mapping RNAseq Reads
• How many mismatches will you allow?
  – Depends on what you’re mapping and what you’re using for a reference.
• Number of hits allowed?
  – How many times can a read match in different locations?
• Splice Junctions?
  – Is your mapping tool “splice aware”?
• Expected distance for PE reads?
  – This is important to know so that read pairs can map properly.
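The mapping questions above correspond to simple filters, sketched here in Python. All thresholds (two mismatches, one allowed hit, a 300 bp expected fragment with 50 bp tolerance) are hypothetical defaults, and the pair check measures distance on the reference without accounting for introns, which a splice-aware mapper would have to handle.

```python
def count_mismatches(read, ref_window):
    """Count differing bases between a read and the equal-length reference window it maps to."""
    if len(read) != len(ref_window):
        raise ValueError("read and reference window must have equal length")
    return sum(r != g for r, g in zip(read, ref_window))


def passes_mismatch_filter(read, ref_window, max_mismatches=2):
    """True if the alignment has no more than the allowed number of mismatches."""
    return count_mismatches(read, ref_window) <= max_mismatches


def passes_hit_filter(n_hits, max_hits=1):
    """True if the read maps at least once and no more often than allowed."""
    return 1 <= n_hits <= max_hits


def is_proper_pair(left_start, right_end, expected_fragment=300, tolerance=50):
    """True if the distance spanned by both mates is close to the expected fragment size."""
    observed = right_end - left_start
    return abs(observed - expected_fragment) <= tolerance
```

For instance, `is_proper_pair(1000, 1300)` accepts a pair spanning 300 bp, while `is_proper_pair(1000, 5000)` rejects one spanning 4,000 bp; in RNAseq data such a wide span may instead indicate an intron between the mates.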
Why PE reads are great
[Figure labels: 2 Mismatches; Exact Match]
Purdue University Discovery Park
RNAseq Pipeline
TopHat
Cufflinks
Cuffcompare
Cuffdiff
cummeRbund
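One way to picture how the command-line stages of this pipeline chain together is the Python sketch below, which only builds the argument lists and does not run anything. The index name, file names, sample names, and output directories are hypothetical, and the options shown (`-p`, `-o`, `-r`, `-L`) are a minimal subset; consult each tool's manual for the settings your data needs. cummeRbund is an R package that reads the Cuffdiff output directory, so it is not represented here.

```python
def tophat_cmd(index, reads1, reads2, out_dir, threads=4):
    """Splice-aware alignment of one paired-end library."""
    return ["tophat", "-p", str(threads), "-o", out_dir, index, reads1, reads2]


def cufflinks_cmd(bam, out_dir, threads=4):
    """Transcript assembly from one set of alignments."""
    return ["cufflinks", "-p", str(threads), "-o", out_dir, bam]


def cuffcompare_cmd(reference_gtf, assemblies, out_prefix):
    """Compare the per-sample assemblies to a reference annotation."""
    return ["cuffcompare", "-r", reference_gtf, "-o", out_prefix] + list(assemblies)


def cuffdiff_cmd(transcripts_gtf, bams_by_condition, out_dir, threads=4):
    """Differential expression; replicates of one condition are comma-joined."""
    labels = ",".join(bams_by_condition)
    groups = [",".join(bams) for bams in bams_by_condition.values()]
    return ["cuffdiff", "-p", str(threads), "-o", out_dir,
            "-L", labels, transcripts_gtf] + groups


def build_pipeline(index, reference_gtf, samples):
    """samples maps condition -> list of (sample_name, reads1, reads2)."""
    commands, assemblies, bams = [], [], {}
    for condition, libraries in samples.items():
        for name, reads1, reads2 in libraries:
            tophat_dir = f"{name}_tophat"        # hypothetical directory names
            cufflinks_dir = f"{name}_cufflinks"
            bam = f"{tophat_dir}/accepted_hits.bam"
            commands.append(tophat_cmd(index, reads1, reads2, tophat_dir))
            commands.append(cufflinks_cmd(bam, cufflinks_dir))
            assemblies.append(f"{cufflinks_dir}/transcripts.gtf")
            bams.setdefault(condition, []).append(bam)
    commands.append(cuffcompare_cmd(reference_gtf, assemblies, "cuffcmp"))
    commands.append(cuffdiff_cmd("cuffcmp.combined.gtf", bams, "cuffdiff_out"))
    return commands


# Hypothetical two-condition experiment with one library each.
SAMPLES = {
    "treated": [("t1", "t1_1.fq", "t1_2.fq")],
    "untreated": [("u1", "u1_1.fq", "u1_2.fq")],
}
COMMANDS = build_pipeline("genome_index", "genes.gtf", SAMPLES)
```

Each list could be passed to `subprocess.run(cmd, check=True)` once the tools are installed; keeping command construction separate makes the plan easy to inspect before committing compute time.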
There are other options
Not all software is created equal
RNAseq “Best Practices”
• Platform
  – Illumina HiSeq
• Read Length
  – Minimum of 50 bp; 100 bp is better
• Paired-end or Single
  – PE
• Read Depth
  – 30-40 million reads/sample
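The read-depth target translates directly into how many samples can share a lane. The Python sketch below shows the arithmetic; the lane yield of 180 million reads is a hypothetical figure, so substitute the number quoted by your sequencing facility. The cap of 96 samples per lane reflects the dual-barcoding limit mentioned on the multiplexing slide.

```python
import math


def samples_per_lane(lane_yield_reads, reads_per_sample, max_barcodes=96):
    """How many barcoded samples fit in one lane at the target depth."""
    if lane_yield_reads < 0 or reads_per_sample <= 0:
        raise ValueError("yield must be >= 0 and reads per sample must be > 0")
    return min(lane_yield_reads // reads_per_sample, max_barcodes)


def lanes_needed(n_samples, lane_yield_reads, reads_per_sample, max_barcodes=96):
    """Number of lanes required to sequence every sample at the target depth."""
    per_lane = samples_per_lane(lane_yield_reads, reads_per_sample, max_barcodes)
    if per_lane == 0:
        raise ValueError("one lane cannot reach the target depth for a single sample")
    return math.ceil(n_samples / per_lane)


LANE_YIELD = 180_000_000  # hypothetical reads per lane

AT_30M = samples_per_lane(LANE_YIELD, 30_000_000)
AT_40M = samples_per_lane(LANE_YIELD, 40_000_000)
LANES_FOR_12 = lanes_needed(12, LANE_YIELD, 30_000_000)
```

Under that assumed yield, six samples fit per lane at 30 million reads each and four at 40 million, so a 12-sample experiment at the lower depth would need two lanes.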
RNAseq “Best Practices”
• Number of biological replicates
  – 3 or more as cost allows
• Experimental Design
  – Balanced Block
• What type of alignment
  – TopHat: highly confident and splice aware
• Unique or Multiple mapping
  – Unique
  – 70-90% mapping rate
• Analysis Method
  – Use more than one approach
  – Know the limits of the experiment
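One common reading of a balanced block design is that every lane (block) carries the same number of samples from each condition, so that lane effects are not confounded with treatment. The Python sketch below assigns replicates to lanes round-robin under that reading; the sample names and lane count are hypothetical.

```python
from collections import Counter


def balanced_blocks(samples_by_condition, n_lanes):
    """Spread each condition's replicates evenly across lanes (round-robin)."""
    if n_lanes <= 0:
        raise ValueError("n_lanes must be positive")
    lanes = [[] for _ in range(n_lanes)]
    for condition, samples in samples_by_condition.items():
        if len(samples) % n_lanes != 0:
            raise ValueError(
                f"{condition}: replicate count must be a multiple of the lane count"
            )
        for i, sample in enumerate(samples):
            lanes[i % n_lanes].append((condition, sample))
    return lanes


def lane_composition(lane):
    """Count how many samples of each condition a lane contains."""
    return Counter(condition for condition, _sample in lane)


# Hypothetical experiment: 3 biological replicates per condition, 3 lanes.
DESIGN = balanced_blocks(
    {"treated": ["T1", "T2", "T3"], "untreated": ["U1", "U2", "U3"]},
    n_lanes=3,
)
```

Here each lane ends up with one treated and one untreated sample. An alternative balanced layout is to barcode every sample and run the whole pool on every lane, which costs no extra library prep but splits each sample's depth across lanes.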