RNAseq Introduction - UMCP
biology.umd.edu/uploads/2/7/8/0/27804901/rnaseq.pdf
Bioinformatics Core
Why sequence RNA
• Functional studies
  – Drug treated vs untreated cell line
  – Wild type vs knock out
• SNP finding
• Transcriptome assembly
• Novel gene finding
• Splice variant analysis
Challenges
• Sampling
  – Purity? Quantity? Quality?
• Exons can be problematic
  – Mapping reads can become difficult
• RNA abundances vary by orders of magnitude
  – Highly expressed genes can overpower genes of interest
  – Organellar RNA can block overall signal
• RNA is fragile and must be properly handled
• RNA population turns over quickly within a cell
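To make the abundance point concrete, here is a minimal sketch that reports what share of all reads is taken up by the most abundant features. The gene names and counts are made-up toy values, not data from any experiment, and `top_fraction` is a hypothetical helper name.

```python
def top_fraction(counts, n=1):
    """Fraction of all reads assigned to the n most abundant features."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    top = sorted(counts.values(), reverse=True)[:n]
    return sum(top) / total


# Hypothetical toy counts spanning several orders of magnitude.
toy_counts = {
    "organellar_rRNA": 900_000,
    "geneA": 90_000,
    "geneB": 9_000,
    "gene_of_interest": 1_000,
}

print(f"top 1 feature: {top_fraction(toy_counts, 1):.2%} of reads")
print(f"top 2 features: {top_fraction(toy_counts, 2):.2%} of reads")
```

If one or two features soak up most of the reads, the remaining genes are sampled much more thinly than the total read count suggests.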
General workflows
• Obtain raw data
• Align/assemble reads
• Process alignment with a tool specific to the goal, e.g. ‘cufflinks’ or ‘sailfish’
• Post process
• Import into downstream software (R, Matlab, Cytoscape, etc.)
• Summarize and visualize
• Create gene lists, prioritize candidates for validation, etc.
Experimental Design Questions
• What is my biological question?
• How much sequencing do I need?
• What type of sequencing should I do?
  – Read length?
  – Which platform?
  – SE or PE?
• How much multiplexing can I do?
• Should I pool samples?
• How many replicates do I need?
• What about duplicates?
What are you working with?
• Novel – little or no data
• Some data – ESTs or Unigenes
• Basic Draft Genome
  – Few thousand contigs
  – Some annotation, mostly ab initio
• Good Draft Genome
  – Few thousand scaffolds to chromosome arms
  – Better annotations with human verification
• Model Organism
  – Fully sequenced genome
  – High confidence annotations
  – Genetic maps and markers
  – Mutant data available
Number of Reads/Replicates
[Figure, panel (a): Increase in biological replication significantly increases the number of DE genes identified.]
Liu Y et al. Bioinformatics 2014;30:301-304
Read Type and Platform
Read Type | Platform | Uses
50 SE | Illumina | Gene expression quantification; SNP finding (good reference)
50 PE | Illumina | Above plus splice variants
100+ PE | Illumina | Above plus transcriptome assembly; DE within gene families
200+ | Ion Torrent, Sanger, 454, Nanopore | Splice variants; transcriptome assembly; haplotypes; too large for DE
Multiplexing
• 6-8 nt barcodes added to samples during library prep.
• Allows for pooling of samples into the same lane.
  – Mitigate lane effects
  – Maximize sequencing efficiency
• Dual barcoding allows for up to 96 samples per lane.
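A minimal sketch of how barcodes let pooled reads be assigned back to samples. It assumes each read comes with its barcode (index) sequence as a separate string, tolerates one mismatch, and uses two made-up 6 nt barcodes. `assign_sample` is a hypothetical helper, not part of any sequencing vendor's software.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def assign_sample(index_read, barcodes, max_mismatches=1):
    """Return the sample whose barcode is the only match, else None."""
    hits = [
        name
        for name, barcode in barcodes.items()
        if hamming(index_read, barcode) <= max_mismatches
    ]
    # Ambiguous or unmatched index reads are left unassigned.
    return hits[0] if len(hits) == 1 else None


# Hypothetical barcodes for illustration only.
barcodes = {"sample1": "ACGTAC", "sample2": "TGCATG"}

for index_read in ("ACGTAC", "ACGTAA", "GGGGGG"):
    print(index_read, assign_sample(index_read, barcodes))
```

Allowing a mismatch rescues reads with a sequencing error in the barcode, which only works if the barcodes in a pool differ from each other at several positions.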
Replicates
• Biological
  – Measurement of variation between samples
  – More are better
  – Can detect genetic variation between samples
  – Pooling with barcodes – each sample is a replicate
  – Pooling without barcodes – each pool is a replicate
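One simple way to look at variation between biological replicates is the coefficient of variation (standard deviation divided by mean) of a gene's expression across replicates. The sketch below uses Python's standard `statistics` module and made-up values. It is an illustration of the idea, not the method used by any particular DE tool.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (needs >= 2 values)."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("mean is zero; coefficient of variation is undefined")
    return statistics.stdev(values) / mean


# Hypothetical normalized expression values for three biological replicates.
toy_expression = {
    "geneA": [100, 110, 90],  # replicates agree closely
    "geneB": [10, 40, 70],    # replicates disagree
}

for gene, values in toy_expression.items():
    print(gene, round(coefficient_of_variation(values), 2))
```

With a single replicate this quantity cannot be computed at all, which is one reason more biological replicates are better.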
Replicates
• Technical
  – Can determine variation within sample preparation.
  – Can be cost prohibitive.
  – More biological replicates are better.
  – Useful across lanes to mitigate lane effects.
Should I remove duplicates?
• Maybe…
  – Duplicates may correspond to biased PCR amplification of particular fragments
  – For highly expressed, short genes, duplicates are expected even if there is no amplification bias
  – Removing them may reduce the dynamic range of expression estimates
• Assess library complexity and decide…
• If you do remove them, assess duplicates at the level of paired-end reads, not single-end reads
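The last point can be sketched as follows: if duplicates are judged from one mate only, distinct fragments that happen to share a start position get collapsed. The tuples below are a made-up, simplified stand-in for aligned read pairs (chromosome, mate 1 start, mate 2 start), and `duplicate_fraction` is a hypothetical helper. Real duplicate marking also considers strand and other details.

```python
def duplicate_fraction(keys):
    """Fraction of records whose key repeats an earlier record."""
    keys = list(keys)
    if not keys:
        return 0.0
    return 1 - len(set(keys)) / len(keys)


# Hypothetical aligned pairs: (chromosome, mate 1 start, mate 2 start).
pairs = [
    ("chr1", 100, 250),
    ("chr1", 100, 250),  # both ends identical: likely a duplicate
    ("chr1", 100, 310),  # same mate 1 start, but a different fragment
    ("chr1", 500, 640),
]

paired_end = duplicate_fraction(pairs)
single_end = duplicate_fraction((chrom, start1) for chrom, start1, _ in pairs)

print("paired-end duplicate fraction:", paired_end)
print("single-end duplicate fraction:", single_end)
```

In this toy case the single-end view overstates duplication, because it cannot tell the third pair apart from the first two.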
Processing RNA for Sequencing
• Depends upon what you’re looking to achieve.
• mRNA is the main target
• PolyA Selection
  – Oligo-dT beads
  – Highly efficient at getting mRNA and depleting the rRNA
  – Can’t be used with non-polyA RNA
• miRNA kits as well…
Strand Specific Sequencing
• Illumina prep that ligates adaptors to 5’ and 3’ ends of RNA prior to cDNA reverse transcription
• Having strand information makes mapping more straightforward.
• Can identify antisense transcripts
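A minimal sketch of how strand information separates sense from antisense signal. It assumes the strand reported for each read is the strand of the original RNA molecule. Some stranded protocols report the opposite strand, in which case the read strand would need to be flipped first. `classify_read` is a hypothetical helper.

```python
VALID_STRANDS = {"+", "-"}


def classify_read(read_strand, gene_strand):
    """Label a read overlapping a gene as 'sense' or 'antisense'."""
    if read_strand not in VALID_STRANDS or gene_strand not in VALID_STRANDS:
        raise ValueError("strands must be '+' or '-'")
    return "sense" if read_strand == gene_strand else "antisense"


# Hypothetical reads overlapping a gene annotated on the '+' strand.
for read_strand in ("+", "-"):
    print(read_strand, classify_read(read_strand, "+"))
```

Without strand information both reads would simply be counted toward the gene, and the antisense transcript would go unnoticed.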
Alignment Options
• No Genome?! No Problem!
  – Transcriptome assembly
  – There will be redundancy
• NCBI Unigene Set
  – Not necessarily complete
  – Good to identify highly expressed genes
• Valid Transcripts from your organism
  – Easy to use but may miss novel genes
• Fully Sequenced and Annotated Genome
  – No excuses, this better be a Nature paper!
Mapping RNAseq Reads
• How many mismatches will you allow?
  – Depends on what you’re mapping and what you’re using for a reference.
• Number of hits allowed?
  – How many times can a read match in different locations?
• Splice Junctions?
  – Is your mapping tool “splice aware”?
• Expected distance for PE reads?
  – This is important to know so that read pairs can map properly.
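The mismatch, hit-count, and pair-distance questions above can be sketched as simple filters. The `Hit` record and all threshold values are assumptions chosen for illustration. Real aligners expose these settings through their own options and report them in their own formats.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One candidate alignment for a read (hypothetical record)."""
    chrom: str
    start: int
    mismatches: int


def passes_filters(hits, max_mismatches=2, max_hits=1):
    """Keep a read if 1..max_hits locations survive the mismatch cutoff."""
    good = [h for h in hits if h.mismatches <= max_mismatches]
    return 1 <= len(good) <= max_hits


def pair_distance_ok(mate1_start, mate2_end, expected=300, tolerance=100):
    """Check that the outer distance of a read pair is near the expected value."""
    outer = mate2_end - mate1_start
    return abs(outer - expected) <= tolerance


unique_read = [Hit("chr1", 1000, 1)]
multi_read = [Hit("chr1", 1000, 0), Hit("chr5", 42000, 0)]

print(passes_filters(unique_read))             # one acceptable location
print(passes_filters(multi_read))              # two locations, only one allowed
print(passes_filters(multi_read, max_hits=2))  # multi-mapping permitted
print(pair_distance_ok(1000, 1320))
```

For reads mapped to a genome rather than to transcripts, the distance between mates can span an intron, which is one reason a splice-aware tool matters.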
RNAseq “Best Practices”
• Platform
  – Illumina HiSeq
• Read Length
  – Minimum of 50 bp; 100 bp is better
• Paired-end or Single
  – PE
• Read Depth
  – 30-40 million/sample
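The read depth target translates directly into how many samples can share a lane. The sketch below uses the 30-40 million reads per sample figure from this slide. The lane yield of 200 million reads is an assumed placeholder, so substitute the number quoted by your sequencing facility.

```python
def samples_per_lane(lane_reads, reads_per_sample):
    """Whole number of samples that fit in one lane at the target depth."""
    if reads_per_sample <= 0:
        raise ValueError("reads_per_sample must be positive")
    return lane_reads // reads_per_sample


LANE_READS = 200_000_000  # assumed lane yield, for illustration only

for target in (30_000_000, 40_000_000):
    print(f"{target:,} reads/sample: {samples_per_lane(LANE_READS, target)} samples per lane")
```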
RNAseq “Best Practices”
• Number of biological replicates
  – 3 or more as cost allows
• Experimental Design
  – Balanced Block
• What type of alignment
  – TopHat – Highly confident and splice aware
• Unique or Multiple mapping
  – Unique
  – 70-90% mapping rate
• Analysis Method
  – Use more than one approach
  – Know the limits of the experiment
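A balanced block layout can be sketched as putting one replicate of every condition on each lane, so that a lane effect hits all conditions equally. The condition names and the one-lane-per-replicate layout below are assumptions for illustration, and `balanced_blocks` is a hypothetical helper.

```python
def balanced_blocks(conditions, n_replicates):
    """Assign one replicate of every condition to each lane (block)."""
    return {
        f"lane{rep}": [f"{condition}_rep{rep}" for condition in conditions]
        for rep in range(1, n_replicates + 1)
    }


# Hypothetical two-condition experiment with three biological replicates.
layout = balanced_blocks(["treated", "untreated"], 3)

for lane, samples in layout.items():
    print(lane, samples)
```

The design to avoid is the opposite one, with all treated samples on one lane and all untreated samples on another, where lane and treatment effects cannot be told apart.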