RNAseq Introduction - biology.umd.edu

Bioinformatics Core

RNAseq Introduction

Ian Misner, Ph.D. Bioinformatics Crash Course

Many types of RNA

•  rRNA, tRNA, mRNA, miRNA, ncRNA, etc.
•  Only ~2% of total RNA is mRNA

Why sequence RNA

•  Functional studies
   – Drug treated vs untreated cell line
   – Wild type vs knock out
•  SNP finding
•  Transcriptome assembly
•  Novel gene finding
•  Splice variant analysis

Challenges

•  Sampling
   – Purity? Quantity? Quality?
•  Exons can be problematic
   – Mapping reads across exon boundaries can become difficult
•  RNA abundances vary by orders of magnitude
   – Highly expressed genes can overpower genes of interest
   – Organellar RNA can block overall signal
•  RNA is fragile and must be properly handled
•  RNA population turns over quickly within a cell

General workflows

•  Obtain raw data
•  Align/assemble reads
•  Process alignment with a tool specific to the goal
   – e.g. ‘cufflinks’, ‘sailfish’
•  Post process
   – Import into downstream software (R, Matlab, Cytoscape, etc.)
•  Summarize and visualize
   – Create gene lists, prioritize candidates for validation, etc.

Experimental Design Questions

•  What is my biological question?
•  How much sequencing do I need?
•  What type of sequencing should I do?
   – Read length?
   – Which platform?
   – Single-end (SE) or paired-end (PE)?
•  How much multiplexing can I do?
•  Should I pool samples?
•  How many replicates do I need?
•  What about duplicates?

What are you working with?

•  Novel – little or no data
•  Some data – ESTs or Unigenes
•  Basic Draft Genome
   – Few thousand contigs
   – Some annotation, mostly ab initio
•  Good Draft Genome
   – Few thousand scaffolds to chromosome arms
   – Better annotations with human verification
•  Model Organism
   – Fully sequenced genome
   – High confidence annotations
   – Genetic maps and markers
   – Mutant data available

Number of Reads/Replicates

(a) Increase in biological replication significantly increases the number of differentially expressed (DE) genes identified.

Liu Y et al. Bioinformatics 2014;30:301-304

Read Type and Platform

Read Type   Platform                             Uses
50 SE       Illumina                             Gene expression quantification; SNP finding (good reference)
50 PE       Illumina                             Above, plus splice variants
100+ PE     Illumina                             Above, plus transcriptome assembly; DE within gene families
200+        Ion Torrent, Sanger, 454, Nanopore   Splice variants; transcriptome assembly; haplotypes; too large for DE

Read Platform

Purdue University Discovery Park

Multiplexing

•  6-8 nt barcodes added to samples during library prep.

•  Allows for pooling of samples into the same lane.
   – Mitigate lane effects
   – Maximize sequencing efficiency

•  Dual barcoding allows for up to 96 samples per lane.
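
The multiplexing arithmetic can be sketched roughly as follows: the number of samples that fit in one lane is the lane's read yield divided by the reads wanted per sample, capped by the number of available barcodes. The lane yield used here (200 million reads) is an assumed, illustrative figure that does not come from these slides, and the function name is hypothetical; check the actual output of your platform.

```python
def samples_per_lane(lane_yield_reads, reads_per_sample, max_barcodes=96):
    """Return how many barcoded samples fit in one lane.

    lane_yield_reads: expected reads from one lane (assumed, platform-specific).
    reads_per_sample: target sequencing depth for each sample.
    max_barcodes: cap set by the barcoding scheme (96 with dual barcoding).
    """
    if reads_per_sample <= 0:
        raise ValueError("reads_per_sample must be positive")
    by_depth = lane_yield_reads // reads_per_sample
    return min(by_depth, max_barcodes)


# Assumed lane yield of 200 million reads; 40 million reads wanted per sample.
print(samples_per_lane(200_000_000, 40_000_000))  # 5
```

With shallow targets the barcode limit, not the depth, becomes the constraint.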

Replicates

•  Biological
   – Measurement of variation between samples
   – More are better
   – Can detect genetic variation between samples
   – Pooling with barcodes: each sample is a replicate
   – Pooling without barcodes: each pool is a replicate

Replicates

•  Technical
   – Can determine variation within sample preparation.
   – Can be cost prohibitive.
   – More biological replicates are better.
   – Useful across lanes to mitigate lane effects.

Should I remove duplicates?

•  Maybe…
   – Duplicates may correspond to biased PCR amplification of particular fragments
   – For highly expressed, short genes, duplicates are expected even if there is no amplification bias
   – Removing them may reduce the dynamic range of expression estimates
•  Assess library complexity and decide…
•  If you do remove them, assess duplicates at the level of paired-end reads, not single-end reads
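
A minimal sketch of why the paired-end level matters, assuming each aligned pair is reduced to a hypothetical tuple of (chromosome, strand, mate 1 start, mate 2 start). The data and the function name are made up for illustration; dedicated duplicate-marking tools do considerably more than this.

```python
from collections import Counter


def duplication_rate(keys):
    """Fraction of records that repeat an earlier record with the same key."""
    counts = Counter(keys)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    duplicates = total - len(counts)  # copies beyond the first of each key
    return duplicates / total


# Hypothetical alignments: mate 1 of the first three pairs starts at 100,
# but only two of those pairs also share the mate 2 position.
pairs = [
    ("chr1", "+", 100, 350),
    ("chr1", "+", 100, 350),
    ("chr1", "+", 100, 420),
    ("chr1", "-", 900, 700),
]
print(duplication_rate(pairs))  # 0.25

# Keying on mate 1 alone (single-end style) overcounts duplicates.
single_end_keys = [(chrom, strand, m1) for chrom, strand, m1, _ in pairs]
print(duplication_rate(single_end_keys))  # 0.5
```

The same four pairs look 50% duplicated when only one mate is considered, but 25% duplicated when both mate positions must agree.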

Processing RNA for Sequencing

•  Depends upon what you’re looking to achieve.
•  mRNA is the main target
•  PolyA Selection
   – Oligo-dT beads
   – Highly efficient at getting mRNA and depleting the rRNA
   – Can’t be used with non-polyA RNA
•  miRNA kits as well…

Strand Specific Sequencing

•  Illumina prep that ligates adaptors to 5’ and 3’ ends of RNA prior to cDNA reverse transcription

•  Having strand information makes mapping more straightforward.

•  Can identify antisense transcripts
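
As a rough illustration of how strand information exposes antisense transcripts, the sketch below compares a read's mapped strand with the annotated gene strand. It assumes the library preserves the original RNA orientation, so a read from the gene's own transcript maps to the gene's strand; other strand-specific protocols report the opposite strand, in which case the comparison must be flipped. The function name and the example data are hypothetical.

```python
from collections import Counter

VALID_STRANDS = {"+", "-"}


def classify_read(read_strand, gene_strand):
    """Label a read 'sense' or 'antisense' relative to an annotated gene.

    Assumes reads keep the orientation of the RNA they came from.
    """
    if read_strand not in VALID_STRANDS or gene_strand not in VALID_STRANDS:
        raise ValueError("strands must be '+' or '-'")
    return "sense" if read_strand == gene_strand else "antisense"


# Hypothetical reads overlapping a gene annotated on the '+' strand.
read_strands = ["+", "+", "-", "+", "-"]
tally = Counter(classify_read(s, "+") for s in read_strands)
print(tally["sense"], tally["antisense"])  # 3 2
```

Without strand information the two antisense reads would simply be counted toward the annotated gene.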

Insert Sizes

Alignment Options

•  No Genome?! No Problem!
   – Transcriptome assembly
   – There will be redundancy
•  NCBI Unigene Set
   – Not necessarily complete
   – Good to identify highly expressed genes
•  Valid Transcripts from your organism
   – Easy to use but may miss novel genes
•  Fully Sequenced and Annotated Genome
   – No excuses, this better be a Nature paper!

Mapping RNAseq Reads

•  How many mismatches will you allow?
   – Depends on what you’re mapping and what you’re using for a reference.
•  Number of hits allowed?
   – How many times can a read match in different locations?
•  Splice Junctions?
   – Is your mapping tool “splice aware”?
•  Expected distance for PE reads?
   – This is important to know so that read pairs can map properly.
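
The mismatch, hit-count, and mate-distance questions above can be made concrete with a toy example. This is a deliberately naive, ungapped, non-splice-aware scan over a made-up reference string, not how a real aligner works; all function names and values are hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def find_hits(read, reference, max_mismatches):
    """Every reference offset where the read aligns without gaps and
    with at most max_mismatches mismatches."""
    n = len(read)
    return [
        i
        for i in range(len(reference) - n + 1)
        if hamming(read, reference[i:i + n]) <= max_mismatches
    ]


def pair_is_proper(start1, start2, read_len, expected_inner, tolerance):
    """True if the gap between the two mates is close to the expected distance."""
    inner = start2 - (start1 + read_len)
    return abs(inner - expected_inner) <= tolerance


reference = "ACGTACGTTTACGAACGT"  # made-up sequence
print(find_hits("ACGT", reference, 0))  # [0, 4, 14]
print(find_hits("ACGT", reference, 1))  # [0, 4, 10, 14]
print(pair_is_proper(100, 350, 50, 200, 50))  # True
```

Allowing one mismatch adds a fourth candidate location for the same read, which is why the mismatch setting and the number of hits allowed have to be considered together.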

Why PE reads are great

[Figure panel labels: 2 Mismatches; Exact Match]

Purdue University Discovery Park

RNAseq Pipeline

TopHat → Cufflinks → Cuffcompare → Cuffdiff → CummeRbund
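
To show how the stages hand files to one another, here is a sketch that only builds the command lines for one sample and does not run anything. All file and directory names are hypothetical, and the options shown (`-o` for output, `-r` for a reference annotation) should be checked against the manuals for the installed versions. CummeRbund is an R package that reads the Cuffdiff output directory, so it has no command line here.

```python
def tuxedo_commands(genome_index, reads_1, reads_2, reference_gtf, other_bam):
    """Build (but do not execute) one command per pipeline stage."""
    tophat = ["tophat", "-o", "tophat_out", genome_index, reads_1, reads_2]
    cufflinks = ["cufflinks", "-o", "cufflinks_out",
                 "tophat_out/accepted_hits.bam"]
    cuffcompare = ["cuffcompare", "-r", reference_gtf, "-o", "cuffcmp",
                   "cufflinks_out/transcripts.gtf"]
    cuffdiff = ["cuffdiff", "-o", "cuffdiff_out", "cuffcmp.combined.gtf",
                "tophat_out/accepted_hits.bam", other_bam]
    return [tophat, cufflinks, cuffcompare, cuffdiff]


# Hypothetical inputs for a treated sample compared against a control BAM.
commands = tuxedo_commands(
    "genome_index", "treated_1.fastq", "treated_2.fastq",
    "genes.gtf", "control/accepted_hits.bam",
)
for cmd in commands:
    print(" ".join(cmd))
```

Keeping the commands as lists makes it easy to pass each one to `subprocess.run` later, once the paths and options have been verified.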

There are other options

Not all software is created equal

RNAseq “Best Practices”

•  Platform
   – Illumina HiSeq
•  Read Length
   – Minimum of 50 bp; 100 bp is better
•  Paired-end or Single
   – PE
•  Read Depth
   – 30-40 million reads/sample

RNAseq “Best Practices”

•  Number of biological replicates
   – 3 or more as cost allows
•  Experimental Design
   – Balanced Block
•  What type of alignment
   – TopHat: highly confident and splice aware
•  Unique or Multiple mapping
   – Unique
   – 70-90% mapping rate
•  Analysis Method
   – Use more than one approach
   – Know the limits of the experiment
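
One way to check a run against the 70-90% figure is to tally how many locations each read mapped to. The sketch below assumes a hypothetical list of per-read hit counts (0 = unmapped, 1 = unique, more than 1 = multi-mapped); in practice these numbers would come from the aligner's own summary output.

```python
def mapping_summary(hit_counts):
    """Return the unique, multi-mapped, and unmapped fractions of reads."""
    total = len(hit_counts)
    if total == 0:
        raise ValueError("no reads to summarize")
    unique = sum(1 for h in hit_counts if h == 1)
    multi = sum(1 for h in hit_counts if h > 1)
    unmapped = total - unique - multi
    return {
        "unique_rate": unique / total,
        "multi_rate": multi / total,
        "unmapped_rate": unmapped / total,
    }


# Hypothetical hit counts for ten reads.
summary = mapping_summary([1, 1, 1, 0, 2, 1, 1, 1, 3, 1])
print(summary["unique_rate"])  # 0.7
print(0.70 <= summary["unique_rate"] <= 0.90)  # True
```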
