RNAseq Introduction - UMCP biology.umd.edu/uploads/2/7/8/0/27804901/rnaseq.pdf

Bioinformatics  Core

RNAseq Introduction

Ian Misner, Ph.D. Bioinformatics Crash Course

Many types of RNA

•  rRNA, tRNA, mRNA, miRNA, ncRNA, etc.
•  Only ~2% of total RNA is mRNA

Why sequence RNA

•  Functional studies
   – Drug-treated vs. untreated cell line
   – Wild type vs. knockout
•  SNP finding
•  Transcriptome assembly
•  Novel gene finding
•  Splice variant analysis

Challenges

•  Sampling
   – Purity? Quantity? Quality?
•  Exons can be problematic
   – Mapping reads can become difficult
•  RNA abundances vary by orders of magnitude
   – Highly expressed genes can overpower genes of interest
   – Organellar RNA can block overall signal
•  RNA is fragile and must be properly handled
•  RNA population turns over quickly within a cell

General workflows

•  Obtain raw data
•  Align/assemble reads
•  Process alignment with a tool specific to the goal
   – e.g. ‘cufflinks’, ‘sailfish’
•  Post-process
•  Import into downstream software (R, Matlab, Cytoscape, etc.)
•  Summarize and visualize
•  Create gene lists, prioritize candidates for validation, etc.

Experimental Design Questions

•  What is my biological question?
•  How much sequencing do I need?
•  What type of sequencing should I do?
   – Read length?
   – Which platform?
   – Single-end (SE) or paired-end (PE)?
•  How much multiplexing can I do?
•  Should I pool samples?
•  How many replicates do I need?
•  What about duplicates?

What are you working with?

•  Novel: little or no data
•  Some data: ESTs or Unigenes
•  Basic Draft Genome
   – Few thousand contigs
   – Some annotation, mostly ab initio
•  Good Draft Genome
   – Few thousand scaffolds to chromosome arms
   – Better annotations with human verification
•  Model Organism
   – Fully sequenced genome
   – High confidence annotations
   – Genetic maps and markers
   – Mutant data available

Number of Reads/Replicates

(a) Increase in biological replication significantly increases the number of DE genes identified.

Liu Y et al. Bioinformatics 2014;30:301-304

Read Type and Platform

Read Type   Platform                            Uses
50 SE       Illumina                            Gene expression quantification; SNP finding (good reference)
50 PE       Illumina                            Above plus splice variants
100+ PE     Illumina                            Above plus transcriptome assembly; DE within gene families
200+        Ion Torrent, Sanger, 454, Nanopore  Splice variants; transcriptome assembly; haplotypes; too large for DE

Read Platform

Purdue University Discovery Park

Multiplexing

•  6-8 nt barcodes added to samples during library prep.

•  Allows for pooling of samples into the same lane.
   – Mitigate lane effects
   – Maximize sequencing efficiency

•  Dual barcoding allows for up to 96 samples per lane.
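The trade-off between multiplexing and depth can be sketched with simple arithmetic. The lane yield below (200 million reads) is a hypothetical figure chosen only for illustration, and the function names are made up for this example; substitute the yield your sequencing facility quotes. The sketch also assumes barcoded samples are pooled in equal amounts.

```python
def reads_per_sample(lane_yield_reads: int, n_samples: int) -> float:
    """Expected reads per sample, assuming samples are pooled evenly in one lane."""
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    return lane_yield_reads / n_samples


def max_samples_per_lane(lane_yield_reads: int, target_reads: int) -> int:
    """Largest number of samples per lane that still reaches the target depth."""
    if target_reads < 1:
        raise ValueError("target_reads must be at least 1")
    return lane_yield_reads // target_reads


# Hypothetical lane yield; not a value taken from the slides.
LANE_YIELD = 200_000_000

print(reads_per_sample(LANE_YIELD, 8))               # depth with 8 samples pooled
print(max_samples_per_lane(LANE_YIELD, 30_000_000))  # samples that fit at 30 M reads each
```

In practice pooling is never perfectly even, so it is prudent to leave some headroom below the computed maximum.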

Replicates

•  Biological
   – Measurement of variation between samples
   – More are better
   – Can detect genetic variation between samples
   – Pooling with barcodes: each sample is a replicate
   – Pooling without barcodes: each pool is a replicate

Replicates

•  Technical
   – Can determine variation within sample preparation.
   – Can be cost prohibitive.
   – More biological replicates are better.
   – Useful across lanes to mitigate lane effects.

Should I remove duplicates?

•  Maybe…
   – Duplicates may correspond to biased PCR amplification of particular fragments
   – For highly expressed, short genes, duplicates are expected even if there is no amplification bias
   – Removing them may reduce the dynamic range of expression estimates
•  Assess library complexity and decide…
•  If you do remove them, assess duplicates at the level of paired-end reads, not single-end reads
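The difference between assessing duplicates per read and per read pair can be illustrated with a toy example. This is a minimal sketch, not how any particular duplicate-marking tool works: the `Pair` record is a hypothetical simplification that keeps only the chromosome and the leftmost position of each mate, and it ignores strand and soft-clipping.

```python
from collections import Counter
from typing import List, NamedTuple


class Pair(NamedTuple):
    """Hypothetical, simplified record for one aligned read pair."""
    chrom: str
    start1: int  # leftmost aligned position of mate 1
    start2: int  # leftmost aligned position of mate 2


def count_duplicates_single_end(pairs: List[Pair]) -> int:
    """Count duplicates using only mate 1, as if the data were single-end."""
    keys = Counter((p.chrom, p.start1) for p in pairs)
    return sum(n - 1 for n in keys.values())


def count_duplicates_paired_end(pairs: List[Pair]) -> int:
    """Count duplicates only when both mates start at the same positions."""
    keys = Counter((p.chrom, p.start1, p.start2) for p in pairs)
    return sum(n - 1 for n in keys.values())


pairs = [
    Pair("chr1", 100, 250),
    Pair("chr1", 100, 250),  # same fragment ends: a likely duplicate
    Pair("chr1", 100, 310),  # same mate-1 start, but a different fragment
    Pair("chr2", 500, 640),
]

print(count_duplicates_single_end(pairs))  # 2
print(count_duplicates_paired_end(pairs))  # 1
```

Keying on both mates avoids discarding distinct fragments that merely share one end, which is common for short, highly expressed genes.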

Processing RNA for Sequencing

•  Depends upon what you’re looking to achieve.
•  mRNA is the main target
•  PolyA Selection
   – Oligo-dT beads
   – Highly efficient at getting mRNA and depleting the rRNA
   – Can’t be used with non-polyA RNA

•  miRNA kits as well…

Strand Specific Sequencing

•  Illumina prep that ligates adaptors to 5’ and 3’ ends of RNA prior to cDNA reverse transcription

•  Having strand information makes mapping more straightforward.

•  Can identify antisense transcripts
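How strand information lets antisense transcription be recognized can be sketched as follows. The function name and the `read_is_transcript_strand` switch are hypothetical; whether a read reports the transcript strand or its complement depends on the library protocol, so that flag is an assumption you would set from your kit's documentation.

```python
VALID_STRANDS = {"+", "-"}


def classify_read(read_strand: str, gene_strand: str,
                  read_is_transcript_strand: bool = True) -> str:
    """Label a read as 'sense' or 'antisense' relative to an annotated gene."""
    if read_strand not in VALID_STRANDS or gene_strand not in VALID_STRANDS:
        raise ValueError("strands must be '+' or '-'")
    same = read_strand == gene_strand
    # If the protocol reports the complement of the transcript, flip the call.
    if not read_is_transcript_strand:
        same = not same
    return "sense" if same else "antisense"


print(classify_read("+", "+"))  # sense
print(classify_read("-", "+"))  # antisense
```

With an unstranded library the read strand carries no such information, so reads overlapping a gene cannot be split into sense and antisense this way.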

Insert Sizes

Alignment Options

•  No Genome?! No Problem!
   – Transcriptome assembly
   – There will be redundancy
•  NCBI Unigene Set
   – Not necessarily complete
   – Good to identify highly expressed genes
•  Valid transcripts from your organism
   – Easy to use but may miss novel genes
•  Fully Sequenced and Annotated Genome
   – No excuses, this better be a Nature paper!

Mapping RNAseq Reads

•  How many mismatches will you allow?
   – Depends on what you’re mapping and what you’re using for a reference.
•  Number of hits allowed?
   – How many times can a read match in different locations?
•  Splice junctions?
   – Is your mapping tool “splice aware”?
•  Expected distance for PE reads?
   – This is important to know so that read pairs can map properly.
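The mapping questions above (mismatches, number of hits, paired-end distance) can be made concrete with a toy example. This is only an illustrative sketch: it scans a short reference by brute force without gaps or splicing, whereas real aligners use indexes and are far more sophisticated. All names and thresholds here are hypothetical.

```python
from typing import List


def hamming(read: str, window: str) -> int:
    """Number of mismatching positions between a read and an equal-length window."""
    if len(read) != len(window):
        raise ValueError("read and window must have the same length")
    return sum(a != b for a, b in zip(read, window))


def find_hits(read: str, reference: str, max_mismatches: int) -> List[int]:
    """Start positions where the read aligns ungapped within the mismatch limit."""
    n = len(read)
    return [i for i in range(len(reference) - n + 1)
            if hamming(read, reference[i:i + n]) <= max_mismatches]


def is_proper_pair(pos1: int, pos2: int, read_len: int,
                   expected_fragment: int, tolerance: int) -> bool:
    """True if the fragment implied by the two mate starts is near the expected size."""
    fragment = abs(pos2 - pos1) + read_len
    return abs(fragment - expected_fragment) <= tolerance


reference = "ACGTACGTTTGACCA"
print(find_hits("ACGT", reference, 0))  # [0, 4]
print(find_hits("ACGT", reference, 2))  # [0, 4, 11]
print(is_proper_pair(100, 250, 50, 200, 25))  # True
```

Allowing more mismatches raises the number of candidate locations, which is why the mismatch limit and the number of hits allowed have to be chosen together.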

Why PE reads are great

[Figure: read-pair mapping illustration; labels "2 Mismatches" and "Exact Match"]

Purdue University Discovery Park

RNAseq Pipeline

TopHat → Cufflinks → Cuffcompare → Cuffdiff → CummeRbund
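How these stages chain together can be sketched by building the command lines without running them. This is a hedged illustration, not the author's exact pipeline: the index name, file names, sample labels and output directories are all hypothetical, and the options used (`-p`, `-o`, `-G`, `-r`, `-L`) should be checked against the manual for the tool versions you have installed. CummeRbund is an R package, so it appears only as a comment.

```python
import shlex
from typing import Dict, List, Tuple


def build_tuxedo_commands(index: str,
                          samples: Dict[str, List[Tuple[str, str, str]]],
                          annotation: str,
                          threads: int = 4) -> List[List[str]]:
    """Return the command lines for a TopHat/Cufflinks run (nothing is executed)."""
    commands: List[List[str]] = []
    transcript_gtfs: List[str] = []
    bams_by_condition: Dict[str, List[str]] = {}
    for condition, replicates in samples.items():
        for name, fq1, fq2 in replicates:
            bam = f"{name}_tophat/accepted_hits.bam"
            # Splice-aware alignment of one paired-end sample.
            commands.append(["tophat", "-p", str(threads), "-G", annotation,
                             "-o", f"{name}_tophat", index, fq1, fq2])
            # Transcript assembly and abundance estimation.
            commands.append(["cufflinks", "-p", str(threads),
                             "-o", f"{name}_cufflinks", bam])
            transcript_gtfs.append(f"{name}_cufflinks/transcripts.gtf")
            bams_by_condition.setdefault(condition, []).append(bam)
    # Compare the per-sample assemblies against the reference annotation.
    commands.append(["cuffcompare", "-r", annotation, "-o", "cuffcmp",
                     *transcript_gtfs])
    # Differential expression; replicates of one condition are comma-separated.
    commands.append(["cuffdiff", "-p", str(threads), "-o", "diff_out",
                     "-L", ",".join(bams_by_condition),
                     "cuffcmp.combined.gtf",
                     *(",".join(b) for b in bams_by_condition.values())])
    # The cuffdiff output directory would then be explored in R with CummeRbund.
    return commands


samples = {  # hypothetical sample sheet
    "control": [("C1", "C1_1.fq", "C1_2.fq"), ("C2", "C2_1.fq", "C2_2.fq")],
    "treated": [("T1", "T1_1.fq", "T1_2.fq"), ("T2", "T2_1.fq", "T2_2.fq")],
}
commands = build_tuxedo_commands("genome_index", samples, "genes.gtf")
for cmd in commands:
    print(shlex.join(cmd))
```

Printing the commands first makes it easy to review file names and options before committing compute time to the run.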

There are other options

Not all software is created equal

RNAseq “Best Practices”

•  Platform
   – Illumina HiSeq
•  Read Length
   – Minimum of 50 bp; 100 bp is better
•  Paired-end or Single
   – PE
•  Read Depth
   – 30-40 million reads/sample

RNAseq “Best Practices”

•  Number of biological replicates
   – 3 or more as cost allows
•  Experimental Design
   – Balanced Block
•  What type of alignment
   – TopHat: highly confident and splice aware
•  Unique or Multiple mapping
   – Unique
   – 70-90% mapping rate
•  Analysis Method
   – Use more than one approach
   – Know the limits of the experiment
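One simple reading of a balanced block layout is that each lane (block) receives one replicate of every condition, so lane effects are not confounded with treatment. The sketch below assumes that interpretation and equal replicate numbers per condition; the function and sample names are hypothetical. An alternative, when every sample is barcoded, is to run the full pool on every lane.

```python
from typing import Dict, List


def balanced_block_layout(replicates: Dict[str, List[str]]) -> List[List[str]]:
    """Assign one replicate of each condition to each lane."""
    counts = {len(names) for names in replicates.values()}
    if len(counts) != 1:
        raise ValueError("every condition needs the same number of replicates")
    n_lanes = counts.pop()
    return [[names[i] for names in replicates.values()] for i in range(n_lanes)]


layout = balanced_block_layout({
    "control": ["C1", "C2", "C3"],
    "treated": ["T1", "T2", "T3"],
})
for lane, lane_samples in enumerate(layout, start=1):
    print(f"lane {lane}: {lane_samples}")
```

An unbalanced design (for example, all controls in one lane and all treated samples in another) would make it impossible to separate a lane effect from the treatment effect.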