Obstacles and challenges in the analysis of microRNA sequencing data
(miRNA-Seq)
David Humphreys
Genomics core
Dr Victor Chang AC 1936-1991, Pioneering Cardiothoracic Surgeon and Humanitarian
The ABCs about miRNAs (Annotation, Biogenesis, Curation)
www.mirbase.org
• Mature fasta file
• Stem loop fasta file
• Gff (genome coordinate file)
miRNA-Seq applications
Read length covers entire mature transcript
Discovery
- Novel miRNAs
- Isoforms
- Biogenesis
iii) Non-canonical processing
iv) Strand selection
v) Length / non-template additions
Quantification
- Differentially expressed miRNAs
- Differential processing
Experimental design
• Sample selection
- Species, replicates
• RNA extraction
• Library preparation
Kim et al., (2011)Molecular Cell 43, 1005-1014
Low confluence = 500,000 cells
High confluence = 800,000 cells
Cell number: (L) = 200,000, (H) = 800,000
RNA extraction
Method                Column   Liquid   Bead
Prep time             ++       ++++     +++
miRNA purification    +++      ++++     ++++
Recovery              ++++     +++      +++
Limitations/pitfalls: Low input miRNA bias; Early protocols no miRNA; ???
Kim et al., (2012)Molecular Cell 46, 893-895
NO change!!
[Figure axis: Ratio 141/200c]
Down regulated miRNAs:
141, 29b , 21, 106b, 15a, 34a
• Most susceptible:
- Low GC content,
- Secondary structure
• Small RNAs precipitate with longer RNA
RNA quantification and integrity
NanoDrop, Qubit, Agilent
seqanswers.com/forums/showthread.php?t=21280
WARNING!
- Accuracy poor below 50 ng/µl
- Careful of concentrations > 1 µg/µl
WARNING!
- Known biases in quantifying ssRNA < 50 ng/µl
WARNING!
- Quantification only accurate in the defined range (read manual)
• Assays specific for DNA/RNA
• Quantitate size
• Can detect salt & other contaminants
[Figure: absorbance spectrum with 230, 260 and 280 nm marked]
Library prep kit comparison
Sample prep
Adaptor ligation
RT (Reverse Transcription)
PCR
P-miRNA-OH
i) Hybridisation
ii) Ligation
iii) Denaturation
Sequential ligation
# Hafner et al., (2011) RNA 17(9), 1-16
# Sequence
# Temperature
# Incubation times
# PCR cycles … OK
# Input amount
# pH, buffers/salts/ATP
Summary
• Sample selection
- Species, replicates
• RNA extraction
- Use same method for all preps
• Quantify (2 methods)
• Assess integrity
• Library preparation
- Consistent input
- Consistent ligation conditions (time/temperature)
- Use same kits
miRNA-Seq Bioinformatics
(Trim - ALIGN – Report)
Anscombe’s Quartet
• Maths is a tool for analysis.
• You can blindly ignore biases and errors in data sets.
- mean, stdev, variance, correlation are the same!
Image from Wikipedia: https://en.wikipedia.org/wiki/Anscombe%27s_quartet
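The point of the quartet can be checked numerically. The sketch below uses the published Anscombe (1973) values and plain Python; the helper names `mean` and `pearson` are illustrative only. All four data sets give nearly the same mean and correlation even though their plots look completely different.

```python
from math import sqrt

# Anscombe's quartet; sets I-III share the same x values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def mean(values):
    return sum(values) / len(values)


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))


# Summary statistics per data set: (mean of y, correlation of x and y).
SUMMARY = {name: (mean(y), pearson(x, y)) for name, (x, y) in QUARTET.items()}

for name, (m, r) in SUMMARY.items():
    print(f"set {name}: mean(y) = {m:.2f}, r = {r:.2f}")
```

The same habit applies to miRNA-Seq: look at the read distributions, not only at the summary numbers.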
Challenges
Multimappers
Mismatches
Aligners
Sharing data
• Length of a sequence read covers entire microRNA transcript
• Upstream bias will have impacts on analysis
Sample preparation
Sequencing
Library preparation
Clonal amplification
Bioinformatics
Normalisation
Differential expression
Feature counting
Visualisation
Choice of reference?
Genome                                  miRBase stem-loop
Better discovery                        Limited discovery
Possible incorrect/loss of mappings     Forced (biased) mapping
Slower, computationally restrictive?    Faster, less complicated
miR-486
Multi-mappers (1)
• miRBase does NOT ACCURATELY report number of times a read aligns to genome
• Multi-loci miRBase entries provide some information
[Figure: Human multi-mappers #. x-axis: number of mapped locations (0 to > 100); y-axis: number of miRs; labelled example: miR-486]
# Human miRBase entries mapped using bowtie aligner allowing all multi-mappers
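A histogram like this one can be built from any aligner report that lists every hit. The following is a minimal sketch, assuming the alignments have already been parsed into (read name, chromosome, position, strand) tuples; the function names and the toy records are hypothetical, apart from the two let-7f-5p loci, which are taken from the calibration example later in this talk.

```python
from collections import Counter


def locations_per_read(alignments):
    """Count distinct mapped locations for each read name."""
    unique_hits = set(alignments)  # ignore duplicated records
    return Counter(name for name, _chrom, _pos, _strand in unique_hits)


def multimapper_histogram(loc_counts):
    """Map 'number of mapped locations' -> 'number of reads/miRs'."""
    return Counter(loc_counts.values())


# Toy input: one two-locus miRNA and one hypothetical single-locus entry.
ALIGNMENTS = [
    ("hsa-let-7f-5p", "chr9", 94176353, "+"),
    ("hsa-let-7f-5p", "chrX", 53557246, "-"),
    ("toy-miR-single", "chr1", 1000, "+"),
]

PER_READ = locations_per_read(ALIGNMENTS)
HISTOGRAM = multimapper_histogram(PER_READ)
print(dict(PER_READ))
print(dict(HISTOGRAM))
```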
Example
Multi-mappers (2)
• Multi-mapping rate increases as read length decreases.
• What should the minimum miRNA read length be?
• Shortest length in miRBase is 17nt !
miR-133 family
miR-133a-1-3p uuugguccccuucaaccagcug
miR-133a-2-3p uuugguccccuucaaccagcug
miR-133b-1-3p uuugguccccuucaaccagcua
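The miR-133 sequences above show why short reads multi-map. A small sketch, assuming reads start at the 5' end and are shortened from the 3' end (the helper names are illustrative): the two family members differ only at their last base, so any read shorter than 22 nt cannot tell them apart.

```python
MIR_133A = "uuugguccccuucaaccagcug"
MIR_133B = "uuugguccccuucaaccagcua"


def mismatch_positions(a, b):
    """0-based positions at which two equal-length sequences differ."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


def distinguishable(a, b, read_length):
    """True if reads of this length (taken from the 5' end) differ."""
    return a[:read_length] != b[:read_length]


print(mismatch_positions(MIR_133A, MIR_133B))
for length in (22, 21, 17):
    print(length, distinguishable(MIR_133A, MIR_133B, length))
```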
• Where do you assign multi-loci counts?
- Assign to each position?
- Assign fraction to each position?
- Intelligently assign to a position?
- Ignore?
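Three of the options listed above are easy to compare side by side. This is a sketch only, assuming each read has already been reduced to the list of loci it aligns to; the function name, the strategy labels and the toy reads are hypothetical. The "intelligently assign" option is not implemented here because it needs extra evidence (for example uniquely mapping reads at each locus).

```python
from collections import defaultdict


def assign_counts(read_hits, strategy):
    """Turn per-read loci into per-locus counts.

    strategy: "each"     -> every locus gets a full count
              "fraction" -> every locus gets 1/n of a count
              "ignore"   -> multi-mapping reads are dropped
    """
    if strategy not in ("each", "fraction", "ignore"):
        raise ValueError(f"unknown strategy: {strategy}")
    counts = defaultdict(float)
    for loci in read_hits.values():
        if strategy == "ignore" and len(loci) > 1:
            continue
        weight = 1.0 / len(loci) if strategy == "fraction" else 1.0
        for locus in loci:
            counts[locus] += weight
    return dict(counts)


# Toy data: one read shared by the two miR-133a loci, one unique miR-133b read.
READ_HITS = {
    "read_1": ["mir-133a-1", "mir-133a-2"],
    "read_2": ["mir-133b"],
}

for name in ("each", "fraction", "ignore"):
    print(name, assign_counts(READ_HITS, name))
```

Note that "each" inflates the total number of counts, "fraction" preserves it, and "ignore" loses the miR-133a signal entirely.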
miR-133a
miR-133b
Mismatches
• Sequencing variants
i) Error in library prep
ii) Variants in reference genome
iii) Sequencer
• RNA editing
Type          Enzyme    Comment
A to I (G)    ADAR      Predominantly on pre-miRs
C to T        APOBEC    Not identified yet?
Chawla et al., (2014) Nucleic Acids Research, 42 (8): 5245–5255
Tomaselli et al., (2013) Int. J. Mol. Sci. 14, 22796-22816
Ohanian et al. (2013) BMC Genetics, 14:18
Aligners
• (Too) Many choices…
• Each aligner has a wide array of options with DIFFERENT default settings.
• Bowtie aligner provides error rate and multi-mapping control :
bowtie -p 4 -n 1 -l 21 --nomaqround -k 10 --best --strata --chunkmbs 256
-k 10: report up to 10 multi-mappers
-n 1 -l 21: allow 1 mismatch in a seed length of 21nt
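To keep aligner settings explicit and repeatable, the command can be assembled in a script. A minimal sketch: the option list is copied from the slide, while the `-S` (SAM output) flag, the index prefix and the file names are assumptions added here for illustration. Nothing is executed; the command is only built and printed.

```python
import shlex

# Options exactly as shown on the slide (bowtie 1).
BOWTIE_OPTIONS = [
    "-p", "4",          # threads
    "-n", "1",          # max mismatches in the seed
    "-l", "21",         # seed length
    "--nomaqround",
    "-k", "10",         # report up to 10 alignments per read
    "--best", "--strata",
    "--chunkmbs", "256",
]


def bowtie_command(index_prefix, fastq_path, sam_path):
    """Build the argument list; -S (SAM output) is an assumed addition."""
    return ["bowtie", *BOWTIE_OPTIONS, "-S", index_prefix, fastq_path, sam_path]


# Hypothetical index and file names.
CMD = bowtie_command("hypothetical_index", "reads.trimmed.fastq", "reads.sam")
print(shlex.join(CMD))
```

Passing `CMD` to `subprocess.run(CMD, check=True)` would run the alignment, provided bowtie and a matching index are installed.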
Fastq calibration dataset:
hsa-let-7f-5p_M_chr9_94176353_94176374_+#chrX_53557246_53557267_- 0 chr9 94176353 255 22M * 0 0 TGAGGTAGTAGATTGTATAGTT
Read name fields: miRNA ID, mapping location #1, mapping location #2
• Available for ALL species present in miRBase, features include:
i) Each header defines miRBase mapping location
ii) Contains all miRBase entries with all single nucleotide mismatches
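Because each read name carries its expected locations, a pipeline can be scored automatically. The sketch below assumes the header layout seen in the single example above (fields joined by "_", extra locations separated by "#"); the meaning of the "M" field is not stated on the slide, so it is carried through as an opaque tag. All function names are hypothetical.

```python
def parse_read_name(name):
    """Split a calibration read name into (miRNA ID, tag, locations)."""
    first, *others = name.split("#")
    head, start, end, strand = first.rsplit("_", 3)
    mirna_id, tag, chrom = head.split("_", 2)
    locations = [(chrom, int(start), int(end), strand)]
    for loc in others:
        c, s, e, st = loc.rsplit("_", 3)
        locations.append((c, int(s), int(e), st))
    return mirna_id, tag, locations


def is_expected(name, sam_chrom, sam_pos):
    """True if the reported alignment start is one of the expected loci."""
    _mirna, _tag, locations = parse_read_name(name)
    return any(c == sam_chrom and s == sam_pos for c, s, _e, _st in locations)


def single_mismatch_variants(seq, alphabet="ACGT"):
    """All sequences that differ from seq at exactly one position."""
    return [seq[:i] + b + seq[i + 1:]
            for i, base in enumerate(seq) for b in alphabet if b != base]


NAME = "hsa-let-7f-5p_M_chr9_94176353_94176374_+#chrX_53557246_53557267_-"
print(parse_read_name(NAME))
print(is_expected(NAME, "chr9", 94176353))
print(len(single_mismatch_variants("TGAGGTAGTAGATTGTATAGTT")))
```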
Non template additions (NTA)
i) Adenylation
ii) Uridylation
Koppers-Lalic et al., (2014), Cell Reports 8, 1649–1658
DETECTION METHODS:
• Aligners tend to soft-clip 3’ mismatches!!
• Remove adaptor
- Hard trim (18nt)
- Extend alignment.
- Look for mismatch clusters at end of read.
<miRNA seq> + (A)n
<miRNA seq> + (T)n
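The detection steps can be sketched without an aligner. This example assumes the adaptor has already been removed and that the read is compared directly with a known mature sequence (given as DNA) rather than with the genome; the function name and the 18 nt seed default mirror the slide but are otherwise illustrative. A templated base that happens to match the added base cannot be separated from a true addition by this approach.

```python
def classify_nta(read, template, seed=18):
    """Classify the 3' tail of a read relative to its template.

    Returns None if the first `seed` bases do not match, otherwise a
    (label, tail) tuple with label in none/adenylation/uridylation/other.
    """
    read, template = read.upper(), template.upper()
    if len(read) < seed or read[:seed] != template[:seed]:
        return None
    # Extend the match base by base beyond the seed.
    i = seed
    while i < min(len(read), len(template)) and read[i] == template[i]:
        i += 1
    tail = read[i:]
    if not tail:
        return ("none", "")
    if set(tail) == {"A"}:
        return ("adenylation", tail)
    if set(tail) == {"T"}:  # uridylation is read as T in sequencing data
        return ("uridylation", tail)
    return ("other", tail)


LET7F = "TGAGGTAGTAGATTGTATAGTT"
print(classify_nta(LET7F + "AA", LET7F))
print(classify_nta(LET7F + "T", LET7F))
print(classify_nta(LET7F, LET7F))
```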
Assigning miRNA counts
Mature miRNA analysis
i) 5’ isomirs
ii) 3’ isomirs
iii) Non canonical
iv) Arm switching
v) Length
vi) Editing
Cistronic analysis
Humphreys et al., 2013, NAR
miRspring
• Small (<2MB) HTML document that replicates the miRNA aligned sequencing data.
• Needs NO internet connectivity.
• Provides visualization of sequence data + research tools == complete transparency.
http://miRspring.victorchang.edu.au
Humphreys D.T., and Suter C.M. Nucleic Acids Research 2013.
Cumulative distribution of miRNA reads
Sampling bias!
TissueAtlas: Heart, Kidney, Liver, Lung, Ovary, Spleen, Testes, Thymus, Brain, Placenta
AGO IP: THP-1
ENCODE: HeLa S3, A549, Ag04450, Bj, Gm1287, H1hesc, HepG2, Huvec, K562, MCF7, NheK, Sknshra
• 73 miRspring documents
• 895 million sequence tags
• < 55 megabytes of disk space
In most cell lines and tissues the most abundant miRNA should comprise < 35% of all aligned miRNA sequences
OK ☺
Top 100 miRNAs typically:
- 22nt long
- Good correlation with miRBase
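The 35% guideline can be turned into a simple quality control check. A minimal sketch, assuming a dictionary of aligned read counts per miRNA; the function names and the toy counts are hypothetical, and the threshold is the rule of thumb quoted above rather than a hard limit.

```python
def cumulative_fractions(counts):
    """Cumulative fraction of reads, miRNAs ordered from most abundant."""
    total = sum(counts.values())
    running, out = 0, []
    for name, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        running += n
        out.append((name, running / total))
    return out


def passes_dominance_check(counts, max_fraction=0.35):
    """True if the most abundant miRNA is below max_fraction of all reads."""
    return max(counts.values()) / sum(counts.values()) < max_fraction


# Toy libraries: one balanced, one dominated by a single miRNA.
BALANCED = {"miR-a": 300, "miR-b": 250, "miR-c": 250, "miR-d": 200}
SKEWED = {"miR-a": 900, "miR-b": 100}

print(cumulative_fractions(BALANCED))
print(passes_dominance_check(BALANCED), passes_dominance_check(SKEWED))
```

A library that fails the check is worth inspecting for sampling bias before any differential expression analysis.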
Conclusions
• Many challenges in miRNA-seq analysis
• Multi-mappers
• Mismatches
• Best practices… be methodical
• Know the question you wish to address
• Know your species (reference/miRBase)
• Know your aligner
• Test your pipeline!
• Know what you are missing
• Quality control metrics/ visualisation
Joshua Ho
Peter Szot
Catherine Suter
Diane Fatkin
Thomas Priess
St Vincent’s Hospital
Chris Hayward
Kavitha
Andrew Jabbour
If you would like a miRBase test data set for any species/reference combination, please don’t hesitate to contact me.
miRspring.victorchang.edu.au
- Fastq synthetic data sets
- Intelligently assign multi-mappers
- R objects