Errors, biases and Quality control
in Next Gen Sequencing
Dr David [email protected]
- Lab scientist : Bioinformatician
- RNA biologist
- small RNAs (miRNA)
Victor Chang Cardiac Research Institute, Sydney, Australia
Testing hypothesis and theories
Errors/Biases:
- Present in all experiments
- Be aware/informed
- Minimise
- Test
[Figure axis label: Data points]
HTS/NGS time line: 1994 to 2013
2009: ME!???
2013: You???
Next generation sequencing:
- Series of experiments
- Biases/error accumulate!
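Because the experiments run in series, per-step losses multiply. A minimal Python sketch, assuming each step can be summarised by a single fidelity between 0 and 1; the step values below are hypothetical and only illustrate the arithmetic.

```python
from math import prod

def cumulative_fidelity(step_fidelities):
    """Overall fidelity of a serial workflow: the product of per-step fidelities."""
    return prod(step_fidelities)

# Hypothetical per-step fidelities (assumed values, not measurements).
steps = {
    "sample preparation": 0.95,
    "library preparation": 0.95,
    "clonal amplification": 0.95,
    "sequencing": 0.95,
    "bioinformatics": 0.95,
}

overall = cumulative_fidelity(steps.values())
print(f"Overall fidelity after {len(steps)} steps: {overall:.3f}")
```

Even when every step looks good on its own, the combined result is noticeably worse than any single step.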
Anscombe’s Quartet
Image source: Wikipedia
• Maths is a tool for analysis.
• Summary statistics alone can leave you blind to biases and errors in data sets.
- mean, stdev, variance, correlation are the same!
Anscombe F.J (1973)
American Statistician
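The point of the quartet can be checked directly. The sketch below uses the four published data sets from Anscombe (1973) and computes the mean and variance of y and the Pearson correlation for each; the helper name `pearson` is ours.

```python
from math import sqrt
from statistics import mean, variance

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# One row per data set: mean(y), sample variance(y), correlation(x, y).
summary = {name: (mean(y), variance(y), pearson(x, y)) for name, (x, y) in quartet.items()}
for name, (m, v, r) in summary.items():
    print(f"{name:>3}: mean(y)={m:.2f}  var(y)={v:.2f}  r={r:.3f}")
```

The four rows agree to two decimal places, yet the scatter plots look nothing alike, which is why the data should be plotted before the statistics are trusted.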
Workflow:
High Throughput Sequencing
Sample preparation → Library preparation → Clonal amplification → Sequencing → Bioinformatics
Challenges:
Quantification
Purity
(1) Awareness
Community
Literature
Network;
(2) QC considerations
Time, Cost
Gels, Stains, Absorbance, Molarity, Titrations, Fluorescence
CPU, Cores, Scripts, Command line, RAM, Threads
Consumption, Throughput
Genes, Genome, SNPs
Sensitivity/specificity
Cumulative error
Quantification: Nanodrop spectrophotometer
http://www.nanodrop.com/Library/CVStech_17_11_FINAL.pdf
WARNING!
• Careful of accuracy < 50ng/ul
• Careful of concentrations > 1ug/ul
• Does not assess quality!!
* http://seqanswers.com/forums/showthread.php?t=21280
Contaminants:
230nm: EDTA, carbohydrates, sodium acetate*, tris*
270nm: Phenol (plus at 230nm*)
280nm: DTT
WARNING!
• Contaminants can impact on downstream enzymatic reactions
Ratios
260/280 : 1.8 (DNA) 2.0 (RNA)
260/270 : 1.2 – 1.3?
260/230 : 2.0 – 2.2
• Quick
• Consumes 1-2ul sample
• Large dynamic range (10 – 10,000ng/ul)
• Can identify contaminations
Solution: Re-precipitate/buffer exchange
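The ratio checks on this slide are easy to script. A minimal Python sketch, assuming absorbance readings at 230, 260 and 280 nm are already in hand; the function name and the tolerance of 0.1 are our assumptions, while the target ratios are the ones listed above.

```python
# Target ratios taken from the slide; the tolerance is an assumed value.
EXPECTED_260_280 = {"DNA": 1.8, "RNA": 2.0}
MIN_260_230 = 2.0

def assess_purity(a230, a260, a280, sample_type="DNA", tolerance=0.1):
    """Return (A260/280, A260/230, list of warnings) for one sample."""
    r280 = a260 / a280
    r230 = a260 / a230
    flags = []
    if abs(r280 - EXPECTED_260_280[sample_type]) > tolerance:
        flags.append(f"260/280 = {r280:.2f}, expected about {EXPECTED_260_280[sample_type]}")
    if r230 < MIN_260_230:
        flags.append(f"260/230 = {r230:.2f}, below {MIN_260_230}: possible contaminant absorbing at 230nm")
    return r280, r230, flags

# Hypothetical RNA reading with strong absorbance at 230nm.
print(assess_purity(a230=1.0, a260=1.0, a280=0.5, sample_type="RNA"))
```

A flagged sample is a candidate for re-precipitation or buffer exchange before library preparation.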
Quantification: Qubit fluorimeter
WARNING!
• Known biases in quantifying ssRNA < 50ng/ul
• Cannot quantitate ssDNA in presence of dsDNA
• More sensitive than Nanodrop
• Consumes small amount of sample
• Specific assays
Quantification: Agilent Bioanalyzer
• Consumes small amount of sample
• Quantification
• Estimating nucleic acid size
WARNING!
• Each chip has a quantitative range
• Sensitive to salts.
• Limitations on size range
• Not accurate quantitating broad smears
* RNA integrity number (RIN)
- Use at least 50ng for meaningful RIN
Schroeder et al (2006) BMC Mol Bio.
Quantitative range by chip and application:
- Total RNA* 5-500 ng/ul; mRNA 25-250 ng/ul
- Total RNA* 50-5000 pg/ul; mRNA 250-5000 pg/ul
- dsDNA 5-500 pg/ul (50-7000 bp)
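Concentration and fragment size together give molarity, which is what library pooling and loading usually need. A minimal Python sketch for dsDNA, assuming an average mass of 660 g/mol per base pair (a common approximation); the function name and the example values are hypothetical.

```python
def dsdna_nanomolar(conc_ng_per_ul, mean_length_bp, g_per_mol_per_bp=660.0):
    """Convert a dsDNA concentration (ng/ul) and mean fragment length (bp) to nM."""
    # ng/ul equals g/L x 1e-3; dividing by g/mol gives mol/L; x 1e9 gives nM.
    return conc_ng_per_ul * 1e6 / (g_per_mol_per_bp * mean_length_bp)

# Hypothetical library: 10 ng/ul with a 300 bp mean fragment size.
print(f"{dsdna_nanomolar(10, 300):.1f} nM")
```

An error in either the concentration or the size estimate carries straight through to the molarity, so both measurements matter.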
Criteria | RNA | DNA | QC
High complexity | Trizol vs column based | Phenol:chloroform vs column based | qPCR, Northern blotting??
High quality | RIN > 8 | Unfragmented | Bioanalyzer, gel electrophoresis
Accurate quantification | pg - ng - ug | pg - ng - ug | Qubit/Nanodrop, Agilent Bioanalyzer
Contamination (salts, organics) | A260/280 = 2, A260/230 > 2 | A260/280 = 1.8, A260/230 > 2 | Qubit, Nanodrop
Enrichment | Deplete ribosomes | Exome capture | qPCR/Agilent
Fragment | Uniform peaks better than broad | Uniform peaks better than broad | Agilent
GOAL: to have a final sample with high complexity
Sample Purification/Assessment/Processing
1) Library manual as provided by the manufacturer
2) http://nxseq.bitesizebio.com/articles/
Purification biases
miRNAs -141, -29b, -21, -106b, -15a, -34a: decreased in cells grown at low confluence/loss of adhesion
Library prep + sequence
Cell number: (L) = 200,000; (H) = 800,000
1mL Trizol
Kim et al., (2011)Molecular Cell 43, 1005-1014
Cell number: Low = 500,000; High = 800,000
[Figure axis label: Ratio]
Kim et al., (2012)Molecular Cell 46, 893-895
• Small RNAs precipitate with longer RNA
• Most susceptible: low GC content, secondary structure
Ligation biases
- Enzyme
- Temperature
- Sequence
miRNA library biases
Hafner et al., (2011) “RNA-ligase-dependent biases in miRNA ….. cDNA libraries” RNA 17(9), 1-16
Input:
- 770 synthetic miRNAs
- 45 designed RNAs
Pool A = Equimolar
Pool B = 10 fold serial dilution
Reverse transcription bias
- Not a significant source of sequence specific biases
PCR bias
- Dilute 1:10000, 10 PCR cycles (5 x)
- No appreciable distortion!
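For an equimolar pool, every input RNA is expected to receive the same share of reads, so the deviation from that share is one way to express library bias. A minimal Python sketch, assuming read counts per RNA are already available; the function name, the pseudocount and the counts are hypothetical and are not taken from Hafner et al.

```python
import math

def log2_bias(counts, pseudocount=1.0):
    """log2(observed share / expected share) per RNA, expecting equal shares."""
    n = len(counts)
    total = sum(counts.values()) + pseudocount * n
    expected = 1.0 / n
    return {
        name: math.log2(((count + pseudocount) / total) / expected)
        for name, count in counts.items()
    }

# Hypothetical read counts for three synthetic RNAs pooled at equal molarity.
counts = {"synth-1": 12000, "synth-2": 900, "synth-3": 150}
for name, bias in log2_bias(counts).items():
    print(f"{name}: {bias:+.2f}")
```

Positive values mark RNAs that are over-represented in the library, negative values those that are under-represented.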
WARNING!
• Don’t compare NGS data sets from different library preps
• Be consistent with incubation times/temperatures
• Ross et al., Characterizing and measuring bias in sequence data. Genome Biology 2013
• Bragg et al., Shining a light on Dark sequencing characterising errors. PLoS Comp Biol 2013
• Loman et al., Performance comparison of benchtop HTS platforms. Nature Biotech 2012
• Quail et al., Tale of three NGS platforms. BMC Genomics, 2012
• Lam et al., Performance comparison of whole genome sequencing platforms. Nat Biotech 2012
Sequencing platforms
Ion Torrent
Illumina
Complete Genomics
Kapa Biosystems
Standard reagents
Flowcell/lane variations do occur
- Smaller than those observed between platforms
Raw sequencing files → Assessing sequence quality → Align (pipeline) → Assessing alignment data
The Basics:
File types: fastq, csfasta, qual, fasta, xsq
Header: Coordinates/other
Sequence: A T C G N/.
Quality values: Phred score
Numerical : 0         10        20        30        40
Phred+33  : !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHI
Phred+64  : @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefgh
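The encodings above map each quality character to a score by subtracting an offset from its ASCII code, and a Phred score Q corresponds to an error probability of 10^(-Q/10). A minimal Python sketch; the function names are ours.

```python
def decode_phred(quality, offset=33):
    """Convert a FASTQ quality string to a list of Phred scores."""
    return [ord(char) - offset for char in quality]

def error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)

# The same five scores written in both encodings.
print(decode_phred("!+5?I", offset=33))
print(decode_phred("@JT^h", offset=64))
print(error_probability(20))
```

Decoding a Phred+64 file with the Phred+33 offset inflates every score by 31, so it is worth confirming the encoding before any quality filtering.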
• Free Java utility that can assess QC metrics of HTS data sets.
- GUI
- Command line
- Can create HTML output
• Input formats: fastq (standard, gzip, colorspace, casava), SAM/BAM
Not all data sets require a full complement of green ticks!!
[Figure: quality score plot. Legend: very good, reasonable, poor; median, mean, 10%, 25%, 75%, 90%]
Identify adaptors and primers
Identifies if a subset of sequences have low quality
May identify cycles that are unreliable
Helps assess raw data files prior to mapping
- low quality data may cause incorrect alignments
- low quality data may incorrectly call variations
- Sequences with trailing adaptor sequences will not map
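Trailing adaptor sequence is removed before mapping. A minimal Python sketch of exact-match 3' adaptor trimming, assuming no sequencing errors inside the adaptor; the function name, the `min_overlap` default and the adaptor sequence are illustrative assumptions, so substitute the adaptor from your own library kit. Dedicated trimming tools also tolerate mismatches, which this sketch does not.

```python
def trim_3p_adaptor(read, adaptor, min_overlap=5):
    """Remove a 3' adaptor, whether complete or cut off at the read end."""
    # Case 1: the whole adaptor is present somewhere in the read.
    index = read.find(adaptor)
    if index != -1:
        return read[:index]
    # Case 2: the read ends part-way through the adaptor.
    longest = min(len(adaptor) - 1, len(read))
    for k in range(longest, min_overlap - 1, -1):
        if read.endswith(adaptor[:k]):
            return read[:-k]
    return read

ADAPTOR = "TGGAATTCTCGGGTGCCAAGG"   # illustrative adaptor sequence (assumption)
INSERT = "TGAGGTAGTAGGTTGTATAGTT"   # example 22 nt small RNA insert

print(trim_3p_adaptor(INSERT + ADAPTOR[:10], ADAPTOR))
```

Reads that become very short after trimming are usually discarded, since they map ambiguously.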
Aligners
Reference
- Choose a suitable reference.
- Include mitochondrial sequence
- Design a filter set to capture repeated sequences (rRNA, tRNA)
Be aware of the default options
- Accepted errors
- Multimappers
Different aligners can give different results.
Benchmarking short sequence mapping tools
Hatem et al (BMC Bioinformatics, 2013)
Assessing alignment data
Mapping statistics: Pass / Questionable
Alignment feature statistics
- Coverage
- Expression
- Discovery
Test
Filter raw data
- Filter
- Trim
Important!
• Know your mapping statistics
• Know what to expect from your data sets
• Test on existing data set
Include a filter
% mapped
% mapped at what length
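Both figures, % mapped overall and % mapped at each read length, can be tallied in one pass. A minimal Python sketch, assuming each read has already been reduced to a (length, mapped) pair; the function name and the toy input are hypothetical, and extracting these pairs from an alignment file is left to whichever tool you use.

```python
from collections import defaultdict

def mapping_stats(reads):
    """reads: iterable of (read_length, is_mapped) pairs.

    Returns (overall % mapped, {read_length: % mapped}).
    """
    total = defaultdict(int)
    mapped = defaultdict(int)
    for length, is_mapped in reads:
        total[length] += 1
        if is_mapped:
            mapped[length] += 1
    n_reads = sum(total.values())
    overall = 100.0 * sum(mapped.values()) / n_reads if n_reads else 0.0
    by_length = {length: 100.0 * mapped[length] / total[length] for length in sorted(total)}
    return overall, by_length

# Toy input: three 22 nt reads (two mapped) and one unmapped 18 nt read.
overall, by_length = mapping_stats([(22, True), (22, True), (22, False), (18, False)])
print(overall, by_length)
```

A drop in % mapped at particular lengths can point to untrimmed adaptors or to a class of sequence missing from the reference.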
Take home messages
Be familiar with existing data sets
• NGS is a collection of experiments
• Biases/errors can/will occur at all steps of a high throughput sequencing study
• QC measures should be applied at all steps of a high throughput sequencing study
• Don’t be alarmed, stay informed
miRNA sequencing profiling: miRspring
• Small (<2MB) HTML document that replicates the miRNA aligned sequencing data.
• Needs NO internet connectivity.
• Provides visualization of sequence data
• Reports on miRNA processing
• Complete transparency.
Humphreys D.T., and Suter C.M. Nucleic Acids Research 2013.
http://miRspring.victorchang.edu.au
microRNAs
• Small non-coding RNAs (22nt)
• Bind to 3’UTRs → decay and/or translational repression
• Biogenesis: Derived from longer stem loop precursors
miRspring reporting tools
i) 5’ isomiRs
ii) 3’ isomiRs
iii) Non-canonical
iv) Arm bias
v) miRNA length
vi) RNA editing (A → G, C → T)
miRspring
miRNA clusters
Mono-cistronic / Poly-cistronic
miRNA seed analysis
miR-196a UAGGUAGUUUCCUGUUGUUGGG (seed: AGGUAGU)
let-7a UGAGGUAGUAGGUUGUAUAGUUU (seed: GAGGUAG)
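The two seeds shown are nucleotides 2 to 8 of each mature sequence. A minimal Python sketch of seed extraction and grouping, assuming that 7-mer definition; the function names are ours and the sequences are the two from the slide.

```python
from collections import defaultdict

def seed(mature_sequence, start=2, end=8):
    """Return the seed: nucleotides start..end (1-based, inclusive) from the 5' end."""
    return mature_sequence[start - 1:end]

def group_by_seed(mirnas):
    """Map each seed to the names of the miRNAs that carry it."""
    groups = defaultdict(list)
    for name, sequence in mirnas.items():
        groups[seed(sequence)].append(name)
    return dict(groups)

mirnas = {
    "miR-196a": "UAGGUAGUUUCCUGUUGUUGGG",
    "let-7a": "UGAGGUAGUAGGUUGUAUAGUUU",
}
print(group_by_seed(mirnas))
```

Because the seed is counted from the 5' end, a 5' isomiR shifted by a single nucleotide carries a different seed.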
miRspring QC features
Sampling bias!
Tissue Atlas: Heart, Kidney, Liver, Lung, Ovary, Spleen, Testes, Thymus, Brain, Placenta
AGO IP: THP-1
ENCODE: HeLa S3, A549, Ag04450, Bj, Gm1287, H1hesc, HepG2, Huvec, K562, MCF7, NheK, Sknshra
• 73 miRspring documents
• 895 million sequence tags
• < 55 megabytes of disk space
miRspring reporting features
Top 100 miRNAs typically:
- 22nt long
- Good correlation with miRBase
miRspring provides a quick, easy way to analyse QC parameters of your data set
[Figure axis label: Centile rank]
Final points
• Many NGS protocols are well established.
- Worth understanding what variations/features are found in data sets.
• miRspring is a powerful tool to help you assess a data set
- Yes, it only examines one data set at a time.
- Provides complete transparency
- Allows ANYONE to examine a NGS data set.
Example miRspring documents can be found at http://miRspring.victorchang.edu.au
Acknowledgements
VCCRI
Cath Suter
Paul Young
Rupert Shuttleworth
Diane Fatkin
Monique Ohanian
Djordje Djordjevic
Chris Hayward
Kavitha Muthiah
Richard Harvey
Mirana Ramialison
Ashley Waardenberg
IT
Timothy Kersten
Pardeep Dhiman
Thomas Priess (VCCRI/ANU)
- Pardeep Patel
- Carly Hynes
- Tennille Sibbritt
- Jennifer Clancy
Matthias Hentze (EMBL)
Funding bodies
ARC
NHMRC
Viertel Charitable Foundation
Perpetual Trust