Bioinformatics Summer School 2014
Konstantin OkonechnikovMax Planck Institute For Infection
Biology
Quality Control of High Throughput Sequencing Data
Летняя Школа Биоинформатики 2014
If we lived in a perfect world...
Meanwhile in the real world...
Quality control of High Throughput Sequencing Data
● HTS is a complex technology; it is prone to biases and errors
● Errors might lead to wrong conclusions
● Understanding biases and limitations is critical for analysis of HTS data and inference
● Bioinformatics methods exist to detect biases
● Bias handling is technolgy-specific
● Experimental design is extremely important
A bit of nomenclature (reminder)
● Basepairs = bp = основания
● Sequencing = секвенирование
● Short reads = короткие риды
● Alignment = выравнивание
● Assembly = сборка
● Coverage = покрытие
● GC content = GC-состав
● Others: BAM/SAM format, WGS, RNA-seq, ChIP-seq
Illumina sequencing overview
Source: http://res.illumina.com/documents/products/techspotlights/techspotlight_sequencing.pdf
Illumina sequencing overview
Source: http://res.illumina.com/documents/products/techspotlights/techspotlight_sequencing.pdf
Illumina sequencing overview
Source: http://res.illumina.com/documents/products/techspotlights/techspotlight_sequencing.pdf
Sources of errors and biases
● DNA preparation: biological contamination, biased fragment selection
● PCR amplfiction: GC-content shift, fragment duplication, adapter contamination
● Sequencing: base substitutions and indels
● Techonology specific biases: RNA-seq, ChIP-seq etc.
● Analysis errors: algorithm errors, inadequate model, human errors
Detecting biases: FastQC
● Input: raw read analysis
● Output: interactive GUI, HTML report
● Metrics:
– Per base statistics
– Per base quality profile
– Per sequence ACGTN content
– Sequence length distribution
– Duplicate sequences
– Overrepresented sequences, adapters, kmer content
● Link: www.bioinformatics.babraham.ac.uk/projects/fastqc
Detecting biases: QualiMap
● Input: BAM file, optionally genomic regions in GTF/GFF format
● Output: interactive GUI, HTML and PDF report
● Metrics:
– Summary statistics of alignment (coverage, ACGT, insert size, mapping quality, mismatches and indels etc.)
– Coverage across reference and various histograms
– Duplication rate
– Homopolymer indels
– Mapping quality plots
– Insert size plots
– Mapped reads GC-content and distribution
● Link: http://qualimap.bioinfo.cipf.es/
Read errors
Illumina read profile: quality decreases towards 3' end
Typical errors rates:Substitutions: 0.1 — 0.3 % Indels: ~10E-5
Based on: http://genomebiology.com/2011/12/11/R112
Read errors
● Cut 3' prime end
● Remove reads with bad quality
– Empirical rule: keep only reads that have more than ⅔ of reads Q> 30
● Tools: FastX, Cutadapt, trimmomatic
● What about other platforms?
– 454 : homopolymers
– PacBio: increased error rate (up to 20%)
GC-content problems
GC-content distribution: compare to theoretical
GC-content problems
Compare and normalize according to expected distribution
Fragment duplication
● Duplicates can be removed using picard tools.
Contaminations
● Biological contamination
– Map and remove reads (bacterial, rRNA, etc)
● Adapter contamination
– Solution: cut adapters or remove reads containing them (cutadapt, scythe, trimmomatic)
Alignment analysis: descritpive statistics
● Read metrics: mapped, paired, chimeric, singletons.
● Mismatch and indel count
● Coverage
● Mapping quality
● Insert size
Hint:
use Qualimap
Alignment analysis: coverage
● Coverage histogram
Alignment analysis: insert size
● Insert size histogram (QualiMap, picard tools)
Multisample Analysis
● Detection of outlier in group of sequences: clustering and PCA
RNA-seq specific
● (Not so) random hexamer primers
RNA-seq specific
● Counts analysis: sequencing saturation
RNA-seq specific
● Counts analysis: feature distribution
ChIP-seq specific
Analysis errors
● Statistical model and simulations
● Self checks:
– Visualizations
– Edge conditions
● Published data examination (ENCODE)
● Method cross-checks
Conclusions
● HTS is prone to random errors and systematic biases
● Quality control is critical for analysis
● There are tools available for detection and removal of QC-relatated problems
● Additional QC analysis should be performed based on problem (genome assembly, SNP calling, etc) and technology (RNA-seq, ChiP-seq, <your choise>-seq...)
Tools for QC and EDA
Estimating quality metrics
● FastQC http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
● QualiMap http://qualimap.bioinfo.cipf.es/
Removing errors and cleaning reads
● FastX http://hannonlab.cshl.edu/fastx_toolkit/
● Cutadapt https://code.google.com/p/cutadapt/
● Trimmomatic http://www.usadellab.org/cms/?page=trimmomatic
● Picard-tools http://picard.sourceforge.net/
Technology specific quality control
● RNA-seq: Rnaseq-QC, RSeqQC
● ChIP-seq: CEAS, Repitools
● Genome Assembly: Quast
Thank you for your attention!
Спасибо за внимание!
Top Related