Download - Bioinformatics Summer School 2014 Konstantin Okonechnikov ...bioinformaticsinstitute.ru/sites/default/files/07... · Quality control of High Throughput Sequencing Data HTS is a complex

Bioinformatics Summer School 2014

Konstantin OkonechnikovMax Planck Institute For Infection

Biology

Quality Control of High Throughput Sequencing Data

Летняя Школа Биоинформатики 2014

If we lived in a perfect world...

Meanwhile in the real world...

Quality control of High Throughput Sequencing Data

● HTS is a complex technology; it is prone to biases and errors

● Errors might lead to wrong conclusions

● Understanding biases and limitations is critical for analysis of HTS data and inference

● Bioinformatics methods exist to detect biases

● Bias handling is technolgy-specific

● Experimental design is extremely important

A bit of nomenclature (reminder)

● Basepairs = bp = основания

● Sequencing = секвенирование

● Short reads = короткие риды

● Alignment = выравнивание

● Assembly = сборка

● Coverage = покрытие

● GC content = GC-состав

● Others: BAM/SAM format, WGS, RNA-seq, ChIP-seq

Illumina sequencing overview

Source: http://res.illumina.com/documents/products/techspotlights/techspotlight_sequencing.pdf

Sources of errors and biases

● DNA preparation: biological contamination, biased fragment selection

● PCR amplfiction: GC-content shift, fragment duplication, adapter contamination

● Sequencing: base substitutions and indels

● Techonology specific biases: RNA-seq, ChIP-seq etc.

● Analysis errors: algorithm errors, inadequate model, human errors

Detecting biases: FastQC

● Input: raw read analysis

● Output: interactive GUI, HTML report

● Metrics:

– Per base statistics

– Per base quality profile

– Per sequence ACGTN content

– Sequence length distribution

– Duplicate sequences

– Overrepresented sequences, adapters, kmer content

● Link: www.bioinformatics.babraham.ac.uk/projects/fastqc

Detecting biases: QualiMap

● Input: BAM file, optionally genomic regions in GTF/GFF format

● Output: interactive GUI, HTML and PDF report

● Metrics:

– Summary statistics of alignment (coverage, ACGT, insert size, mapping quality, mismatches and indels etc.)

– Coverage across reference and various histograms

– Duplication rate

– Homopolymer indels

– Mapping quality plots

– Insert size plots

– Mapped reads GC-content and distribution

● Link: http://qualimap.bioinfo.cipf.es/

Read errors

Illumina read profile: quality decreases towards 3' end

Typical errors rates:Substitutions: 0.1 — 0.3 % Indels: ~10E-5

Based on: http://genomebiology.com/2011/12/11/R112

Read errors

● Cut 3' prime end

● Remove reads with bad quality

– Empirical rule: keep only reads that have more than ⅔ of reads Q> 30

● Tools: FastX, Cutadapt, trimmomatic

● What about other platforms?

– 454 : homopolymers

– PacBio: increased error rate (up to 20%)

GC-content problems

GC-content distribution: compare to theoretical

GC-content problems

Compare and normalize according to expected distribution

Fragment duplication

● Duplicates can be removed using picard tools.

Contaminations

● Biological contamination

– Map and remove reads (bacterial, rRNA, etc)

● Adapter contamination

– Solution: cut adapters or remove reads containing them (cutadapt, scythe, trimmomatic)

Alignment analysis: descritpive statistics

● Read metrics: mapped, paired, chimeric, singletons.

● Mismatch and indel count

● Coverage

● Mapping quality

● Insert size

Hint:

use Qualimap

Alignment analysis: coverage

● Coverage histogram

Alignment analysis: insert size

● Insert size histogram (QualiMap, picard tools)

Multisample Analysis

● Detection of outlier in group of sequences: clustering and PCA

RNA-seq specific

● (Not so) random hexamer primers

RNA-seq specific

● Counts analysis: sequencing saturation

RNA-seq specific

● Counts analysis: feature distribution

ChIP-seq specific

Analysis errors

● Statistical model and simulations

● Self checks:

– Visualizations

– Edge conditions

● Published data examination (ENCODE)

● Method cross-checks

Conclusions

● HTS is prone to random errors and systematic biases

● Quality control is critical for analysis

● There are tools available for detection and removal of QC-relatated problems

● Additional QC analysis should be performed based on problem (genome assembly, SNP calling, etc) and technology (RNA-seq, ChiP-seq, <your choise>-seq...)

Tools for QC and EDA

Estimating quality metrics

● FastQC http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

● QualiMap http://qualimap.bioinfo.cipf.es/

Removing errors and cleaning reads

● FastX http://hannonlab.cshl.edu/fastx_toolkit/

● Cutadapt https://code.google.com/p/cutadapt/

● Trimmomatic http://www.usadellab.org/cms/?page=trimmomatic

● Picard-tools http://picard.sourceforge.net/

Technology specific quality control

● RNA-seq: Rnaseq-QC, RSeqQC

● ChIP-seq: CEAS, Repitools

● Genome Assembly: Quast

Thank you for your attention!

Спасибо за внимание!