Bioinformatics Summer School 2014 Konstantin Okonechnikov...

Post on 02-Aug-2020

Bioinformatics Summer School 2014

Konstantin Okonechnikov

Max Planck Institute for Infection Biology

Quality Control of High Throughput Sequencing Data

If we lived in a perfect world...

Meanwhile in the real world...

Quality control of High Throughput Sequencing Data

● HTS is a complex technology; it is prone to biases and errors

● Errors might lead to wrong conclusions

● Understanding biases and limitations is critical for analysis of HTS data and inference

● Bioinformatics methods exist to detect biases

● Bias handling is technology-specific

● Experimental design is extremely important

A bit of nomenclature (reminder)

● Basepairs = bp = основания

● Sequencing = секвенирование

● Short reads = короткие риды

● Alignment = выравнивание

● Assembly = сборка

● Coverage = покрытие

● GC content = GC-состав

● Others: BAM/SAM format, WGS, RNA-seq, ChIP-seq

Illumina sequencing overview

Source: http://res.illumina.com/documents/products/techspotlights/techspotlight_sequencing.pdf

Sources of errors and biases

● DNA preparation: biological contamination, biased fragment selection

● PCR amplification: GC-content shift, fragment duplication, adapter contamination

● Sequencing: base substitutions and indels

● Technology-specific biases: RNA-seq, ChIP-seq, etc.

● Analysis errors: algorithm errors, inadequate model, human errors

Detecting biases: FastQC

● Input: raw reads (FASTQ)

● Output: interactive GUI, HTML report

● Metrics:

– Per base statistics

– Per base quality profile

– Per sequence ACGTN content

– Sequence length distribution

– Duplicate sequences

– Overrepresented sequences, adapters, k-mer content

● Link: www.bioinformatics.babraham.ac.uk/projects/fastqc
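As an illustration of the "per base quality profile" metric, here is a minimal Python sketch that computes the mean Phred score at each read position. It is not FastQC's implementation; it assumes uncompressed four-line FASTQ records with Phred+33 quality encoding, and the function names are hypothetical.

```python
import io

def parse_fastq(handle):
    """Yield (name, sequence, quality_string) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield header.rstrip("\n")[1:], seq, qual

def per_base_mean_quality(records, offset=33):
    """Mean Phred score at each read position (Phred+33 encoding is assumed)."""
    totals, counts = [], []
    for _, _, qual in records:
        for i, ch in enumerate(qual):
            if i == len(totals):  # first read that reaches this position
                totals.append(0)
                counts.append(0)
            totals[i] += ord(ch) - offset
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]

# Two toy reads: 'I' encodes Q40 and '5' encodes Q20 under Phred+33.
example = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACG\n+\nII5\n")
profile = per_base_mean_quality(parse_fastq(example))
print(profile)  # [40.0, 40.0, 30.0, 40.0]
```

A profile that drops steadily towards the last positions is the typical Illumina pattern discussed under "Read errors".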

Detecting biases: QualiMap

● Input: BAM file, optionally genomic regions in GTF/GFF format

● Output: interactive GUI, HTML and PDF report

● Metrics:

– Summary statistics of alignment (coverage, ACGT, insert size, mapping quality, mismatches and indels etc.)

– Coverage across reference and various histograms

– Duplication rate

– Homopolymer indels

– Mapping quality plots

– Insert size plots

– Mapped reads GC-content and distribution

● Link: http://qualimap.bioinfo.cipf.es/

Read errors

Illumina read profile: quality decreases towards 3' end

Typical error rates:

– Substitutions: 0.1 to 0.3 %

– Indels: ~10E-5

Based on: http://genomebiology.com/2011/12/11/R112
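To relate these error rates to the quality scores reported by the sequencer, the standard Phred definition Q = -10 * log10(p) can be used. The small Python sketch below (function names are hypothetical) converts in both directions; a substitution rate of 0.1 % corresponds to Q30.

```python
import math

def phred_to_error_prob(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score corresponding to an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)

for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))

# Substitution rates quoted above, expressed as Phred scores.
print(round(error_prob_to_phred(0.001), 1))  # 30.0
print(round(error_prob_to_phred(0.003), 1))  # 25.2
```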

Read errors

● Trim the 3' end

● Remove reads with bad quality

– Empirical rule: keep only reads in which more than ⅔ of the bases have Q > 30

● Tools: FastX, Cutadapt, Trimmomatic

● What about other platforms?

– 454 : homopolymers

– PacBio: increased error rate (up to 20%)
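The two Illumina clean-up steps above (3' trimming and the ⅔ rule) can be sketched in a few lines of Python. This is only an illustration, not how FastX, Cutadapt or Trimmomatic work internally; Phred+33 encoding and the trimming threshold of Q20 are assumptions, and the function names are hypothetical.

```python
def phred_scores(qual, offset=33):
    """Decode a quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in qual]

def trim_3prime(seq, qual, min_q=20, offset=33):
    """Drop trailing bases whose quality is below min_q (threshold is an assumption)."""
    end = len(qual)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]

def passes_two_thirds_rule(qual, min_q=30, offset=33):
    """True if strictly more than 2/3 of the bases have quality above min_q."""
    scores = phred_scores(qual, offset)
    if not scores:
        return False
    high = sum(1 for q in scores if q > min_q)
    return 3 * high > 2 * len(scores)  # integer form of high / n > 2/3

# 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
print(trim_3prime("ACGTAC", "IIII##"))   # ('ACGT', 'IIII')
print(passes_two_thirds_rule("IIII##"))  # False: exactly 2/3, not more
print(passes_two_thirds_rule("IIIII#"))  # True: 5 of 6 bases above Q30
```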

GC-content problems

GC-content distribution: compare to theoretical

GC-content problems

Compare and normalize according to expected distribution
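A minimal Python sketch of the comparison: compute the GC fraction of each read, bin the fractions into a histogram, and measure how far it is from an expected distribution. The expected histogram is something you supply (for example, from the reference genome); the uniform one used here is only a placeholder, and the total variation distance is one possible measure chosen for simplicity.

```python
def gc_fraction(seq):
    """Fraction of G and C among called bases (N and other symbols are ignored)."""
    seq = seq.upper()
    called = sum(seq.count(b) for b in "ACGT")
    if called == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / called

def gc_histogram(reads, bins=10):
    """Normalized histogram of per-read GC fractions over equal-width bins."""
    counts = [0] * bins
    for read in reads:
        idx = min(int(gc_fraction(read) * bins), bins - 1)  # GC = 1.0 goes to the last bin
        counts[idx] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * bins
    return [c / total for c in counts]

def total_variation(observed, expected):
    """Total variation distance between two normalized histograms (0 = identical)."""
    return 0.5 * sum(abs(o - e) for o, e in zip(observed, expected))

reads = ["AAAA", "GGCC", "ATGC", "ATGC"]
observed = gc_histogram(reads, bins=4)
expected = [0.25, 0.25, 0.25, 0.25]  # placeholder, not a real theoretical curve
print(observed)                             # [0.25, 0.0, 0.5, 0.25]
print(total_variation(observed, expected))  # 0.25
```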

Fragment duplication

● Duplicates can be removed using Picard tools.
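The general idea behind position-based duplicate marking can be sketched as follows: alignments that start at the same reference position on the same strand are treated as copies of one fragment, and only the best-scoring one is kept. This is a simplified illustration and not Picard's actual algorithm; the record layout (dicts with `name`, `chrom`, `pos`, `strand`, `score`) is hypothetical.

```python
from collections import defaultdict

def mark_duplicates(alignments):
    """Return the names of alignments considered duplicates.

    Alignments sharing (chrom, pos, strand) form one group; the member with
    the highest score is kept and the rest are reported as duplicates.
    """
    groups = defaultdict(list)
    for aln in alignments:
        groups[(aln["chrom"], aln["pos"], aln["strand"])].append(aln)

    duplicates = set()
    for group in groups.values():
        best = max(group, key=lambda a: a["score"])  # first one wins on ties
        duplicates.update(a["name"] for a in group if a is not best)
    return duplicates

alignments = [
    {"name": "r1", "chrom": "chr1", "pos": 100, "strand": "+", "score": 120},
    {"name": "r2", "chrom": "chr1", "pos": 100, "strand": "+", "score": 150},
    {"name": "r3", "chrom": "chr1", "pos": 100, "strand": "-", "score": 90},
    {"name": "r4", "chrom": "chr1", "pos": 250, "strand": "+", "score": 110},
]
print(mark_duplicates(alignments))  # {'r1'}
```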

Contaminations

● Biological contamination

– Map reads to contaminant references (bacterial, rRNA, etc.) and remove those that align

● Adapter contamination

– Solution: cut adapters or remove reads containing them (cutadapt, scythe, trimmomatic)
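Adapter cutting can be illustrated with a small exact-match sketch in Python: remove the adapter and everything after it, or a partial adapter hanging off the 3' end of the read. Real tools such as Cutadapt and Trimmomatic also tolerate mismatches, which this sketch does not; the `min_overlap` value and the function name are assumptions.

```python
def trim_adapter(read, adapter, min_overlap=3):
    """Remove a 3' adapter from a read using exact matching only.

    1. If the full adapter occurs in the read, cut from its first occurrence.
    2. Otherwise, if the read ends with a prefix of the adapter that is at
       least min_overlap bases long, cut that prefix.
    """
    idx = read.find(adapter)
    if idx != -1:
        return read[:idx]
    longest = min(len(adapter) - 1, len(read))
    for k in range(longest, min_overlap - 1, -1):  # try the longest overlap first
        if read.endswith(adapter[:k]):
            return read[:-k]
    return read

adapter = "AGATCGGAAGAGC"  # start of a commonly used Illumina adapter sequence
print(trim_adapter("ACGTACGTAGATCGGAAGAGCTTT", adapter))  # ACGTACGT
print(trim_adapter("ACGTACGTAGATC", adapter))             # ACGTACGT
print(trim_adapter("ACGTACGT", adapter))                  # ACGTACGT
```

A small `min_overlap` removes more adapter remnants but also trims some genuine bases that match the adapter start by chance.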

Alignment analysis: descriptive statistics

● Read metrics: mapped, paired, chimeric, singletons.

● Mismatch and indel count

● Coverage

● Mapping quality

● Insert size

Hint: use QualiMap

Alignment analysis: coverage

● Coverage histogram
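A coverage histogram can be sketched in Python from alignment intervals alone: compute the depth at every reference position, then count how many positions have each depth. The intervals are assumed to be 0-based and half-open, and the function names are hypothetical; in practice QualiMap produces this from the BAM file.

```python
from collections import Counter

def per_base_coverage(ref_length, intervals):
    """Depth at each reference position from 0-based half-open (start, end) intervals."""
    diff = [0] * (ref_length + 1)
    for start, end in intervals:
        start, end = max(start, 0), min(end, ref_length)  # clip to the reference
        if start < end:
            diff[start] += 1  # coverage rises where an alignment starts
            diff[end] -= 1    # and falls just past its last base
    coverage, depth = [], 0
    for i in range(ref_length):
        depth += diff[i]
        coverage.append(depth)
    return coverage

def coverage_histogram(coverage):
    """Map each depth value to the number of positions with that depth."""
    return dict(sorted(Counter(coverage).items()))

coverage = per_base_coverage(10, [(0, 5), (3, 8), (4, 6)])
print(coverage)                       # [1, 1, 1, 2, 3, 2, 1, 1, 0, 0]
print(coverage_histogram(coverage))   # {0: 2, 1: 5, 2: 2, 3: 1}
print(sum(coverage) / len(coverage))  # mean coverage: 1.2
```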

Alignment analysis: insert size

● Insert size histogram (QualiMap, Picard tools)
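As a rough sketch of what such a histogram summarizes, the Python snippet below bins template lengths (the TLEN field of SAM/BAM records) and reports the median. Only positive values are used so that each pair is counted once, since the two mates of a pair carry TLEN with opposite signs; the bin width and function name are assumptions.

```python
import statistics
from collections import Counter

def insert_size_summary(template_lengths, bin_width=50):
    """Median and binned histogram of insert sizes from SAM TLEN values."""
    sizes = [t for t in template_lengths if t > 0]  # one mate per pair
    if not sizes:
        return {"median": None, "histogram": {}}
    histogram = Counter((s // bin_width) * bin_width for s in sizes)
    return {
        "median": statistics.median(sizes),
        "histogram": dict(sorted(histogram.items())),  # keys are bin start values
    }

tlens = [300, -300, 310, -310, 250, -250, 0, 1200, -1200]
summary = insert_size_summary(tlens)
print(summary["median"])     # 305.0
print(summary["histogram"])  # {250: 1, 300: 2, 1200: 1}
```

Isolated bins far from the main peak, such as the 1200 bp pair here, are the kind of signal worth inspecting in the real plot.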

Multisample Analysis

● Detection of outliers in a group of samples: clustering and PCA
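Clustering and PCA are normally run with a statistics package. As a simpler stand-in that captures the same intuition, the Python sketch below flags a sample whose profile correlates poorly with the others: it computes pairwise Pearson correlations and reports samples whose median correlation falls below a threshold. This is a correlation screen, not PCA; the threshold of 0.8 and the toy values are assumptions.

```python
import math
import statistics

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists (0.0 if undefined)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def flag_outliers(samples, threshold=0.8):
    """Names of samples whose median correlation to all other samples is below threshold."""
    flagged = []
    for name, values in samples.items():
        others = [pearson(values, v) for n, v in samples.items() if n != name]
        if others and statistics.median(others) < threshold:
            flagged.append(name)
    return flagged

# Toy per-feature values (e.g. counts or coverage) for four samples.
samples = {
    "s1": [10, 20, 30, 40],
    "s2": [11, 19, 31, 42],
    "s3": [9, 21, 29, 39],
    "s4": [40, 10, 35, 5],
}
print(flag_outliers(samples))  # ['s4']
```

The median is used rather than the mean so that one bad sample does not drag down the score of every good sample.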

RNA-seq specific

● (Not so) random hexamer primers

RNA-seq specific

● Counts analysis: sequencing saturation
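Sequencing saturation can be illustrated by subsampling: take increasing fractions of the reads and count how many distinct features (for example, genes) are detected at each depth. If the curve flattens, deeper sequencing would add few new features. The sketch below is a minimal Python version; the input layout (one feature name per assigned read), the fractions and the fixed seed are assumptions.

```python
import random

def saturation_curve(read_features, fractions=(0.25, 0.5, 0.75, 1.0), seed=0):
    """Return (reads_sampled, features_detected) for each sampling fraction.

    read_features holds one feature name per read. Reads are shuffled once,
    so each larger subsample contains the smaller ones.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    shuffled = list(read_features)
    rng.shuffle(shuffled)
    curve = []
    for fraction in fractions:
        n = int(len(shuffled) * fraction)
        curve.append((n, len(set(shuffled[:n]))))
    return curve

reads = ["geneA"] * 50 + ["geneB"] * 30 + ["geneC"] * 20
curve = saturation_curve(reads)
for n_reads, n_features in curve:
    print(n_reads, n_features)
```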

RNA-seq specific

● Counts analysis: feature distribution

ChIP-seq specific

Analysis errors

● Statistical model and simulations

● Self checks:

– Visualizations

– Edge conditions

● Published data examination (ENCODE)

● Method cross-checks

Conclusions

● HTS is prone to random errors and systematic biases

● Quality control is critical for analysis

● Tools are available for detecting and removing QC-related problems

● Additional QC analysis should be performed depending on the problem (genome assembly, SNP calling, etc.) and the technology (RNA-seq, ChIP-seq, <your choice>-seq...)

Tools for QC and EDA

Estimating quality metrics

● FastQC http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

● QualiMap http://qualimap.bioinfo.cipf.es/

Removing errors and cleaning reads

● FastX http://hannonlab.cshl.edu/fastx_toolkit/

● Cutadapt https://code.google.com/p/cutadapt/

● Trimmomatic http://www.usadellab.org/cms/?page=trimmomatic

● Picard-tools http://picard.sourceforge.net/

Technology-specific quality control

● RNA-seq: RNA-SeQC, RSeQC

● ChIP-seq: CEAS, Repitools

● Genome Assembly: Quast

Thank you for your attention!
