NGS - QC & Dataformat

NGS Data Formats & QC Analysis Karan Veer Singh Scientist, NBAGR 06/20/22 1

description

The quality of data is very important for various downstream analyses, such as sequence assembly and single nucleotide polymorphism identification. This presentation shows the parameters used for NGS data quality checks and the data formats of the leading sequencing machines.

Transcript of NGS - QC & Dataformat

Page 1: NGS - QC & Dataformat

NGS

Data Formats & QC Analysis

Karan Veer Singh, Scientist, NBAGR

04/11/231

Page 2: NGS - QC & Dataformat

Sequence Formats

Most sequence formats are ASCII text containing the sequence ID, quality scores, annotation details, comments, and other descriptions of the sequence

Formats are designed to hold sequence data and other information about the sequence

Page 3: NGS - QC & Dataformat

Why so many formats?

Created based on the information required for each step of analysis

Efficient data & time management

Data formats vary in the information they contain

Types of sequence file formats

• Raw sequence files
• Co-ordinate files
• Parameter files
• Annotation files
• Metadata files

Page 4: NGS - QC & Dataformat

Read output formats

454

Solexa/Illumina

SOLiD


Page 5: NGS - QC & Dataformat

454 output formats

.sff (Standard Flowgram Format)

.fna

.qual

Page 6: NGS - QC & Dataformat

Illumina output formats

.seq.txt

.prb.txt

Illumina FASTQ (ASCII value – 64 is the Illumina score)

Qseq (ASCII value – 64 is the Phred score)

SCARF (Solexa Compact ASCII Read Format): Illumina single-line format

Phred quality scores

Page 7: NGS - QC & Dataformat

Illumina FastQ

ASCII value for h = 104

Quality of base A at position 1 = 104 – 64 = 40, where 40 is the Phred score
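The decoding arithmetic on this slide can be sketched in a few lines of Python. This is only an illustration, assuming the older Illumina encoding with an offset of 64 as described here; the function name `decode_quality` and its `offset` parameter are made up for the example, and standard Sanger FASTQ uses an offset of 33 instead.

```python
def decode_quality(quality_string, offset=64):
    """Convert an ASCII-encoded quality string into integer quality scores."""
    return [ord(ch) - offset for ch in quality_string]


# 'h' has ASCII value 104, so with an offset of 64 it decodes to 40.
print(decode_quality("h"))
print(decode_quality("hgf"))

# Same idea with the Sanger offset of 33: 'I' (ASCII 73) also decodes to 40.
print(decode_quality("I", offset=33))
```

Passing the offset as a parameter keeps one function usable for both encodings, which matters because mixing them up shifts every score by 31.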

Page 8: NGS - QC & Dataformat

SOLiD output format(s)

CSFASTA: color-space sequence reads in a FASTA format

These reads can be retained and analyzed in color-space by software

The Format Conversion Tool offers options for cleaning of the CSFASTA files
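As a rough illustration of what "color-space" means, the sketch below translates a color-space read into bases. It assumes the usual two-base encoding, in which each digit 0-3 describes the transition from the previous base to the next, a read that starts with a known primer base, and no missing color calls. The function name `decode_color_space` and the example read are invented for this sketch and are not part of any SOLiD tool.

```python
BASES = "ACGT"  # index of each base doubles as its 2-bit code: A=0, C=1, G=2, T=3


def decode_color_space(read):
    """Translate a color-space read (primer base followed by digits 0-3) into bases."""
    current = BASES.index(read[0])   # known primer base, e.g. 'T'
    decoded = []
    for color in read[1:]:
        current ^= int(color)        # XOR of base code and color gives the next base
        decoded.append(BASES[current])
    return "".join(decoded)


print(decode_color_space("T0120"))
```

Because each base depends on the one before it, a single wrong color call changes every base after it, which is one reason such reads are often kept and analyzed in color space.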

Page 9: NGS - QC & Dataformat

Read Length

• Sanger read lengths ~ 800-2000bp

• Generally we define short reads as anything below 200bp
  − Illumina (100bp – 250bp)
  − SOLiD (75bp max)
  − Ion Torrent (200-300bp max – currently...)
  − Roche 454 (400-800bp)

• Even with these platforms it is cheaper to produce short reads (e.g. 50bp) rather than 100 or 200bp reads

• Diminishing returns: for some applications 50bp is more than sufficient
  − Resequencing of smaller organisms
  − Bacterial de-novo assembly
  − ChIP-Seq
  − Digital Gene Expression profiling
  − Bacterial RNA-seq

Page 10: NGS - QC & Dataformat

Alignment/Assembly Format

Common (“standard”) formats for read alignments:

SAM

BAM (= binary SAM)

MAQ
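To give a feel for the SAM format, here is a minimal sketch that splits one alignment line into its eleven mandatory tab-separated fields. The record is made up for illustration, the helper `parse_sam_line` is hypothetical, and optional tags after the eleventh column are simply ignored.

```python
from collections import namedtuple

# The eleven mandatory SAM columns, in order.
SamRecord = namedtuple(
    "SamRecord",
    "qname flag rname pos mapq cigar rnext pnext tlen seq qual",
)


def parse_sam_line(line):
    """Parse the mandatory columns of one SAM alignment line."""
    fields = line.rstrip("\n").split("\t")
    core = fields[:11]
    for i in (1, 3, 4, 7, 8):        # FLAG, POS, MAPQ, PNEXT, TLEN are integers
        core[i] = int(core[i])
    return SamRecord(*core)


# Made-up example record: a 5 bp read aligned at position 100 of chr1.
example = "read1\t0\tchr1\t100\t60\t5M\t*\t0\t0\tACGTA\tIIIII"
rec = parse_sam_line(example)
print(rec.rname, rec.pos, rec.cigar)
```

For real data a dedicated library is usually a better choice than hand parsing, especially for BAM, which is binary.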

Page 11: NGS - QC & Dataformat

Sequencers & Sequence Assembly Packages


Page 12: NGS - QC & Dataformat

Formats for Genome/Gene annotation

BED format (genome-browser tracks)

GFF format (gene/genome features)

BioXSD (XML) (any annotation; under development)
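One practical difference between BED and GFF is the coordinate convention: BED intervals are 0-based and half-open, while GFF intervals are 1-based and inclusive. The sketch below converts between the two; the function names are invented for this example.

```python
def bed_to_gff_coords(bed_start, bed_end):
    """Convert a 0-based, half-open BED interval to 1-based, inclusive GFF coordinates."""
    return bed_start + 1, bed_end


def gff_to_bed_coords(gff_start, gff_end):
    """Convert a 1-based, inclusive GFF interval to 0-based, half-open BED coordinates."""
    return gff_start - 1, gff_end


# The first 100 bases of a chromosome in each convention.
print(bed_to_gff_coords(0, 100))
print(gff_to_bed_coords(1, 100))
```

Only the start coordinate changes, so an off-by-one error here is easy to miss when moving annotations between the two formats.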


Page 13: NGS - QC & Dataformat

If reads should be deposited in a public repository:

SRA (Sequence Read Archive, formerly the Short Read Archive) at NCBI

Page 14: NGS - QC & Dataformat

Points to remember on Data Formats

For base-call data, “standard” FASTQ (Sanger, Phred)

For read alignments, SAM/BAM/MAQ format

For annotation results, GFF or BED format

Page 15: NGS - QC & Dataformat

QC analysis


Page 16: NGS - QC & Dataformat

All platforms have errors

Illumina, SOLiD/ABI-Life, Roche 454, Ion Torrent

1. Removal of low-quality bases / low-complexity regions
2. Removal of adaptor sequences
3. Homopolymer-associated base-call errors (3 or more identical DNA bases) cause a higher number of (artificial) frameshifts
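Homopolymer stretches, defined above as 3 or more identical bases, can be located with a short regular expression. This is a minimal sketch; the function name `homopolymer_runs` and the example sequence are illustrative only.

```python
import re


def homopolymer_runs(sequence, min_length=3):
    """Return (start, run) pairs for runs of at least min_length identical bases."""
    # One base followed by at least (min_length - 1) repeats of the same base.
    pattern = re.compile(r"([ACGTN])\1{%d,}" % (min_length - 1))
    return [(m.start(), m.group(0)) for m in pattern.finditer(sequence)]


print(homopolymer_runs("ACGGGTTAAAAC"))
```

Start positions are 0-based, so the run "GGG" in the example is reported at index 2.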

Page 17: NGS - QC & Dataformat

Illumina artefacts

GC-rich regions are under-represented (PCR, sequencing)

GGC/GCC motif is associated with low quality and mismatches

Low-quality reads: Phred score < 20
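A Phred score Q corresponds to an error probability of 10^(-Q/10), so a score of 20 means roughly one wrong base call in 100. The sketch below shows that relationship and one possible way to flag a low-quality read. Using the mean score of the read with a cutoff of 20 is an assumption made for illustration, not the rule of any particular tool, and the offset of 64 follows the Illumina encoding described earlier.

```python
def error_probability(phred):
    """Probability that a base call with this Phred score is wrong."""
    return 10 ** (-phred / 10)


def mean_quality(quality_string, offset=64):
    """Mean Phred score of one read's ASCII-encoded quality string."""
    scores = [ord(ch) - offset for ch in quality_string]
    return sum(scores) / len(scores)


def is_low_quality(quality_string, cutoff=20, offset=64):
    """Flag a read whose mean Phred score falls below the cutoff (assumed rule)."""
    return mean_quality(quality_string, offset) < cutoff


print(error_probability(20))
print(is_low_quality("hhhh"))   # all bases at Phred 40
print(is_low_quality("PPPP"))   # all bases at Phred 16
```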

Page 18: NGS - QC & Dataformat

Need for QC & Preprocessing

QC analysis of sequence data is extremely important for meaningful downstream analysis

To analyze problems in quality scores/ statistics of sequencing data

To check whether further analysis with sequence is possible

To remove redundancy (filtering)

To remove low quality reads from analysis

To remove adapter contamination

Highly efficient and fast processing tools are required to handle large volumes of data
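Two of the preprocessing steps listed above, removing adapter contamination and removing redundancy, can be sketched as follows. This is a deliberately simplified illustration: it only handles exact adapter matches and exact duplicate reads, the adapter sequence is just an example, and the function names are invented here. Real tools also handle mismatches and partial adapters.

```python
def trim_adapter(read, adapter):
    """Cut the read at the first exact occurrence of the adapter, if present."""
    index = read.find(adapter)
    return read if index == -1 else read[:index]


def remove_duplicates(reads):
    """Keep the first copy of each exactly duplicated read, preserving order."""
    return list(dict.fromkeys(reads))


reads = ["ACGTACGTAGATCGG", "ACGTACGT", "TTTTGGGG"]
adapter = "AGATCGG"  # illustrative adapter sequence
cleaned = remove_duplicates(trim_adapter(r, adapter) for r in reads)
print(cleaned)
```

Trimming before deduplication matters in this example: the first two reads only become identical once the adapter has been removed.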

Page 19: NGS - QC & Dataformat

The quality of data is very important for various downstream analyses, such as sequence assembly and single nucleotide polymorphism identification

Most of the programs available for downstream analyses do not provide utilities for quality checking and filtering of NGS data before processing

Page 20: NGS - QC & Dataformat

NGS QC Toolkit & FastQC

NGS QC Toolkit is for quality checking and filtering of reads to retain high-quality reads

This toolkit is a standalone and open source application freely available at http://www.nipgr.res.in/ngsqctoolkit.html

The application has been implemented in the Perl programming language

QC of sequencing data generated using Roche 454 and Illumina platforms

Additional tools to aid QC (sequence format converter and trimming tools) and analysis (statistics tools)

FastQC can be used only for preliminary analysis

Page 21: NGS - QC & Dataformat


Page 22: NGS - QC & Dataformat


Page 23: NGS - QC & Dataformat


NGSQC toolkit Output

Page 24: NGS - QC & Dataformat


NGSQC toolkit Output

Page 25: NGS - QC & Dataformat


Comparison - QC tools

Page 26: NGS - QC & Dataformat

FastQC

• Basic statistics
• Quality - per base position
• Per sequence quality distribution
• Nucleotide content per position
• Per sequence GC distribution
• Per base GC distribution
• Per base N content
• Length distribution
• Overrepresented/duplicated sequences
• K-mer content

Page 27: NGS - QC & Dataformat

FastQC (Box-Whisker plot)

Y axis - Quality Score

X axis - Base position
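The numbers behind a per-base quality plot can be approximated by summarizing the quality scores column by column. The sketch below computes the minimum, median, mean and maximum at each base position. It assumes reads of equal length and the offset-64 encoding described earlier, and `per_position_quality` is a name invented for this example rather than a FastQC function.

```python
from statistics import mean, median


def per_position_quality(quality_strings, offset=64):
    """Summarize Phred scores at each base position across all reads."""
    # Transpose read-wise score lists into one column of scores per position.
    columns = zip(*([ord(ch) - offset for ch in q] for q in quality_strings))
    return [
        {"position": i, "min": min(col), "median": median(col),
         "mean": mean(col), "max": max(col)}
        for i, col in enumerate(columns, start=1)
    ]


for row in per_position_quality(["hhT", "hTT", "hhP"]):
    print(row)
```

In the three example reads the scores fall from 40 at position 1 towards 16-20 at position 3, the kind of decline towards the read end that the plot makes visible.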

Page 28: NGS - QC & Dataformat

2. Quality - Per base position

Page 29: NGS - QC & Dataformat

2. Quality - Per base position

Page 30: NGS - QC & Dataformat

3. Per Sequence Quality Distribution

Page 31: NGS - QC & Dataformat

3. Per Sequence Quality Distribution


Page 32: NGS - QC & Dataformat

4. Nucleotide content per position

Page 33: NGS - QC & Dataformat

4. Nucleotide content per position


Page 34: NGS - QC & Dataformat

5. Per sequence GC distribution

Page 35: NGS - QC & Dataformat

5. Per sequence GC distribution

Page 36: NGS - QC & Dataformat

6. Per base GC distribution

Page 37: NGS - QC & Dataformat

6. Per base GC distribution

Page 38: NGS - QC & Dataformat

7. Per base N content

Page 39: NGS - QC & Dataformat

7. Length Distribution

Page 40: NGS - QC & Dataformat

8. K-mer content

Any k-mer showing more than a 3-fold overall enrichment, or a 5-fold enrichment at any given base position, will be reported by this module.
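The idea of overall k-mer enrichment can be sketched as an observed/expected ratio. The version below is a simplification and not FastQC's actual implementation: it assumes the expected count of a k-mer is the product of its base frequencies multiplied by the total number of k-mer windows, and it ignores the per-position check. The function name `enriched_kmers` and its parameters are invented for this example.

```python
from collections import Counter
from math import prod


def enriched_kmers(reads, k=5, min_ratio=3.0):
    """Return k-mers whose observed/expected count ratio exceeds min_ratio."""
    base_counts = Counter()
    kmer_counts = Counter()
    for read in reads:
        base_counts.update(read)
        for i in range(len(read) - k + 1):
            kmer_counts[read[i:i + k]] += 1

    total_bases = sum(base_counts.values())
    total_kmers = sum(kmer_counts.values())
    enriched = {}
    for kmer, observed in kmer_counts.items():
        # Expected count if bases occurred independently at their overall frequency.
        expected = total_kmers * prod(base_counts[b] / total_bases for b in kmer)
        ratio = observed / expected
        if ratio > min_ratio:
            enriched[kmer] = ratio
    return enriched


print(enriched_kmers(["AAAACCCCGGGGTTTT"], k=2))
```

In the toy read each base has frequency 0.25, so every 2-mer is expected about 0.94 times among the 15 windows; the repeated-base 2-mers occur 3 times each and are therefore reported.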

Page 41: NGS - QC & Dataformat

9. Overrepresented/ duplicate sequences

The analysis of overrepresented sequences will spot an increase in any exactly duplicated sequences

A large number of duplicated sequences may indicate sequencing problems

This module will issue a warning if any sequence is found to represent more than 0.1% of the total.
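The 0.1% rule can be illustrated with a simple count of exactly duplicated reads. This is a minimal sketch rather than FastQC's own code; the function name `overrepresented` and the synthetic reads are made up for the example.

```python
from collections import Counter


def overrepresented(reads, threshold_percent=0.1):
    """Return sequences that make up more than threshold_percent of all reads."""
    counts = Counter(reads)
    total = len(reads)
    return {
        seq: 100 * n / total
        for seq, n in counts.items()
        if 100 * n / total > threshold_percent
    }


# Synthetic data: 2000 reads in total, with "ACGT" present 5 times (0.25%).
reads = ["ACGT"] * 5 + ["T" * n for n in range(1, 1996)]
print(overrepresented(reads))
```

Each unique read in the example accounts for 0.05% of the total and stays below the threshold, so only "ACGT" is reported.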

Page 42: NGS - QC & Dataformat

QC Report

Sequence Statistics
Total No. of Sequences      6970943
Avg. Sequence Length        54
Max Sequence Length         54
Min Sequence Length         54
Total Sequence Length       376430922
Total N bases               14254521
% N bases                   3.78676
No of Sequences with Ns     278635
% Sequences with Ns         3.99709

Quality Statistics
Total HQ bases              334195496
%HQ bases                   88.78
Total HQ reads              6350256
%HQ reads                   91.0961
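The percentages in this report follow directly from the raw counts. The short sketch below recomputes them from the figures on the slide; the variable names and the `percent` helper are chosen for this example only.

```python
# Raw counts taken from the QC report above.
total_sequences = 6970943
total_bases = 376430922
n_bases = 14254521
sequences_with_n = 278635
hq_bases = 334195496
hq_reads = 6350256


def percent(part, whole):
    """Express part as a percentage of whole."""
    return 100 * part / whole


print("Avg. sequence length:", total_bases / total_sequences)
print("% N bases:", round(percent(n_bases, total_bases), 5))
print("% sequences with Ns:", round(percent(sequences_with_n, total_sequences), 5))
print("% HQ bases:", round(percent(hq_bases, total_bases), 2))
print("% HQ reads:", round(percent(hq_reads, total_sequences), 4))
```

Recomputing the derived values this way is a quick sanity check that a report is internally consistent before moving on to alignment.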

Alignment statistics