NGS Data Preprocessing

30
NGS data Preprocessing Babelomics Granada, June 2011 Javier Santoyo-Lopez [email protected] http://bioinfo.cipf.es Genomics Department Centro de Investigacion Principe Felipe (CIPF) (Valencia, Spain)

description

NGS Data Preprocessing -Javier Santoyo -Massive sequencing data analysis workshop -Granada 2011

Transcript of NGS Data Preprocessing

Page 1: NGS Data Preprocessing

NGS data Preprocessing

Babelomics

Granada, June 2011

Javier [email protected]

http://bioinfo.cipf.esGenomics Department

Centro de Investigacion Principe Felipe (CIPF)(Valencia, Spain)

Page 2: NGS Data Preprocessing

History of DNA Sequencing History of DNA Sequencing

Avery: Proposes DNA as ‘Genetic Material’

Watson & Crick: Double Helix Structure of DNA

Holley: Sequences Yeast tRNAAla

1870

1953

1940

1965

1970

1977

1980

1990

2002

Miescher: Discovers DNA

Wu: Sequences λ Cohesive End DNA

Sanger: Dideoxy Chain TerminationGilbert: Chemical Degradation

Messing: M13 Cloning

Hood et al.: Partial Automation

Cycle Sequencing Improved Sequencing Enzymes Improved Fluorescent Detection Schemes

1986

Next Generation SequencingImproved enzymes and chemistryImproved image processing

Adapted from Eric Green, NIH; Adapted from Messing & Llaca, PNAS (1998)Adapted from Eric Green, NIH; Adapted from Messing & Llaca, PNAS (1998)

1

15

150

50,000

25,000

1,500

200,000

50,000,000

Efficiency(bp/person/year)

15,000

100,000,000,000 2008

Page 3: NGS Data Preprocessing

Sequence Databases TrendEMBL database growth (March 2011)

Page 4: NGS Data Preprocessing

From Michael Metzker, http://view.ncbi.nlm.nih.gov/pubmed/19997069From Michael Metzker, http://view.ncbi.nlm.nih.gov/pubmed/19997069

Page 5: NGS Data Preprocessing

06/14/11

NGS platforms comparison

Source http://www.clcngs.com/2008/12/ngs-platforms-overview/

Page 6: NGS Data Preprocessing

read length

base

s p

er

ma

chin

e r

un

10 bp 1,000 bp100 bp

1 Gb

100 Mb

10 Mb

10 Gb

ABI capillary sequencer

454 GS FLX Titanium

0.4-1 Gb, 100-500 bp reads

Illumina HiSeq150Gb, 100bp reads

1 Mb

(0.04-0.08 Mb,450-800 bp reads

100 Gb

AB SOLiDv3200Gb, 50 bp reads

Adapted from John McPherson, OICRAdapted from John McPherson, OICR

Next-gen sequencers

Page 7: NGS Data Preprocessing

Many Gbs of Sequences and...

• Data management becomes a challenge.– Moving data across file systems takes time (several hundred Gbs)

• What structure has the data?– Different sequencers output different files, but

– There are some data formats that are being accepted widely (e.g. FastQ format)

• Raw sequence data formats– SFF

– Fasta, csfasta

– Qual file

– Fastq

Page 8: NGS Data Preprocessing

Fasta & Fastq formats

• FastA format (everybody knows about it)– Header line starts with “>” followed by a sequence ID

– Sequence (string of nt).

• FastQ format– First is the sequence (like Fasta but starting with “@”)

– Then “+” and sequence ID (optional) and in the following line are QVs encoded as single byte ASCII codes

• Different quality encode variants

• Nearly all downstream analysis take FastQ as input sequence

Page 9: NGS Data Preprocessing

Sequence to Variation Workflow

RawData

FastQ

SAM

BWA/BWASW

SAMTools

BAM

SamtoolsmPileup

Pileup

BCF Tools

BCFToolsRaw

VCFFilterVCF

GFF

IGV

Page 10: NGS Data Preprocessing

Sequence to Variation Workflow

RawData

FastQ

SAM

BWA/BWASW

FastXTookit FastQ

BWA/BWASW

SAMTools

BAM

SamtoolsmPileup

Pileup

BCF Tools

BCFToolsRaw

VCFFilterVCF

GFF

IGV

Page 11: NGS Data Preprocessing

Why Quality Control and Preprocessing?

Sequencer output:

Reads + quality Is the quality of my sequenced data OK?

Page 12: NGS Data Preprocessing

Why Quality Control and Preprocessing?

Sequencer output:

Reads + quality Is the quality of my sequenced data OK?

If something is wrong can I fix it?

Page 13: NGS Data Preprocessing

Why Quality Control and Preprocessing?

Sequencer output:

Reads + quality Is the quality of my sequenced data OK?

If something is wrong can I fix it? Problem:

HUGE files...

Page 14: NGS Data Preprocessing

Why Quality Control and Preprocessing?

Sequencer output:

Reads + quality Is the quality of my sequenced data OK?

If something is wrong can I fix it? Problem:

HUGE files... How do they look?

Files are flat files and are big... tens of Gbs (please... don't use MS word to see or edit them)

@HWUSI­EAS460:2:1:368:1089#0/1TACGTACGTACGTACGTACGTAGATCGGAAGAGCGG+HWUSI­EAS460:2:1:368:1089#0/1aa[a_a_a^a^a]VZ]R^P[]YNSUTZBBBBBBBBB

@HWUSI­EAS460:2:1:368:528#0/1CTATTATAATATGACCGACCAGCTAGATCTACAGTC+HWUSI­EAS460:2:1:368:528#0/1abbbbaaaabba^aa`Y``aa`aaa``a`a_\_`[_

Page 15: NGS Data Preprocessing

Sequence Quality Per base Position

Good data

Consistent

High quality along the read

* The central red line is the median value * The yellow box represents the inter-quartile range (25-75%) * The upper and lower whiskers represent the 10% and 90% points * The blue line represents the mean quality

Page 16: NGS Data Preprocessing

Sequence Quality Per base Position

Bad data

High variance

Quality decrease with length

Page 17: NGS Data Preprocessing

Per Sequence Quality Distribution

Good data

Most are high-quality sequences

Page 18: NGS Data Preprocessing

Per Sequence Quality Distribution

Bad data

Not uniform distribution

Low Quality Reads

Page 19: NGS Data Preprocessing

Nucleotide Contentper position

Good data

Smooth over length

Organism dependent (GC)

Page 20: NGS Data Preprocessing

Nucleotide Contentper position

Bad data

Sequence position bias

Page 21: NGS Data Preprocessing

GC Distribution

Good data

Fits with the expected

Organism dependent

Page 22: NGS Data Preprocessing

Per sequenceGC Distribution

Bad data

It does not fit with expected

Organism dependent

Library contamination?

Page 23: NGS Data Preprocessing

Per baseGC Distribution

Good data

No variation across read sequence

Page 24: NGS Data Preprocessing

Per baseGC Distribution

Bad data

Variation across read sequence

Page 25: NGS Data Preprocessing

Per baseN content

It's not good if there are N bias per base position

Page 26: NGS Data Preprocessing

Duplicated Sequences

I don't expect high number of duplicated sequences:

PCR artifact?

Page 27: NGS Data Preprocessing

Distribution Length

This is descriptive.

Some sequencers output sequences of different length (e.g. 454)

Page 28: NGS Data Preprocessing

Overrepresented Sequences

Question:

If you obtain the exact same sequence too many times Do you have a problem?

Answer:

Sometimes!

Examples PCR primers (Illumina) GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCT

CGGTTCAGCAGGAATGCCGAGATCGGAAGAGCGGTTCAGC

Page 29: NGS Data Preprocessing

K-mer Content

Helps to detect problems

Adapters?

Page 30: NGS Data Preprocessing

Practical:FastQC and Fastx-toolkit

Use FastQC to see your starting state. Use Fastx-toolkit to optimize different datasets

and then visualize the result with FastQC to prove your success!

Hints: Try trimming, clipping and quality filtering.

Go to the tutorial and try the exercises...