NGS techniques and data

NGS techniques and data relevant for metagenomics analyses. Lex Nederbragt, Norwegian Sequencing Center & Centre for Ecological and Evolutionary Synthesis, University of Oslo

description

A talk I gave for the 2011 metagenomics course at the Biological Dept., Univ. of Oslo, April 2011

Transcript of NGS techniques and data

Page 1: NGS techniques and data

 NGS techniques and data relevant for metagenomics analyses

Lex Nederbragt

Norwegian Sequencing Center & Centre for Ecological and Evolutionary Synthesis

University of Oslo

Page 2: NGS techniques and data

The sequence revolution

Stratton et al Nature 458, 719-724

Page 3: NGS techniques and data

The sequence revolution

Stratton et al Nature 458, 719-724

Page 4: NGS techniques and data

Norwegian Sequencing Center

www.sequencing.uio.no

Page 5: NGS techniques and data

This talk

• Technologies
– 454
– Illumina

• Topics
– How does it work
– What do you get
– Quality check
– Filtering

Page 6: NGS techniques and data

How does it work: 454

Page 7: NGS techniques and data

Library preparation

Shotgun library: starting from DNA sample

Amplicon library: starting from PCR product

Page 8: NGS techniques and data

Library preparation

Shotgun library: fragmentation, then addition of adaptors (A and B)

Amplicon library: Fw and Rv primers carrying adaptors A and B

Page 9: NGS techniques and data

Multiplexing

Amplicon library: tag built into the primer, together with adaptor A and Fw

Shotgun: tag in the adaptors
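Once tagged samples have been pooled and sequenced, the reads are sorted back by tag. A minimal sketch of that step, assuming the tag sits at the very start of each read and must match exactly; the tag sequences, sample names and reads below are hypothetical and only serve as illustration:

```python
def demultiplex(reads, tags):
    """Group reads by the tag at their 5' end; the tag itself is stripped off."""
    by_sample = {sample: [] for sample in tags.values()}
    by_sample["unassigned"] = []
    for read in reads:
        for tag, sample in tags.items():
            if read.startswith(tag):
                by_sample[sample].append(read[len(tag):])
                break
        else:
            # No tag matched exactly (e.g. a sequencing error in the tag).
            by_sample["unassigned"].append(read)
    return by_sample


# Hypothetical tags and reads, for illustration only.
tags = {"ACGAGTGCGT": "sample_1", "ACGCTCGACA": "sample_2"}
reads = [
    "ACGAGTGCGTCCACTTTGCTCCC",
    "ACGCTCGACAGGTTTTAGACGAA",
    "TTTTTTTTTTCCACTTTGCTCCC",
]
result = demultiplex(reads, tags)
for sample, sample_reads in result.items():
    print(sample, sample_reads)
```

Real demultiplexing tools usually tolerate a mismatch or two in the tag; exact matching is used here only to keep the sketch short.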

Page 10: NGS techniques and data

Amplification

Page 11: NGS techniques and data

Plate loading

Page 12: NGS techniques and data

Multiplexing

Plate divided into 2, 4, 8 or 16 lanes

(Image: Flickr.com)

Page 13: NGS techniques and data

Sequencing

PPi: pyrophosphate

Page 14: NGS techniques and data

Basecalling

Page 15: NGS techniques and data

Read length

500 bases

Page 16: NGS techniques and data

Coming soon

Page 17: NGS techniques and data

Single end

• Default single end sequencing
• Special protocols for mate-pairs

Page 18: NGS techniques and data

How does it work: Illumina

Page 19: NGS techniques and data

Library preparation

Multiplexing: same as for 454

Page 20: NGS techniques and data

Bridge amplification

Metzker 2010 Nat Rev Genet.11(1):31-46

Page 21: NGS techniques and data

Bridge amplification

Metzker 2010 Nat Rev Genet.11(1):31-46

Page 22: NGS techniques and data

Multiplexing

Flowcell: 8 lanes

Page 23: NGS techniques and data

Sequencing

Metzker 2010 Nat Rev Genet.11(1):31-46

Reversible terminators

Page 24: NGS techniques and data

Basecalling

Metzker 2010 Nat Rev Genet.11(1):31-46

Page 25: NGS techniques and data

Read length

454 GS FLX Titanium Illumina HiSeq

500 bases

Page 26: NGS techniques and data

Paired-end

• Default paired-end sequencing
– single end also possible

150–600 bases

Page 27: NGS techniques and data

What do you get?

Page 28: NGS techniques and data

454 Throughput

• GS FLX Titanium per-run output:
– Up to 1.5 million single-end reads
– Up to 600 megabases (Mb, million bases)
– Less for amplicons

Page 29: NGS techniques and data

Illumina throughput (HiSeq 2000)

• Variable length
– 50, 100 (soon 150)
– single or paired-end

• Per-run output:
– Up to 1 billion (10^9) single-end reads
– Up to 2 billion paired-end reads
– Up to 200 gigabases (Gb, billion bases)
– Soon: 3 times more reads and bases
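The read and base counts quoted for the two platforms are linked by simple arithmetic: total bases = number of reads × read length. A small sketch of that calculation; the 100-base HiSeq read length is one of the lengths listed on this slide, while the 400-base mean read length for the GS FLX Titanium is an assumption chosen to match the quoted 600 Mb:

```python
def total_bases(n_reads, read_length):
    """Total sequenced bases for a run with reads of (mean) length read_length."""
    return n_reads * read_length


# HiSeq 2000: up to 2 billion paired-end reads of 100 bases each.
hiseq_bases = total_bases(2_000_000_000, 100)

# GS FLX Titanium: up to 1.5 million reads; 400 bases is an assumed mean length.
flx_bases = total_bases(1_500_000, 400)

print(f"HiSeq 2000:      {hiseq_bases / 1e9:.0f} Gb")
print(f"GS FLX Titanium: {flx_bases / 1e6:.0f} Mb")
```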

Page 30: NGS techniques and data

What do you get? Errors!

http://www.it.bton.ac.uk/staff/je/java/jewl/tutorial/tutorial.html

Page 31: NGS techniques and data

Error profiles

454 GS FLX Titanium Illumina Genome Analyzer II

Page 32: NGS techniques and data

454 specific

Homopolymers: 3 G's? 4 G's?

Page 33: NGS techniques and data

Illumina specific

• Substitutions
– e.g. AG

• Underrepresentation of AT and GC rich regions

Page 34: NGS techniques and data

Solving errors

• Oversampling

Page 35: NGS techniques and data

Oversampling: 454

AGAAAGTCAGCGGCAAATTTGGTTTTAGACGAA-TTGTCCCTTTGACATAACGACTAAAGG
AGAAAGTCAGCGGCAAATT-GGTTTTAGACGAA-TTGTCCCTTTGACATAACGACTAAAGG
AGAAAGTCAGCGGCAAATTTGGTTTTAGACGAAATTGTCCCTTTGACATAACGACTAAAGG
AGAAAGTCAGCGGCAAATTTGGTTTTAGACGAA-TTGTCCCTTTGACATAACGACTAAAGG
AGAAAGTCAGCGGCAAATTTGGTTTTAGACGAA-TTGTCCCTTTGACATAACGACTAAAGG
AGAAAGTCAGCGGCAAATTTGGTTTTAGACGAAATTGTCCCTTTGACATAACGACTAAAGG
AGAAAGTCAGCGGCAAATTTGGTTTTAGACGAA-TTGTCCCTTTGACATAACGACTAAAGG
AGAAAGTCAGCGGCAAATTTGGTTTTAGACGAAATTGTCCCTTTGACATAACGACTAAAGG
AGAAAGTCAGCGGCAAATT-GGTTTTAGACGAA-TTGTCCCTTTGACATAACGACTAAAGG
AGAAAGTCAGCGGCAAATTTGGTTTTAGACGAAATTGTCCCTTTGACATAACGACTAAAGG
AGAAAGTCAGCGGCAAATTTGGTTTTAGACGAA-TTGTCCCTTTGACATAACGACTAAAGG

AGAAAGTCAGCGGCAAATTTGGTTTTAGACGAA-TTGTCCCTTTGACATAACGACTAAAGG

Undercall in two reads

Overcall in three reads

Consensus
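The oversampling idea on this slide can be sketched as a per-column majority vote over aligned reads. This is only an illustration of the principle, not the method of any particular assembler; the three read variants are taken from the alignment on this slide, while the number of copies of each is chosen freely for the example:

```python
from collections import Counter


def consensus(aligned_reads):
    """Majority vote per alignment column; '-' marks a gap."""
    return "".join(
        Counter(column).most_common(1)[0][0]
        for column in zip(*aligned_reads)
    )


correct   = "AGAAAGTCAGCGGCAAATTTGGTTTTAGACGAA-TTGTCCCTTTGACATAACGACTAAAGG"
undercall = "AGAAAGTCAGCGGCAAATT-GGTTTTAGACGAA-TTGTCCCTTTGACATAACGACTAAAGG"
overcall  = "AGAAAGTCAGCGGCAAATTTGGTTTTAGACGAAATTGTCCCTTTGACATAACGACTAAAGG"

# Copy numbers are illustrative; errors must stay in the minority per column.
reads = [correct] * 3 + [undercall] + [overcall] * 2
cons = consensus(reads)
print(cons)
```

As long as a given homopolymer error occurs in fewer than half of the reads covering a position, the vote recovers the correct call.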

Page 36: NGS techniques and data

Solving errors

• Oversampling
• 454 amplicons: AmpliconNoise
– this course

• Illumina GC-bias: PCR conditions
– Aird et al. Genome Biology 2011, 12:R18

Page 37: NGS techniques and data

Duplicate reads

• Illumina: PCR step in library prep
• 454: two beads in one microreactor
– emulsion PCR

Page 38: NGS techniques and data

Chimeras

Haas B J et al. Genome Res. 2011;21:494-504

Page 39: NGS techniques and data

Chimeras

• 454 FLX Titanium
– chimera rate of up to 20%

• >70% of sequences representing particular genera

Haas B J et al. Genome Res. 2011;21:494-504

Page 40: NGS techniques and data

Chimeras: solutions

• ChimeraSlayer
– AmpliconNoise

• ChimeraCheck
– Mothur

• See Haas et al. 2011 Genome Res. 21:494-504

Page 41: NGS techniques and data

What do you get? Bytes!

Page 42: NGS techniques and data

Filesizes

• 454
– up to 2 Gbytes per lane (sff)
– two lanes

• HiSeq
– up to 20 Gbytes per lane (fastq)
– eight lanes

Page 43: NGS techniques and data

Datafiles 454

• sff file (standard flowgram format)
– binary

• fasta & qual
– text

Page 44: NGS techniques and data

454: sff file (text format)

>F7K88GK01BMPI0
Run Prefix: R_2009_12_18_15_27_42_
Region #: 1
XY Location: 0551_2346

Run Name: R_2009_12_18_15_27_42_FLX########_Administrator_yourrunname
Analysis Name: D_2009_12_19_01_11_43_XX_fullProcessing
Full Path: /data/R_2009_12_18_15_27_42_FLX########_Administrator_yourrunname/D_2009_12_19_01_11_43_XX_fullProcessing/

Read Header Len: 32
Name Length: 14
# of Bases: 500
Clip Qual Left: 15
Clip Qual Right: 490
Clip Adap Left: 0
Clip Adap Right: 0

Flowgram: 1.03 0.00 1.01 0.02 0.00 0.96 0.00 1.00 0.00 1.04 0.00 0.00 0.97 0.00 0.96 0.02 0.00 1.04 0.01 1.04 0.00 0.97 0.96 0.02 0.00 1.00 0.95 1.04 0.00 0.00 2.04 0.02 0.03 1.05

Flow Indexes: 1 3 6 8 10 13 15 18 20 22 23 26 27 28 31 31 34 35 37 37 37 40 43 45 47 47 47 50 53 53 53 55 58 60 63 66 67 67 67 67 70 71 71 74 74 76 79 82 83 86 86 88 88 91 93 96 97...

Bases: tcagatcagacacgCCACTTTGCTCCCATTTCAGCACCCCACCAAGCACAAGGCTGTCATCCCAATTGGACGGACAGATATGAGGTTAGCATTGGAAACCAATTCAGTCCCTAATTATTCACGACTGAACCCAGCGACAATTGGACATGGATTCATTTTTCAACTTGATTTGTTGTTGTAAAAGCA...

Quality Scores: 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 38 38 38 40 40 40 39 39 39 40 34 34 34 40 40 40 40 39 26 26 26 26 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 ...
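The relation between the Flowgram, Flow Indexes and Bases fields can be made concrete with a small sketch. It assumes the nucleotides are flowed in the repeating order T, A, C, G (which is consistent with the example above) and calls each flow by rounding its signal to the nearest whole number; real 454 basecalling is more involved than this. The function name is made up for the illustration:

```python
def flowgram_to_bases(flow_values, flow_order="TACG"):
    """Call bases by rounding each flow signal to a homopolymer length."""
    bases = []
    for i, signal in enumerate(flow_values):
        n = int(signal + 0.5)  # round to the nearest whole number of bases
        bases.append(flow_order[i % len(flow_order)] * n)
    return "".join(bases)


# The flowgram values shown in the sff record above.
flowgram = [
    1.03, 0.00, 1.01, 0.02, 0.00, 0.96, 0.00, 1.00, 0.00, 1.04,
    0.00, 0.00, 0.97, 0.00, 0.96, 0.02, 0.00, 1.04, 0.01, 1.04,
    0.00, 0.97, 0.96, 0.02, 0.00, 1.00, 0.95, 1.04, 0.00, 0.00,
    2.04, 0.02, 0.03, 1.05,
]
called = flowgram_to_bases(flowgram)
print(called)  # compare with the start of the Bases field
```

A signal near 3.5 rounds to either three or four bases depending on small amounts of noise, which is why homopolymer lengths are the typical 454 error.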

Page 45: NGS techniques and data

454: fasta and qual files

Fasta:

>FTJD6BE02HHD3W length=409 xy=2951_1562 region=2 run=R_2009_04_01_11_28_49_
AGAAAGTCAGCGGCAAATTTGGTTTTAGACGAATTGTCCCTTTGACATAACGACTAAAGGAGTCAACAGATTTTCGTATAACTTCGTATAATGTATGCTATACGAAGTTATTACGCTATT...

Qual:

>FTJD6BE02HHD3W length=409 xy=2951_1562 region=2 run=R_2009_04_01_11_28_49_
40 40 39 39 39 40 40 40 40 40 40 40 40 38 31 26 26 16 16 16 20 20 14 14 14 14 27 33 32 35 36 33 36 35 36 38 35 20 20 21 24 24 22 36 39 40 38 38 38 40 40 40 40 40 40 37 37 37 33 33 29 36 38 38 38 38 38 38 38 35 20 21 21 21 31 36 37 40 40 35 37 37 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40...

Sanger-style Phred scores

Page 46: NGS techniques and data

454: fasta and qual files

Fasta:

>FTJD6BE02HHD3W length=409 xy=2951_1562 region=2 run=R_2009_04_01_11_28_49_
AGAAAGTCAGCGGCAAATTTGGTTTTAGACGAATTGTCCCTTTGACATAACGACTAAAGGAGTCAACAGATTTTCGTATAACTTCGTATAATGTATGCTATACGAAGTTATTACGCTATT...

Qual:

>FTJD6BE02HHD3W length=409 xy=2951_1562 region=2 run=R_2009_04_01_11_28_49_
40 40 39 39 39 40 40 40 40 40 40 40 40 38 31 26 26 16 16 16 20 20 14 14 14 14 27 33 32 35 36 33 36 35 36 38 35 20 20 21 24 24 22 36 39 40 38 38 38 40 40 40 40 40 40 37 37 37 33 33 29 36 38 38 38 38 38 38 38 35 20 21 21 21 31 36 37 40 40 35 37 37 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40...

Sanger-style Phred scores

Phred 40: chance of being wrong: 1:10^4.0 = 1:10000

Phred 35: chance of being wrong: 1:10^3.5 = 1:3162
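The two figures on this slide follow from the Phred definition, error probability = 10^(-Q/10). A short sketch of the conversion in both directions; the function names are chosen for this example only:

```python
import math


def phred_to_error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Phred score corresponding to an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


for q in (40, 35, 20):
    p = phred_to_error_probability(q)
    print(f"Phred {q}: 1 error in {1 / p:.0f} bases")
```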

Page 47: NGS techniques and data

Illumina: fastq file

@PCUS-319-EAS487_0004_FC:6:1:1351:952#0/1
CCAACATAGCTGGATGCCAACATAGCTGGATTGTTATAGCTGGTTTGCTTTTCTAACTCGCTGGAAGTTTATAAGCATTCCTACTATTTCATAGTATTAC
+@PCUS-319-EAS487_0004_FC:6:1:1351:952#0/1
BBbfYcbV^BV`cQffaBZfB_fdfUYaa]`adcbfef\acfd^cad^fOabRceb`beSbdfaad_e^^dbeedTbd`V\ecdfffYBddb^fa\d\de

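Quality score as characters: Phred score = ASCII value - 33 (Sanger FASTQ)

'B' is ASCII 66 → Phred 33

With the Illumina 1.3+ FASTQ variant the offset is 64 instead: 'B' is ASCII 66 → Phred 2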
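Decoding a quality string is a one-line subtraction once the offset is known. The sketch below uses the offset of 33 given on this slide as the default and also shows 64, the offset of the Illumina FASTQ variant described by Cock et al.; which one applies to a given file depends on the pipeline that produced it, so the offset is left as a parameter:

```python
def decode_quality(quality_string, offset=33):
    """Convert FASTQ quality characters to Phred scores for a given offset."""
    return [ord(ch) - offset for ch in quality_string]


qual = "BBbfYcbV^B"  # first ten quality characters of the read above

offset_33 = decode_quality(qual, offset=33)
offset_64 = decode_quality(qual, offset=64)
print("offset 33:", offset_33)
print("offset 64:", offset_64)
```

Scores far above 40 under one offset are a practical hint that the other offset is the right one for the file.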

Page 48: NGS techniques and data

Illumina: fastq file

@PCUS-319-EAS487_0004_FC:6:1:1351:952#0/1
CCAACATAGCTGGATGCCAACATAGCTGGATTGTTATAGCTGGTTTGCTTTTCTAACTCGCTGGAAGTTTATAAGCATTCCTACTATTTCATAGTATTAC
+@PCUS-319-EAS487_0004_FC:6:1:1351:952#0/1
BBbfYcbV^BV`cQffaBZfB_fdfUYaa]`adcbfef\acfd^cad^fOabRceb`beSbdfaad_e^^dbeedTbd`V\ecdfffYBddb^fa\d\de

Matching pair in the other file:
+@PCUS-319-EAS487_0004_FC:6:1:1351:952#0/2
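A fastq record is four lines: name, sequence, a "+" line and the quality string. The following sketch reads such records and derives the name of the mate in the other file from the /1 or /2 suffix. It assumes unwrapped four-line records and the "/1", "/2" naming shown here; the record is copied as it appears on the slide, and the function names are made up for the example:

```python
def parse_fastq(lines):
    """Yield (name, sequence, quality) from four-line FASTQ records."""
    stripped = [line.rstrip("\n") for line in lines]
    stripped = [line for line in stripped if line]
    if len(stripped) % 4 != 0:
        raise ValueError("number of lines is not a multiple of four")
    for i in range(0, len(stripped), 4):
        header, seq, plus, qual = stripped[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        yield header[1:], seq, qual


def mate_name(name):
    """Name of the paired read, assuming a trailing /1 or /2."""
    if name.endswith("/1"):
        return name[:-1] + "2"
    if name.endswith("/2"):
        return name[:-1] + "1"
    raise ValueError(f"no /1 or /2 suffix in {name}")


record = [
    "@PCUS-319-EAS487_0004_FC:6:1:1351:952#0/1",
    "CCAACATAGCTGGATGCCAACATAGCTGGATTGTTATAGCTGGTTTGCTTTTCTAACTCGCTGGAAGTTTATAAGCATTCCTACTATTTCATAGTATTAC",
    "+@PCUS-319-EAS487_0004_FC:6:1:1351:952#0/1",
    r"BBbfYcbV^BV`cQffaBZfB_fdfUYaa]`adcbfef\acfd^cad^fOabRceb`beSbdfaad_e^^dbeedTbd`V\ecdfffYBddb^fa\d\de",
]
records = list(parse_fastq(record))
name, seq, qual = records[0]
print(name, len(seq), mate_name(name))
```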

Page 49: NGS techniques and data

FastQ formats

Cock PJ et al. The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Res. 2010 Apr;38(6):1767-71 (Epub 2009).

and

http://en.wikipedia.org/wiki/Fastq

Page 50: NGS techniques and data

Quality control

Page 51: NGS techniques and data

Quality Control

• 454 (and others): Prinseq
• Illumina (and others): FastQC, fastQA, etc.

Page 52: NGS techniques and data

Prinseq

• http://edwards.sdsu.edu/prinseq_beta
• Web-based and stand-alone
• Upload
– fasta file
– qual file (optional)

Page 53: NGS techniques and data

Prinseq: read length

Page 54: NGS techniques and data

Prinseq: quality per position

Page 55: NGS techniques and data

Prinseq: quality values

Page 56: NGS techniques and data

Prinseq: duplicate reads

Page 57: NGS techniques and data

Prinseq: adaptors

No tag

Barcode (Roche 'MID')

Transcriptome library adaptor

Page 58: NGS techniques and data

Prinseq: contamination

The dinucleotide odds ratios*

Principal component analysis (PCA)

*dinucleotide frequencies normalized for the base composition
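"Normalized for the base composition" can be read as dividing each observed dinucleotide frequency by the product of the frequencies of its two bases. The sketch below implements that simple single-strand form; it is an assumption that this matches what Prinseq computes in detail (it may, for example, also take the reverse strand into account), and the function name is made up for the example:

```python
from collections import Counter


def dinucleotide_odds_ratios(seq):
    """Observed dinucleotide frequency divided by the expected frequency
    (product of the two single-base frequencies). Assumes only A, C, G, T."""
    seq = seq.upper()
    if len(seq) < 2:
        raise ValueError("need at least two bases")
    mono = Counter(seq)
    di = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    n_mono = len(seq)
    n_di = len(seq) - 1
    ratios = {}
    for pair, count in di.items():
        expected = (mono[pair[0]] / n_mono) * (mono[pair[1]] / n_mono)
        ratios[pair] = (count / n_di) / expected
    return ratios


ratios = dinucleotide_odds_ratios("ACACACAC")
for pair, value in sorted(ratios.items()):
    print(pair, round(value, 3))
```

A ratio near 1 means the dinucleotide occurs about as often as the base composition alone predicts; the vector of ratios per sample is what goes into the PCA.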

Page 59: NGS techniques and data

FastQC

• http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/

• Stand-alone
• GUI (Java based)
• Open
– fastq file

Page 60: NGS techniques and data

FastQC: quality per position

Page 61: NGS techniques and data

FastQC: quality per position

Page 62: NGS techniques and data

FastQC: quality values 

Page 63: NGS techniques and data

FastQC: nucleotide composition 

Page 64: NGS techniques and data

FastQC: GC distribution 
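A GC distribution plot is a histogram of per-read GC fractions. The sketch below shows the underlying calculation only; it is not FastQC's own code, the binning is an arbitrary choice and the function names are made up for the example:

```python
def gc_fraction(seq):
    """Fraction of G and C among the unambiguous bases of a read."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def gc_histogram(seqs, bins=10):
    """Count reads per GC-fraction bin; a fraction of 1.0 goes in the last bin."""
    counts = [0] * bins
    for seq in seqs:
        index = min(int(gc_fraction(seq) * bins), bins - 1)
        counts[index] += 1
    return counts


example_reads = ["AAAA", "GGCC", "ATGC", "ATGN"]
print([round(gc_fraction(r), 2) for r in example_reads])
print(gc_histogram(example_reads, bins=2))
```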

Page 65: NGS techniques and data

FastQC: duplicated reads 

Page 66: NGS techniques and data

Filtering/trimming

• Adaptor removal
– especially Illumina

• Duplicate removal
• Filtering for low quality bases
– or stretches of them
– reads with 'N's

• E.g.
– FASTX-Toolkit
– Prinseq
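The filtering steps listed above can be sketched in a few lines. This is an illustration of the logic only, not how the FASTX-Toolkit or Prinseq implement it; the thresholds, function names and reads are made up for the example, and adaptor removal is left out:

```python
def trim_low_quality_end(seq, quals, min_quality=20):
    """Remove the stretch of low-quality bases at the 3' end of a read."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_quality:
        end -= 1
    return seq[:end], quals[:end]


def filter_reads(reads, min_quality=20, min_length=5):
    """Trim read ends, then drop short reads, reads with N and exact duplicates."""
    seen = set()
    kept = []
    for name, seq, quals in reads:
        seq, quals = trim_low_quality_end(seq, quals, min_quality)
        if len(seq) < min_length or "N" in seq.upper() or seq in seen:
            continue
        seen.add(seq)
        kept.append((name, seq, quals))
    return kept


# Hypothetical reads: (name, sequence, Phred scores).
reads = [
    ("r1", "ACGTACGTAC", [40] * 8 + [10, 5]),  # low-quality 3' end
    ("r2", "ACGTACGT", [40] * 8),              # duplicate of trimmed r1
    ("r3", "ACGNACGTAC", [40] * 10),           # contains an N
    ("r4", "TTGACATAAC", [38] * 10),
    ("r5", "ACGT", [40] * 4),                  # shorter than min_length
]
kept = filter_reads(reads)
print([(name, seq) for name, seq, _ in kept])
```

The order of the steps matters: trimming first means that reads differing only in their low-quality tails are recognised as duplicates.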

Page 67: NGS techniques and data

Other technologies

• Life Technologies
– SOLiD
– Ion Torrent
– not much used for metagenomics

• Pacific Biosciences
– PacBio RS
– large potential

Page 68: NGS techniques and data

Pacific Biosciences

Metzker 2010 Nat Rev Genet.11(1):31-46

Zero Mode Waveguides

Page 69: NGS techniques and data

Pacific Biosciences

Metzker 2010 Nat Rev Genet.11(1):31-46