Introduction to next generation sequencing

Rolf Sommer Kaas

National Food Institute, Technical University of Denmark

Outline

• History

• Next generation sequencing (454, Ion Torrent, Illumina, PacBio, MinION)

• Output

• Data Analysis

History

Timeline of sequencing and computing milestones:

• 1951: First computer

• 1953: Watson & Crick

• ‘75: First portable computer, IBM 5100

• ‘77: Frederick Sanger; Walter Gilbert and Alan Maxam

• 1981: First laptop, Osborne 1 (11 kg)

• 1990: World Wide Web

Other timeline labels: ‘72, 1980, Amiga 500

History 1990-2003

Human genome project

• Random Shotgun Sequencing (1998): fast, 300 mill. $

• Hierarchical Shotgun Sequencing: 3 billion $

History 1990-2003

Human genome project

2001: Draft

2003: Complete

History

Timeline continued:

• 2003: Dell laptop

History 2004

Next Generation Sequencing

454 Life Sciences: Parallelized pyrosequencing

Reduced costs 6-fold

History 2004

Next Generation Sequencing

(Wetterstrand KA. DNA Sequencing Costs: Data from the NHGRI Genome Sequencing Program (GSP). Accessed 31-oct-14.)

European Nucleotide Archive (ENA)

(http://www.ebi.ac.uk/ena/about/statistics)

Next generation sequencing

• Roche, 454 Life Sciences (GS FLX Titanium)

• Life Technologies (Ion Torrent & Ion Proton)

• Illumina (HiSeq, MiSeq, GenomeAnalyzer)

• Pacific Biosciences (PacBio RS)

• Oxford Nanopore (MinION, PromethION, GridION)

Next generation sequencing

Method outline - library

1. Fragment DNA

2. Ligate adapters (amplification primer, sequencing primer, barcode)

3. Amplification

4. Sequencing

Next generation sequencing technologies

Ion Torrent

Problem with homopolymers

Fast

Expensive

Long insert sizes

Low throughput

Cheapest

Next generation sequencing

Illumina

Genome Analyzer HiSeq MiSeq

Short reads (~50-250 bp)

Good Accuracy

High Throughput

Next generation sequencing technologies

PacBio

Expensive

Lower accuracy

Long reads (~5000 bp)

Next generation sequencing technologies

Nanopore

• Upcoming technology

• Released to select labs

Next generation sequencing technologies

Nanopore

• Up to 80,000 bp reads

• MinION: 150 mill. bp per 6 h (30x coverage of E. coli)

Devices: MinION, GridION, PromethION

Next generation sequencing technologies

Machine distribution

• Illumina is the most common

• ABI SOLiD not as big as it appears

Output

Sample

Reads

Raw reads

Output

What is sequence data?

Sequence data is stored in fasta files

Fasta example:

• Header/ID

• Sequence
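The example image is not part of this transcript, so here is a minimal sketch of the two parts named above, a Header/ID line starting with ">" followed by the sequence. The record contents and the function name `parse_fasta` are made up for illustration.

```python
from typing import Dict, Optional


def parse_fasta(text: str) -> Dict[str, str]:
    """Return {header: sequence} for FASTA-formatted text."""
    records: Dict[str, str] = {}
    header: Optional[str] = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]  # Header/ID line, without the leading '>'
            records[header] = ""
        elif header is not None:
            records[header] += line  # a sequence may span several lines
    return records


# Hypothetical FASTA text with two records (not taken from the slides)
example = """>seq1 example record
ACGTACGT
TTGA
>seq2
GGCC
"""

fasta_records = parse_fasta(example)
for name, seq in fasta_records.items():
    print(name, len(seq))
```

The sketch keeps everything in memory, which is fine for small files; for large genomes a streaming reader would be a better fit.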

Output

Handling sequence data? Watch out!

Same FASTA file in Word

This should be fine…

Output

Handling sequence data? Watch out!

What your data actually looks like!

Oh no! This won't work…

Take home message:

Use “pure text editors”

Examples:

• Notepad (Win)

• TextEdit (Mac)

• Sublime Text (all)

Save files in “txt” format.
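A rough way to check whether a file is still pure text could look like the sketch below. It is only a heuristic under stated assumptions: the function name `looks_like_plain_fasta` is hypothetical, a Word .docx file is recognised by the "PK" bytes that start every zip container, and anything that is not ASCII is treated as suspicious.

```python
def looks_like_plain_fasta(data: bytes) -> bool:
    """Heuristic check that raw file content is plain-text FASTA."""
    if data.startswith(b"PK"):
        return False  # zip container, e.g. a Word .docx file
    try:
        text = data.decode("ascii")
    except UnicodeDecodeError:
        return False  # bytes that are not plain ASCII text
    lines = [line for line in text.splitlines() if line.strip()]
    # A FASTA file starts with a header line beginning with '>'
    return bool(lines) and lines[0].startswith(">")


print(looks_like_plain_fasta(b">seq1\nACGT\n"))
print(looks_like_plain_fasta(b"PK\x03\x04 word processor content"))
```

In practice the bytes would come from `open(path, "rb").read()`.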

Output

What is the data? Fastq files

What is Fastq? Fasta + quality scores

Fastq example:

@FCC0CD5ACXX:1:1101:1103:2048#ACCGT/1
ACNGTGTTTTTAGTTATTGTTTTGTTAAGTTGGGTTTTTTGTACCCAATAGCCAACAAGCCGCCTTTATGGCGGTTTTTTTGTGCCTGAAAAGTGGGCGCA
+
_BP`ccceggcegihiiighiifhihfddgfhi^efgfhhhhhegiiiiiiiihiihihggeeccdddcccacWTT^acc[ab_`]`[_b`^BBBBBBBB
@FCC0CD5ACXX:1:1101:1165:2058#ACGTT/1
ACGTTAGCAGAATCGCTTTCTGTTCGTTTTCCACCTGCGACAGACGCACCGGACCACGGTTGGCGAGATCGTCGCGCAGAATATCGGCGGCACGCTGCGAC
+
bb_eeceefeggehhdagfghhiihfghighhffhifhhcghfdhiihafgdceba`a\aaccc^V]^baccaccXaaX^bbcccaac[_X]]a[aacXT
@FCC0CD5ACXX:1:1101:1135:2082#AGCGT/1
AGCGTGACAAACATTTTATTGCGCCCGGTTTTATCCAGCTTGAATGCCTGACGAAAGAAGATGATGGTGACGACGATGGAGAGAACAATCAGCACCAGATT
+
bbbeeeeefggfgiihgiigiiiiiiiffgifgeghiiihhfefffhhhfgh_fhggdgegeaceeacbdcbcc\^aa]``_^bb]bcccccbac_a^bc
@FCC0CD5ACXX:1:1101:1239:2083#AGCGT/1
AGCGTCTGACTCACACAAAAACGGTAACACAGTTATCCACAGAATCAGGGGATAAGGCCGGAAAGAACATGTGAGCAAAAAGGCAAAGCCAGGACAAAAGG
+
bbbeeeeegggggiiiiiiiiiigifhhiiighiiihhiiiiiiihiiiiiiiiiihiigcdbbdcdcccccdccccccccacccccccbcccacccccc

1 read, 4 lines
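The "1 read, 4 lines" rule can be sketched as a small reader. This is an illustration rather than a full parser: it assumes every record is exactly four lines with no wrapped sequences, and the names `FastqRead` and `parse_fastq` as well as the two short reads are made up.

```python
from typing import Iterator, NamedTuple


class FastqRead(NamedTuple):
    header: str    # line 1, without the leading '@'
    sequence: str  # line 2
    name: str      # line 3, without the leading '+' (often empty)
    quality: str   # line 4, one character per base


def parse_fastq(text: str) -> Iterator[FastqRead]:
    """Yield one FastqRead per group of four lines."""
    lines = [line for line in text.splitlines() if line.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("expected 4 lines per read")
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        yield FastqRead(header[1:], seq, plus[1:], qual)


# Two hypothetical reads (not taken from the slides)
example = "@read1/1\nACGTN\n+\nhhhhB\n@read2/1\nGGCCA\n+\niiiii\n"
reads = list(parse_fastq(example))
for read in reads:
    print(read.header, read.sequence, read.quality)
```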

The four lines of each read in the Fastq example:

1. Header/ID

2. DNA sequence

3. Name field (optional)

4. Quality scores
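Each character on the quality line encodes a Phred score Q, where the probability that the base is wrong is 10^(-Q/10). The sketch below decodes such a line. The offset is an assumption: characters such as "h" and "i" in the example suggest the older Illumina Phred+64 encoding, while many other files use an offset of 33, so the offset is a parameter. The function names are hypothetical.

```python
from typing import List


def phred_scores(quality: str, offset: int = 64) -> List[int]:
    """Convert a quality string into integer Phred scores."""
    return [ord(ch) - offset for ch in quality]


def error_probability(q: int) -> float:
    """Probability that a base with Phred score q is wrong."""
    return 10 ** (-q / 10)


scores = phred_scores("hhiB")  # assumed Phred+64 characters
for q in scores:
    print(q, error_probability(q))
```

Under this assumption the trailing "B" characters in the example decode to a score of 2, that is, bases that should not be trusted.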

Output

Paired and Single End

Single end reads

Insert size (e.g. 300 bp)

Paired end reads

Long insert size (e.g. 8000 bp)

Output

Splitting & clipping data using barcodes, aka multiplexing

In the Fastq example, the barcode follows “#” in the header (e.g. #ACGTT) and also makes up the first five bases of the read.

De-multiplexing is usually done by the sequencer
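When de-multiplexing has to be done by hand, splitting and clipping could be sketched as follows. The sample names, the tolerance of one mismatch and the function names are assumptions for illustration; the three barcodes are the ones visible in the Fastq headers.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Read = Tuple[str, str]  # (sequence, quality string)


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def demultiplex(reads: List[Read], barcodes: Dict[str, str],
                max_mismatches: int = 1) -> Dict[str, List[Read]]:
    """Split reads by leading barcode and clip the barcode off."""
    bins: Dict[str, List[Read]] = defaultdict(list)
    for seq, qual in reads:
        for sample, barcode in barcodes.items():
            n = len(barcode)
            if len(seq) >= n and hamming(seq[:n], barcode) <= max_mismatches:
                # Clip barcode bases from both sequence and quality
                bins[sample].append((seq[n:], qual[n:]))
                break
        else:
            bins["unassigned"].append((seq, qual))
    return dict(bins)


# Hypothetical sample names mapped to the barcodes seen in the headers
barcodes = {"sample_A": "ACCGT", "sample_B": "ACGTT", "sample_C": "AGCGT"}
reads = [
    ("ACNGTGTTTTT", "h" * 11),  # one mismatch (N) to ACCGT
    ("ACGTTAGCAGA", "h" * 11),  # exact match to ACGTT
    ("TTTTTGGGGGG", "h" * 11),  # matches no barcode
]
bins = demultiplex(reads, barcodes)
print(bins)
```

Allowing mismatches only makes sense when the barcodes differ from each other at more positions than the tolerance; otherwise a read could be assigned to the wrong sample.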

Output

Data quality

Output

Trimming data

Trimming removes low-quality bases from the reads, such as the runs of B at the end of the quality strings below.

Fastq example:

@FCC0CD5ACXX:1:1101:1103:2048#ACCGT/1
ACNGTGTTTTTAGTTATTGTTTTGTTAAGTTGGGTTTTTTGTACCCAATAGCCAACAAGCCGCCTTTATGGCGGTTTTTTGTGCCTGAAAAGTGGGCGCA
+
_BP`ccceggcegihiiighiifhihfddgfhi^efgfhhhhhegiiiiiiiihiihihggeeccdddcccacWTT^acc[ab_`]`[_b`^BBBBBBBB
@FCC0CD5ACXX:1:1101:1165:2058#ACGTT/1
ACGTTAGCAGAATCGCTTTCTGTTCGTTTTCCACCTGCGACAGACGCACCGGACCACGGTTGGCGAGATCGTCGCGCAGAATATCGGCGGCACGCTGCGAC
+
bb_eeceefeggehhdagfghhiihfghighhffhifhhcghfdhiihafgdceba`a\aaccc^V]^baccaccXaaacc[ab_`]`[_b`^BBBBBBBB
@FCC0CD5ACXX:1:1101:1135:2082#AGCGT/1
AGCGTGACAAACATTTTATTGCGCCCGGTTTTATCCAGCTTGAATGCCTGACGAAAGAAGATGATGGTGACGACGATGGAGAGAACAATCAGCACCAGATT
+
bbbeeeeefggfgiihgiigiiiiiiiffgifgeghiiihhfefffhhhfgh_fhggdgegeaceeacbdcbcc\^aa]``_^bb]bcccccbac_a^bc
@FCC0CD5ACXX:1:1101:1239:2083#AGCGT/1
AGCGTCTGACTCACACAAAAACGGTAACACAGTTATCCACAGAATCAGGGGATAAGGCCGGAAAGAACATGTGAGCAAAAAGGCAAAGCCAGGACAAAAGG
+
bbbeeeeegggggiiiiiiiiiigifhhiiighiiihhiiiiiiihiiiiiiiiiihiigcdbbdcdcccccdccccccccacccccccbcccacccccc
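A simple form of trimming cuts bases from the end of a read until the last base reaches a minimum quality. The sketch below is one possible approach, not the method of any particular tool; the threshold of 20, the Phred+64 offset and the function name `trim_3prime` are assumptions.

```python
from typing import Tuple


def trim_3prime(seq: str, qual: str, min_q: int = 20,
                offset: int = 64) -> Tuple[str, str]:
    """Remove low-quality bases from the 3' end of a read."""
    end = len(seq)
    # Step backwards while the last remaining base is below the threshold
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


# Hypothetical read whose last four bases have quality 'B' (score 2)
trimmed = trim_3prime("ACGTACGTAC", "hhhhhhBBBB")
print(trimmed)
```

Real trimming tools typically use sliding windows or running sums instead of a single-base rule, and also discard reads that become too short.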

Output

Data quality

Coverage & Depth

Coverage: Average number of times each position in the genome is covered by the data.

C = N * L / G

• N: Number of reads

• L: Read length

• G: Genome size

Example: N = 5 mill, L = 100 bp, G = 5 Mbp

C = 5*100/5 = 100X

On average, 100 reads cover each position in the genome.

Depth: Number of reads that cover a particular nucleotide position in the genome.

depth = reads / site

(target or assembly)

Breadth-of-coverage:

C = assembly size / target size

Example: assembly = 4.9 mill, target = 5 mill

C = 4.9/5 = 0.98
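The three quantities can be sketched in a few lines, using the numbers from the examples above. The function names are hypothetical, and `depth_per_site` assumes that reads are given as 0-based start positions of equal-length reads.

```python
from typing import List


def coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """Average coverage C = N * L / G."""
    return n_reads * read_length / genome_size


def breadth_of_coverage(assembly_size: float, target_size: float) -> float:
    """Fraction of the target that is represented in the assembly."""
    return assembly_size / target_size


def depth_per_site(read_starts: List[int], read_length: int,
                   genome_size: int) -> List[int]:
    """Number of reads covering each position of the genome."""
    depth = [0] * genome_size
    for start in read_starts:
        for pos in range(start, min(start + read_length, genome_size)):
            depth[pos] += 1
    return depth


print(coverage(5_000_000, 100, 5_000_000))
print(breadth_of_coverage(4.9e6, 5e6))
print(depth_per_site([0, 2, 2], read_length=3, genome_size=6))
```

Coverage is a single average for the whole genome, whereas depth is reported per position, so a genome with 100X coverage can still contain positions with a depth of zero.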

Output

Data storage & Access

International Nucleotide Sequence Database Collaboration (INSDC)

• Europe: European Bioinformatics Institute (EBI)

• United States: National Center for Biotechnology Information (NCBI)

• Asia: DNA Data Bank of Japan (DDBJ)

Data exchange between the three databases: 24 h

Output

Data storage & Access

European Bioinformatics Institute (EBI)

http://www.ebi.ac.uk/ena

Data Analysis

Data splitting, clipping, and trimming

Assembly (reference or de novo), followed by further analysis (e.g. gene finding)

Mapping to a reference, followed by further analysis (e.g. SNP trees)

Data Analysis

Bioinformatic platforms

Unix (Mac OS X, Linux): Bioinformatic tools

DOS (Windows): Bioinformatic tools

CLC bio and MEGA

Geneious

Data Analysis

Bioinformatic platforms

Unix…

Data Analysis

Bioinformatic platforms

Web-tools to the rescue!

+ Platform independent

+ Requires few computer resources

+ Can be done everywhere

- Requires patience

• http://www.genomicepidemiology.org/ : MLST, resistance genes, SNP calling and tree creation, species identification

• https://main.g2.bx.psu.edu/ : Many NGS tools, steep learning curve

Data Analysis

Assembly: De novo

Different sequencers require different assemblers

• Depends on output and error profile

Assembler: Newbler

• 454

• Ion Torrent

Assembler: Velvet

• Illumina

• ABI SOLiD (color space)

Data Analysis

Assembly: De novo

Velvet: The unnecessarily complex assembler

• K-mer based assembler

• User needs to set K

• Longer reads allow a larger K

• Everything is defined in “Kmer-space”

• Nucleotide length = Kmer_length + K-1

• Kmer_coverage = Nucleotide_coverage * (Read_length-K+1)/Read_length

Data Analysis

Assembly: De novo

Velvet assembly

Example (K = 83):

>NODE_1_length_91928_cov_23.136574
AGTTCATTGATAAATCTTTTTTGATTATCATCAACGAGTGCCCACACAGATTGATTGGTT
TATATTGTTAAAGAGCTTTTCCTATCGAAATCGCTTTTAAGCTCAATTCGCTAGGGCTGC
GTATATTACGCTTATTCAGTTGAGTGTCAAACGTTATTTTCTA...

Nucleotide length = Kmer_length + K-1

91928 + 83 - 1 = 92010

Kmer_coverage = Nucleotide_coverage * (Read_length-K+1)/Read_length

Nucleotide_coverage = 23.136574 / ((300 - 83 + 1) / 300) = 31.84
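The conversion from “Kmer-space” back to nucleotides can be sketched as below, using the header and numbers of the example. The function names and the regular expression are illustrative assumptions about the header layout, not part of Velvet itself.

```python
import re
from typing import Tuple


def parse_velvet_header(header: str) -> Tuple[int, int, float]:
    """Extract node id, k-mer length and k-mer coverage from a contig header."""
    match = re.match(r">?NODE_(\d+)_length_(\d+)_cov_([\d.]+)$", header)
    if match is None:
        raise ValueError(f"unexpected header: {header}")
    return int(match.group(1)), int(match.group(2)), float(match.group(3))


def nucleotide_length(kmer_length: int, k: int) -> int:
    """Nucleotide length = Kmer_length + K - 1."""
    return kmer_length + k - 1


def nucleotide_coverage(kmer_coverage: float, read_length: int, k: int) -> float:
    """Invert Kmer_coverage = Nucleotide_coverage * (Read_length-K+1)/Read_length."""
    return kmer_coverage * read_length / (read_length - k + 1)


node, kmer_len, kmer_cov = parse_velvet_header(">NODE_1_length_91928_cov_23.136574")
print(nucleotide_length(kmer_len, k=83))
print(round(nucleotide_coverage(kmer_cov, read_length=300, k=83), 2))
```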

Data Analysis

De novo quality check

Number of contigs

- Fewer is generally better

N50

Figure labels: Total size of contigs, 50% of size

N50: Size of the contig at which the contigs, sorted from largest to smallest, add up to 50% of the total size of contigs.
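N50 can be computed by sorting the contigs from largest to smallest and summing their sizes until half of the total size is reached. A minimal sketch, with a hypothetical function name and made-up contig sizes:

```python
from typing import List


def n50(contig_lengths: List[int]) -> int:
    """Size of the contig at which 50% of the total size is reached."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:  # reached 50% of the total size
            return length
    raise ValueError("no contigs given")


# Hypothetical contig sizes, total size 300
contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(contigs))
```

Here 80 + 70 = 150 is half of the total, so the N50 is 70. A higher N50 means that more of the assembly sits in large contigs.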

Data Analysis

Data splitting, clipping, and trimming

Assembly (reference or de novo), followed by further analysis (e.g. gene finding)

Data Analysis

Further data analysis

Contigs

• Gene finding

• Resistance

• MLST

• Etc.

Data Analysis

Further data analysis

Genes are not just genes…

• Find genes by Open Reading Frames + Shine-Dalgarno + motifs

• Not found does not mean it is NOT there: the gene may be not assembled or truncated

• “Hypothetical” & “Putative”: The curse of bioinformatics

The evil circle of BLAST similarity:

• Annotated gene, verified in the lab

• “Hypothetical” or “Putative” annotations

• No match to original sequence

Suggested annotation service:

RAST: http://rast.nmpdr.org/
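The Open Reading Frame part of gene finding can be illustrated with a deliberately simple sketch. It only scans the three forward frames for a start codon (ATG) followed in frame by a stop codon; it ignores the reverse strand, Shine-Dalgarno sequences and motifs, so it is not a substitute for an annotation service. The function name and the test sequence are made up.

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 3) -> List[Tuple[int, int]]:
    """Return (start, end) of forward-strand ORFs; end is exclusive and
    includes the stop codon."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon in this frame
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # look for the next ORF
    return sorted(orfs)


dna = "CCATGAAATTTTAGGG"  # hypothetical sequence
orfs = find_orfs(dna)
for start, end in orfs:
    print(start, end, dna[start:end])
```

An ORF that runs off the end of a contig is not reported by this sketch, which is one way a truncated gene can appear to be missing.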

Data Analysis

Data splitting, clipping, and trimming

Assembly (reference or de novo), followed by further analysis (e.g. gene finding)

Mapping to a reference

Data Analysis

Mapping to a reference

Figure labels: Reference sequence, raw reads, Do not match reference, Do not match any reads

Mappers:

• BWA

• Bowtie

• MAQ

• CGE
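The idea behind mapping can be shown with a toy example that places each read at its first exact match in the reference. This is not how the mappers listed above work internally (they use indexes and tolerate mismatches); the function name, reference and reads are made up. The output mirrors the figure labels: reads that do not match the reference, and reference positions that do not match any reads.

```python
from typing import List, Tuple


def map_reads(reference: str,
              reads: List[str]) -> Tuple[List[int], List[str], List[int]]:
    """Return per-position depth, unmapped reads and uncovered positions."""
    depth = [0] * len(reference)
    unmapped = []
    for read in reads:
        pos = reference.find(read)  # first exact match, -1 if none
        if pos == -1:
            unmapped.append(read)  # read does not match the reference
            continue
        for i in range(pos, pos + len(read)):
            depth[i] += 1
    # Reference positions that no read matched
    uncovered = [i for i, d in enumerate(depth) if d == 0]
    return depth, unmapped, uncovered


reference = "AACCGGTTAGCT"  # hypothetical reference
depth, unmapped, uncovered = map_reads(reference, ["AACCG", "CCGGT", "GGGGG"])
print(depth)
print(unmapped)
print(uncovered)
```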

Thank you for listening

Questions?