Introduction to next generation sequencing Rolf Sommer Kaas.
Introduction to next generation sequencing
Rolf Sommer Kaas
National Food Institute, Technical University of Denmark
Outline
• History
• Next generation sequencing: 454, Ion Torrent, Illumina, PacBio, MinION
• Output
• Data Analysis
History
Timeline of DNA and computer milestones:
• 1951: First computer
• 1953: Watson & Crick
• ‘72, ‘77 and 1980: Frederick Sanger; Walter Gilbert and Alan Maxam
• ‘75: First portable computer (IBM 5100)
• 1981: First laptop, Osborne 1 (11 kg)
• 1990: World Wide Web
• Amiga 500 (no year given)
History: 1990-2003
Human genome project
1998
• Random Shotgun Sequencing
  • Fast
  • 300 mill. $
• Hierarchical Shotgun Sequencing
  • 3 billion $
2001: Draft
2003: Complete
History
The same timeline, extended:
• 2003: Dell Laptop
History: 2004
Next Generation Sequencing
454 Life Sciences: Parallelized pyrosequencing
Reduced costs 6-fold
(Wetterstrand KA. DNA Sequencing Costs: Data from the NHGRI Genome Sequencing Program (GSP). Accessed 31-oct-14.)
European Nucleotide Archive (ENA)
(http://www.ebi.ac.uk/ena/about/statistics)
Next generation sequencing
• Roche, 454 Life Sciences (GS FLX Titanium)
• Life Technologies (Ion Torrent & Ion Proton)
• Illumina (HiSeq, MiSeq, GenomeAnalyzer)
• Pacific Biosciences (PacBio RS)
• Oxford Nanopore (MinION, PromethION, GridION)
Next generation sequencing
Method outline - library
1. Fragment DNA
2. Ligate adapters (amplification primer, sequencing primer, barcode)
3. Amplification
4. Sequencing
Next generation sequencing technologies
Ion Torrent
Problem with homopolymers
Fast
Expensive
Long insert sizes
Low throughput
Cheapest
Next generation sequencing
Illumina
Genome Analyzer, HiSeq, MiSeq
Short reads (~50-250 bp)
Good Accuracy
High Throughput
Next generation sequencing technologies
PacBio
Expensive
Lower accuracy
Long reads (~5000 bp)
Next generation sequencing technologies
Nanopore
• Upcoming technology
• Released to select labs
• Up to 80,000 bp reads
• MinION: 150 million bp per 6 h (30x coverage of E. coli)
Devices: MinION, PromethION, GridION
Next generation sequencing technologies
Machine distribution
• Illumina is the most common
• ABI SOLiD not as big as it appears
Reads
Sample
Raw reads
Output
What is sequence data?
Sequence data is stored in FASTA files
FASTA example: a Header/ID line followed by the Sequence
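A minimal Python sketch of reading such a file, assuming the usual FASTA convention that a Header/ID line starts with ">" and the lines after it hold the sequence. The function name `parse_fasta` and the two example records are hypothetical, made up for illustration.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {header: sequence} from the lines of a FASTA file."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]  # drop the leading ">"
            records[header] = ""
        elif header is not None:
            # A sequence may be wrapped over several lines; join them.
            records[header] += line.upper()
    return records


# Hypothetical example data, not taken from the slides.
example = """>contig_1 hypothetical example record
ACGTACGTAC
GGTTAACC
>contig_2
TTTTGGGG
""".splitlines()

records = parse_fasta(example)
for name, seq in records.items():
    print(name, len(seq))
```

For real files, opening with `open(path)` and passing the file object to `parse_fasta` should work the same way, since a file object iterates over lines.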
Handling sequence data? Watch out!
Same FASTA file in Word
This should be fine…
What your data actually looks like!
Oh no! This won't work…
Take home message:
Use “pure text editors”
Examples:
• Notepad (Win)
• TextEdit (Mac)
• Sublime Text (all)
Save files in “txt” format.
What is the data? Fastq files
What is Fastq? Fasta + quality scores
Fastq example:
@FCC0CD5ACXX:1:1101:1103:2048#ACCGT/1
ACNGTGTTTTTAGTTATTGTTTTGTTAAGTTGGGTTTTTTGTACCCAATAGCCAACAAGCCGCCTTTATGGCGGTTTTTTTGTGCCTGAAAAGTGGGCGCA
+
_BP`ccceggcegihiiighiifhihfddgfhi^efgfhhhhhegiiiiiiiihiihihggeeccdddcccacWTT^acc[ab_`]`[_b`^BBBBBBBB
@FCC0CD5ACXX:1:1101:1165:2058#ACGTT/1
ACGTTAGCAGAATCGCTTTCTGTTCGTTTTCCACCTGCGACAGACGCACCGGACCACGGTTGGCGAGATCGTCGCGCAGAATATCGGCGGCACGCTGCGAC
+
bb_eeceefeggehhdagfghhiihfghighhffhifhhcghfdhiihafgdceba`a\aaccc^V]^baccaccXaaX^bbcccaac[_X]]a[aacXT
@FCC0CD5ACXX:1:1101:1135:2082#AGCGT/1
AGCGTGACAAACATTTTATTGCGCCCGGTTTTATCCAGCTTGAATGCCTGACGAAAGAAGATGATGGTGACGACGATGGAGAGAACAATCAGCACCAGATT
+
bbbeeeeefggfgiihgiigiiiiiiiffgifgeghiiihhfefffhhhfgh_fhggdgegeaceeacbdcbcc\^aa]``_^bb]bcccccbac_a^bc
@FCC0CD5ACXX:1:1101:1239:2083#AGCGT/1
AGCGTCTGACTCACACAAAAACGGTAACACAGTTATCCACAGAATCAGGGGATAAGGCCGGAAAGAACATGTGAGCAAAAAGGCAAAGCCAGGACAAAAGG
+
bbbeeeeegggggiiiiiiiiiigifhhiiighiiihhiiiiiiihiiiiiiiiiihiigcdbbdcdcccccdccccccccacccccccbcccacccccc
1 read, 4 lines
Header/ID
DNA sequence
Name field (optional)
Quality scores
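The "1 read, 4 lines" layout can be sketched in Python as below. This is an illustrative reader, not a full FASTQ implementation: it assumes every record is exactly four lines (no wrapped sequences), and the names `FastqRecord` and `parse_fastq` as well as the short example reads are hypothetical. Grouping by four lines matters because a quality line may itself begin with "@".

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    header: str    # line 1, without the leading "@"
    sequence: str  # line 2
    name: str      # line 3, without the leading "+" (often empty)
    quality: str   # line 4, one character per base


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield one record per group of four lines."""
    stripped = [line.rstrip("\r\n") for line in lines if line.strip()]
    if len(stripped) % 4 != 0:
        raise ValueError("FASTQ line count is not a multiple of 4")
    for i in range(0, len(stripped), 4):
        head, seq, plus, qual = stripped[i:i + 4]
        if not head.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {head}")
        yield FastqRecord(head[1:], seq, plus[1:], qual)


# Hypothetical short reads; the second quality line starts with "@".
example = [
    "@read_1", "ACNGT", "+", "_BP`c",
    "@read_2", "ACGTT", "+", "@b_ee",
]
records = list(parse_fastq(example))
for rec in records:
    print(rec.header, rec.sequence, rec.quality)
```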
Paired and Single End
Single end reads
Insert size (e.g. 300 bp)
Paired end reads
Long insert size (e.g. 8000 bp)
Splitting & clipping data
Fastq example:
@FCC0CD5ACXX:1:1101:1103:2048#ACCGT/1
ACNGTGTTTTTAGTTATTGTTTTGTTAAGTTGGGTTTTTTGTACCCAATAGCCAACAAGCCGCCTTTATGGCGGTTTTTTTGTGCCTGAAAAGTGGGCGCA
+
_BP`ccceggcegihiiighiifhihfddgfhi^efgfhhhhhegiiiiiiiihiihihggeeccdddcccacWTT^acc[ab_`]`[_b`^BBBBBBBB
@FCC0CD5ACXX:1:1101:1165:2058#ACGTT/1
ACGTTAGCAGAATCGCTTTCTGTTCGTTTTCCACCTGCGACAGACGCACCGGACCACGGTTGGCGAGATCGTCGCGCAGAATATCGGCGGCACGCTGCGAC
+
bb_eeceefeggehhdagfghhiihfghighhffhifhhcghfdhiihafgdceba`a\aaccc^V]^baccaccXaaX^bbcccaac[_X]]a[aacXT
@FCC0CD5ACXX:1:1101:1135:2082#AGCGT/1
AGCGTGACAAACATTTTATTGCGCCCGGTTTTATCCAGCTTGAATGCCTGACGAAAGAAGATGATGGTGACGACGATGGAGAGAACAATCAGCACCAGATT
+
bbbeeeeefggfgiihgiigiiiiiiiffgifgeghiiihhfefffhhhfgh_fhggdgegeaceeacbdcbcc\^aa]``_^bb]bcccccbac_a^bc
@FCC0CD5ACXX:1:1101:1239:2083#AGCGT/1
AGCGTCTGACTCACACAAAAACGGTAACACAGTTATCCACAGAATCAGGGGATAAGGCCGGAAAGAACATGTGAGCAAAAAGGCAAAGCCAGGACAAAAGG
+
bbbeeeeegggggiiiiiiiiiigifhhiiighiiihhiiiiiiihiiiiiiiiiihiigcdbbdcdcccccdccccccccacccccccbcccacccccc
Splitting using barcodes is also known as de-multiplexing
De-multiplexing is usually done by the sequencer
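For cases where the data still has to be split by hand, a rough Python sketch follows. It rests on two assumptions read off the example above, which may not hold for other files: the barcode is the text between "#" and "/" in the header, and the read starts with that barcode, so the same number of bases is clipped from the sequence and the quality string. The function names and the shortened reads are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Read = Tuple[str, str, str]  # (header, sequence, quality)


def barcode_from_header(header: str) -> str:
    """Extract the text between '#' and '/' (assumed header layout)."""
    return header.split("#", 1)[1].split("/", 1)[0]


def split_and_clip(reads: List[Read]) -> Dict[str, List[Read]]:
    """Group reads by barcode and clip the barcode off the 5' end."""
    bins: Dict[str, List[Read]] = defaultdict(list)
    for header, seq, qual in reads:
        barcode = barcode_from_header(header)
        n = len(barcode)
        # The header barcode is trusted; the read start is not re-checked,
        # so an "N" inside the barcode bases does not change the bin.
        bins[barcode].append((header, seq[n:], qual[n:]))
    return dict(bins)


# Shortened (hypothetical) versions of three reads from the example.
reads = [
    ("FCC0CD5ACXX:1:1101:1103:2048#ACCGT/1", "ACNGTGTTTTTAG", "_BP`ccceggceg"),
    ("FCC0CD5ACXX:1:1101:1135:2082#AGCGT/1", "AGCGTGACAAACA", "bbbeeeeefggfg"),
    ("FCC0CD5ACXX:1:1101:1239:2083#AGCGT/1", "AGCGTCTGACTCA", "bbbeeeeeggggg"),
]
bins = split_and_clip(reads)
for barcode, members in sorted(bins.items()):
    print(barcode, len(members))
```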
Data quality
Trimming data
Fastq example:
@FCC0CD5ACXX:1:1101:1103:2048#ACCGT/1
ACNGTGTTTTTAGTTATTGTTTTGTTAAGTTGGGTTTTTTGTACCCAATAGCCAACAAGCCGCCTTTATGGCGGTTTTTTTGTGCCTGAAAAGTGGGCGCA
+
_BP`ccceggcegihiiighiifhihfddgfhi^efgfhhhhhegiiiiiiiihiihihggeeccdddcccacWTT^acc[ab_`]`[_b`^BBBBBBBB
@FCC0CD5ACXX:1:1101:1165:2058#ACGTT/1
ACGTTAGCAGAATCGCTTTCTGTTCGTTTTCCACCTGCGACAGACGCACCGGACCACGGTTGGCGAGATCGTCGCGCAGAATATCGGCGGCACGCTGCGAC
+
bb_eeceefeggehhdagfghhiihfghighhffhifhhcghfdhiihafgdceba`a\aaccc^V]^baccaccXaaacc[ab_`]`[_b`^BBBBBBBB
@FCC0CD5ACXX:1:1101:1135:2082#AGCGT/1
AGCGTGACAAACATTTTATTGCGCCCGGTTTTATCCAGCTTGAATGCCTGACGAAAGAAGATGATGGTGACGACGATGGAGAGAACAATCAGCACCAGATT
+
bbbeeeeefggfgiihgiigiiiiiiiffgifgeghiiihhfefffhhhfgh_fhggdgegeaceeacbdcbcc\^aa]``_^bb]bcccccbac_a^bc
@FCC0CD5ACXX:1:1101:1239:2083#AGCGT/1
AGCGTCTGACTCACACAAAAACGGTAACACAGTTATCCACAGAATCAGGGGATAAGGCCGGAAAGAACATGTGAGCAAAAAGGCAAAGCCAGGACAAAAGG
+
bbbeeeeegggggiiiiiiiiiigifhhiiighiiihhiiiiiiihiiiiiiiiiihiigcdbbdcdcccccdccccccccacccccccbcccacccccc
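One simple way to trim is to cut low-quality bases off the 3' end of each read, sketched below in Python. The quality encoding is an assumption: the lowercase letters and trailing "B" characters in the example suggest an older Illumina encoding with an ASCII offset of 64, but many files use an offset of 33, so `offset` is left as a parameter. The threshold of 20 and the function names are also illustrative choices, and real trimmers typically apply more refined rules.

```python
from typing import List, Tuple


def phred_scores(quality: str, offset: int = 64) -> List[int]:
    """Convert quality characters to integer scores (offset is assumed)."""
    return [ord(ch) - offset for ch in quality]


def trim_3prime(sequence: str, quality: str,
                min_quality: int = 20, offset: int = 64) -> Tuple[str, str]:
    """Remove trailing bases whose score is below min_quality."""
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality must have the same length")
    keep = len(sequence)
    while keep > 0 and ord(quality[keep - 1]) - offset < min_quality:
        keep -= 1
    return sequence[:keep], quality[:keep]


# Hypothetical 10 bp read: "h" scores 40 and "B" scores 2 at offset 64.
seq, qual = trim_3prime("ACGTACGTAC", "hhhhhhhBBB")
print(seq, qual)  # ACGTACG hhhhhhh
```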
Coverage & Depth
Coverage: Average number of times each position in the genome is covered by the data.
C = N * L / G
• N: Number of reads
• L: Read length
• G: Genome size
Example: N = 5 mill, L = 100 bp, G = 5 Mbp
C = 5 * 100 / 5 = 100X
On average, 100 reads cover each position in the genome.
Depth: Number of reads that cover a particular nucleotide position in the genome.
depth = reads / site
Breadth-of-coverage (target or assembly):
c = assembly size / target size
Example: assembly = 4.9 mill, target = 5 mill
c = 4.9 / 5 = 0.98
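The two examples above can be checked with a few lines of Python. The function names are made up for this sketch, and `depth_per_site` assumes reads are given as 0-based start positions on the genome, which is a simplification for illustration only.

```python
from typing import List


def coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """C = N * L / G"""
    return n_reads * read_length / genome_size


def breadth_of_coverage(assembly_size: float, target_size: float) -> float:
    """c = assembly size / target size"""
    return assembly_size / target_size


def depth_per_site(read_starts: List[int], read_length: int,
                   genome_size: int) -> List[int]:
    """Count how many reads cover each position (0-based starts assumed)."""
    depth = [0] * genome_size
    for start in read_starts:
        for pos in range(start, min(start + read_length, genome_size)):
            depth[pos] += 1
    return depth


print(coverage(5_000_000, 100, 5_000_000))         # 100.0
print(breadth_of_coverage(4_900_000, 5_000_000))   # 0.98
print(depth_per_site([0, 2, 2], 3, 6))             # [1, 1, 3, 2, 2, 0]
```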
Data storage & Access
International Nucleotide Sequence Database Collaboration (INSDC)
• Europe: European Bioinformatics Institute (EBI)
• United States: National Center for Biotechnology Information (NCBI)
• Asia: DNA Data Bank of Japan (DDBJ)
Data is exchanged between the three every 24 h
European Bioinformatics Institute (EBI)
http://www.ebi.ac.uk/ena
Data Analysis
• Data splitting, clipping, and trimming
• Assembly (reference or de novo)
• Mapping to a reference
• Further analysis (e.g. gene finding)
• Further analysis (e.g. SNP trees)
Data Analysis: Bioinformatic platforms
• Unix (Mac OS X, Linux): Bioinformatic tools
• DOS (Windows): Bioinformatic tools
• CLC bio and MEGA
• Geneious
Unix…
Web-tools to the rescue!
+ Platform independent
+ Require few computer resources
+ Can be used anywhere
- Require patience
• http://www.genomicepidemiology.org/ :
  • MLST
  • Resistance genes
  • SNP calling and tree creation
  • Species identification
• https://main.g2.bx.psu.edu/ :
  • Many NGS tools
  • Steep learning curve
Data Analysis: Assembly, De novo
Different sequencers require different assemblers
• Depends on output and error profile
Assembler: Newbler
• 454
• Ion Torrent
Assembler: Velvet
• Illumina
• ABI SOLiD (color space)
Velvet – The unnecessarily complex assembler
• K-mer based assembler
• User needs to set K
• Longer reads allow a larger K
• Everything is defined in “Kmer-space”
• Nucleotide_length = Kmer_length + K - 1
• Kmer_coverage = Nucleotide_coverage * (Read_length - K + 1) / Read_length
Velvet assembly
Example
>NODE_1_length_91928_cov_23.136574
AGTTCATTGATAAATCTTTTTTGATTATCATCAACGAGTGCCCACACAGATTGATTGGTT
TATATTGTTAAAGAGCTTTTCCTATCGAAATCGCTTTTAAGCTCAATTCGCTAGGGCTGC
GTATATTACGCTTATTCAGTTGAGTGTCAAACGTTATTTTCTA...
K = 83
Nucleotide_length = Kmer_length + K - 1
91928 + 83 - 1 = 92010
Nucleotide_coverage = Kmer_coverage / ((Read_length - K + 1) / Read_length)
23.136574 / ((300 - 83 + 1) / 300) = 31.84
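The conversion out of "Kmer-space" can be sketched in Python as follows. The regular expression assumes contig headers of the form NODE_<id>_length_<kmer length>_cov_<kmer coverage>, as in the example; the function names are hypothetical, and K and the read length must be supplied by the user because they are not stored in the header.

```python
import re
from typing import Tuple

# Assumed header layout, taken from the example contig name.
HEADER = re.compile(r"NODE_(\d+)_length_(\d+)_cov_([\d.]+)")


def nucleotide_length(kmer_length: int, k: int) -> int:
    """Nucleotide_length = Kmer_length + K - 1"""
    return kmer_length + k - 1


def nucleotide_coverage(kmer_coverage: float, k: int,
                        read_length: int) -> float:
    """Invert Kmer_coverage = Nucleotide_coverage * (L - K + 1) / L."""
    return kmer_coverage * read_length / (read_length - k + 1)


def describe_node(header: str, k: int, read_length: int) -> Tuple[int, float]:
    """Return (length in nucleotides, nucleotide coverage) for one contig."""
    match = HEADER.search(header)
    if match is None:
        raise ValueError(f"unrecognised header: {header!r}")
    kmer_len = int(match.group(2))
    kmer_cov = float(match.group(3))
    return (nucleotide_length(kmer_len, k),
            nucleotide_coverage(kmer_cov, k, read_length))


length, cov = describe_node(">NODE_1_length_91928_cov_23.136574",
                            k=83, read_length=300)
print(length, round(cov, 2))  # 92010 31.84
```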
Data Analysis
De novo quality check
Number of contigs
- Fewer is generally better
N50
- Sort the contigs by size, largest first, and add up their sizes
- Stop when the running sum reaches 50% of the total size of contigs
- N50 is the size of the contig at that point
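N50 can be computed with a short Python sketch: sort the contigs largest first, sum their sizes, and report the size of the contig at which the running sum reaches 50% of the total. The function name `n50` and the contig sizes are hypothetical.

```python
from typing import List


def n50(contig_sizes: List[int]) -> int:
    """Size of the contig at which the running sum reaches half the total."""
    if not contig_sizes:
        raise ValueError("no contigs given")
    half = sum(contig_sizes) / 2
    running = 0
    for size in sorted(contig_sizes, reverse=True):
        running += size
        if running >= half:
            return size
    raise AssertionError("unreachable: the full sum always reaches half")


# Hypothetical assembly: total size 290, so the 50% mark is 145.
# 80 + 70 = 150 reaches it, so N50 is 70.
sizes = [80, 70, 50, 40, 30, 20]
print(len(sizes), "contigs, N50 =", n50(sizes))
```

A larger N50 with fewer contigs generally points to a more contiguous assembly, but neither number says whether the contigs are correct.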
Data Analysis: Further data analysis
Contigs:
• Gene finding
• Resistance
• MLST
• Etc.
Genes are not just genes…
• Find genes by Open Reading Frames + Shine-Dalgarno + motifs
• Not found does not mean it is NOT there
  • Not assembled
  • Truncated
• “Hypothetical” & “Putative” – The curse of bioinformatics
The evil circle of BLAST similarity:
• Annotated gene – verified in the lab
• “Hypothetical” or “Putative” annotations
• No match to original sequence
Suggested annotation service:
RAST: http://rast.nmpdr.org/
Mapping to a reference
Figure: raw reads aligned to the reference sequence. Some parts of the reference do not match any reads, and some reads do not match the reference.
Mappers:
• BWA
• Bowtie
• MAQ
• CGE
Thank you for listening
Questions?