Institute of Biomedical Sciences University of São Paulo
![Page 1: Institute of Biomedical Sciences University of São Paulo](https://reader036.fdocuments.in/reader036/viewer/2022062323/5681662a550346895dd98ab5/html5/thumbnails/1.jpg)
Institute of Biomedical Sciences, University of São Paulo
Arthur Gruber Coccilab – ICB/USP
DNA Assembly and Mapping
![Page 2: Institute of Biomedical Sciences University of São Paulo](https://reader036.fdocuments.in/reader036/viewer/2022062323/5681662a550346895dd98ab5/html5/thumbnails/2.jpg)
Next-generation sequencing platforms
• Mid-2000s: next-generation sequencers (NGS) were developed
• 2004 – 454 (Roche, formerly 454 Life Sciences)
• 2006 – Illumina (formerly Solexa)
• 2008 – SOLiD (Life Technologies, formerly Applied Biosystems)
• 2011 – Ion Torrent/Proton (Life Technologies)
• 2011 – PacBio RS (Pacific Biosciences)
• Massively parallel, shotgun-type sequencing (random fragments)
• Generate millions of sequences in a single run at a low cost per base
![Page 3: Institute of Biomedical Sciences University of São Paulo](https://reader036.fdocuments.in/reader036/viewer/2022062323/5681662a550346895dd98ab5/html5/thumbnails/3.jpg)
Data generation x cost
Moore's Law
Cost per Mb of sequence
Source: Sboner et al. (2011) - Genome Biol. 12 (8): 125
![Page 4: Institute of Biomedical Sciences University of São Paulo](https://reader036.fdocuments.in/reader036/viewer/2022062323/5681662a550346895dd98ab5/html5/thumbnails/4.jpg)
Evolution of sequencing costs
An estimate of the evolution of sequencing costs over the last 10 years. Costs are given for sequencing a megabase using a logarithmic scale. This curve is adapted from [15]. Time of introduction of new technologies is indicated.
Source: Delseny et al. (2010). Plant Science 179 (5): 407–422
![Page 5: Institute of Biomedical Sciences University of São Paulo](https://reader036.fdocuments.in/reader036/viewer/2022062323/5681662a550346895dd98ab5/html5/thumbnails/5.jpg)
NGS – Lower cost and greater data generation
Source: Sboner et al. (2011) - Genome Biol. 12 (8): 125
![Page 6: Institute of Biomedical Sciences University of São Paulo](https://reader036.fdocuments.in/reader036/viewer/2022062323/5681662a550346895dd98ab5/html5/thumbnails/6.jpg)
Next-generation sequencing platforms
Source: Glenn (2011). Mol Ecol Resources 11: 759–769
![Page 7: Institute of Biomedical Sciences University of São Paulo](https://reader036.fdocuments.in/reader036/viewer/2022062323/5681662a550346895dd98ab5/html5/thumbnails/7.jpg)
Next-generation sequencing platforms
Source: Glenn (2011). Mol Ecol Resources 11: 759–769
![Page 8: Institute of Biomedical Sciences University of São Paulo](https://reader036.fdocuments.in/reader036/viewer/2022062323/5681662a550346895dd98ab5/html5/thumbnails/8.jpg)
Next-generation sequencing platforms
Source: Glenn (2011). Mol Ecol Resources 11: 759–769
![Page 9: Institute of Biomedical Sciences University of São Paulo](https://reader036.fdocuments.in/reader036/viewer/2022062323/5681662a550346895dd98ab5/html5/thumbnails/9.jpg)
Different types of sequencing methods
A flow chart of the different types of sequencing methods
Source: Delseny et al. (2010). Plant Science 179 (5): 407–422
![Page 10: Institute of Biomedical Sciences University of São Paulo](https://reader036.fdocuments.in/reader036/viewer/2022062323/5681662a550346895dd98ab5/html5/thumbnails/10.jpg)
454 Workflow
Source: Mardis. (2008). Annu. Rev. Genomics Hum. Genet. 9: 387–402
![Page 11: Institute of Biomedical Sciences University of São Paulo](https://reader036.fdocuments.in/reader036/viewer/2022062323/5681662a550346895dd98ab5/html5/thumbnails/11.jpg)
Illumina Workflow
Source: Mardis. (2008). Annu. Rev. Genomics Hum. Genet. 9: 387–402
![Page 12: Institute of Biomedical Sciences University of São Paulo](https://reader036.fdocuments.in/reader036/viewer/2022062323/5681662a550346895dd98ab5/html5/thumbnails/12.jpg)
SOLiD Workflow
Source: Mardis. (2008). Annu. Rev. Genomics Hum. Genet. 9: 387–402
![Page 13: Institute of Biomedical Sciences University of São Paulo](https://reader036.fdocuments.in/reader036/viewer/2022062323/5681662a550346895dd98ab5/html5/thumbnails/13.jpg)
NGS platforms – applications
Source: Horner et al. (2009). Brief Bioinform 11 (2): 181-197.
![Page 14: Institute of Biomedical Sciences University of São Paulo](https://reader036.fdocuments.in/reader036/viewer/2022062323/5681662a550346895dd98ab5/html5/thumbnails/14.jpg)
NGS platforms – applications
Source: Horner et al. (2009). Brief Bioinform 11 (2): 181-197.
Table columns: Tool, Website, Category, Platform
![Page 15: Institute of Biomedical Sciences University of São Paulo](https://reader036.fdocuments.in/reader036/viewer/2022062323/5681662a550346895dd98ab5/html5/thumbnails/15.jpg)
Sequence assembly
• Current sequencing platforms can only generate sequence reads of dozens of bp (so-called short reads) or some hundreds of bp (Sanger, 454, Ion Torrent, PacBio)
• Computational tools are necessary to assemble the sequence reads into a larger sequence segment/genome
• Sequence assemblers use two different approaches to assemble reads:
• Overlap-layout-consensus
• de Bruijn graphs
Schatz et al. (2010) - Assembly of large genomes using second-generation sequencing
![Page 16: Institute of Biomedical Sciences University of São Paulo](https://reader036.fdocuments.in/reader036/viewer/2022062323/5681662a550346895dd98ab5/html5/thumbnails/16.jpg)
K-mer graph
Source: Miller et al. (2010). Genomics 95: 315-327
A pair-wise overlap represented by a K-mer graph.
(a) Two reads have an error-free overlap of 4 bases.
(b) One K-mer graph, with K=4, represents both reads. The pair-wise alignment is a by-product of the graph construction.
(c) The simple path through the graph implies a contig whose consensus sequence is easily reconstructed from the path.
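The construction described in (a) to (c) can be sketched in a few lines of Python. This is a minimal illustration and not the method of any particular assembler: the two reads and the function names (`build_kmer_graph`, `walk_simple_path`) are made up for the example, and K=4 follows the caption.

```python
from collections import defaultdict

def kmers(read, k):
    """Return all K-mers of a read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def build_kmer_graph(reads, k):
    """Nodes are K-mers; an edge links K-mers that are adjacent in some read."""
    graph = defaultdict(set)
    for read in reads:
        ks = kmers(read, k)
        for left, right in zip(ks, ks[1:]):
            graph[left].add(right)
    return graph

def walk_simple_path(graph):
    """Follow an unbranched path from its single start node and spell the contig."""
    targets = {node for successors in graph.values() for node in successors}
    starts = [node for node in graph if node not in targets]
    if len(starts) != 1:
        raise ValueError("expected exactly one start node")
    node = starts[0]
    contig = node
    seen = {node}
    # Stop at a dead end, a branch, or a node already visited (cycle).
    while len(graph.get(node, ())) == 1:
        node = next(iter(graph[node]))
        if node in seen:
            break
        seen.add(node)
        contig += node[-1]  # each step adds one new base
    return contig

# Hypothetical reads with an error-free overlap of 4 bases (GGTT).
reads = ["AACCGGTT", "GGTTCCAA"]
contig = walk_simple_path(build_kmer_graph(reads, k=4))
print(contig)
```

The overlap is never computed explicitly: the shared K-mer GGTT is a single node, so both reads end up on the same path.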
![Page 17: Institute of Biomedical Sciences University of São Paulo](https://reader036.fdocuments.in/reader036/viewer/2022062323/5681662a550346895dd98ab5/html5/thumbnails/17.jpg)
Complexity in K-mer graphs
Source: Miller et al. (2010). Genomics 95: 315-327
Complexity in K-mer graphs can be diagnosed with read multiplicity information. In these graphs, edges represented in more reads are drawn with thicker arrows.
(a) An errant base call toward the end of a read causes a “spur” or short dead-end branch. The same pattern could be induced by coincidence of zero coverage after polymorphism near a repeat.
(b) An errant base call near a read middle causes a “bubble” or alternate path. Polymorphisms between donor chromosomes would be expected to induce a bubble with parity of read multiplicity on the divergent paths.
(c) Repeat sequences lead to the “frayed rope” pattern of convergent and divergent paths.
![Page 18: Institute of Biomedical Sciences University of São Paulo](https://reader036.fdocuments.in/reader036/viewer/2022062323/5681662a550346895dd98ab5/html5/thumbnails/18.jpg)
de Bruijn Graphs
Advantages:
• Can deal with large amounts of data; consolidates redundant reads (high coverage) in a very efficient way
• Sequencing errors are promptly identified from the topology of the graph and k-mer coverage
Figure labels: de Bruijn graph; error; edge formation in the graph
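How k-mer coverage exposes a sequencing error can be illustrated with a simple count. The sketch below is not taken from any specific assembler; the reads, the function names and the `min_count=2` threshold are assumptions chosen for the example.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count how many times each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def low_coverage_kmers(counts, min_count=2):
    """k-mers seen fewer than min_count times are candidate errors."""
    return sorted(kmer for kmer, n in counts.items() if n < min_count)

# Three error-free copies of a read plus one copy with a miscalled base
# near its 3' end (C -> G at the seventh position). Hypothetical data.
reads = ["ACGTTGCA"] * 3 + ["ACGTTGGA"]
counts = count_kmers(reads, k=4)
suspects = low_coverage_kmers(counts)
print(suspects)
```

The two low-count k-mers form a short dead-end branch off the well-covered path, which is the "spur" pattern produced by an error near the end of a read.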
![Page 19: Institute of Biomedical Sciences University of São Paulo](https://reader036.fdocuments.in/reader036/viewer/2022062323/5681662a550346895dd98ab5/html5/thumbnails/19.jpg)
Evaluating assemblies
• Size of largest contig
• Number of contigs > n length
• N50
Given a set of sequences of varying lengths, the N50 length is defined as the length N for which half of all bases in the sequences are in a sequence of length L ≥ N. In other words, N50 is the contig length such that using contigs of equal or greater length produces half the bases of the assembly. Therefore, the number of bases in all sequences shorter than the N50 will be approximately equal to the number of bases in all sequences longer than the N50.
![Page 20: Institute of Biomedical Sciences University of São Paulo](https://reader036.fdocuments.in/reader036/viewer/2022062323/5681662a550346895dd98ab5/html5/thumbnails/20.jpg)
Evaluating assemblies
• N50
Contig or scaffold N50 is a weighted median statistic such that 50% of the entire assembly is contained in contigs or scaffolds equal to or larger than this value.
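The definition translates directly into code. The following is a minimal sketch; the function name `nx` and the contig lengths are invented for the example, and the statistic is computed against the total assembly size, as in the definition above.

```python
def nx(lengths, percent=50):
    """Return the Nx length: the largest length such that contigs of that
    length or longer contain at least `percent` % of all assembled bases."""
    if not lengths:
        raise ValueError("no contig lengths given")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer arithmetic avoids floating-point rounding at the threshold.
        if running * 100 >= percent * total:
            return length

# Hypothetical contig lengths (total = 300 bases).
contigs = [80, 70, 50, 40, 30, 20, 10]
n50 = nx(contigs, 50)
n80 = nx(contigs, 80)
print(n50, n80)
```

Here the two longest contigs (80 + 70 = 150) already hold half of the 300 bases, so the N50 is 70 even though most contigs are shorter.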
![Page 21: Institute of Biomedical Sciences University of São Paulo](https://reader036.fdocuments.in/reader036/viewer/2022062323/5681662a550346895dd98ab5/html5/thumbnails/21.jpg)
Some definitions
• Contig
A sequence contig is a contiguous sequence resulting from the reassembly of the small, overlapping DNA fragments (reads) generated by sequencing strategies.
• Scaffold
Using paired-end sequencing technology, the distance between both sequence ends of a fragment is known. This gives additional information about the orientation of contigs constructed from these reads and allows for their assembly into scaffolds.
![Page 22: Institute of Biomedical Sciences University of São Paulo](https://reader036.fdocuments.in/reader036/viewer/2022062323/5681662a550346895dd98ab5/html5/thumbnails/22.jpg)
Libraries for NGS platforms
![Page 23: Institute of Biomedical Sciences University of São Paulo](https://reader036.fdocuments.in/reader036/viewer/2022062323/5681662a550346895dd98ab5/html5/thumbnails/23.jpg)
Paired-end technology
Source: Delseny et al. (2010). Plant Science 179 (5): 407–422
A) Schematic drawing of the paired-end technology. Adaptors and genome fragments are represented respectively by the black and grey lines.
B) Strategy for sequencing large DNA fragments: short reads are assembled into contigs. A high coverage is required. In the next steps, paired-ends derived from larger fragments are used to assemble contigs into scaffolds.
![Page 24: Institute of Biomedical Sciences University of São Paulo](https://reader036.fdocuments.in/reader036/viewer/2022062323/5681662a550346895dd98ab5/html5/thumbnails/24.jpg)
Contigs and scaffolds
![Page 25: Institute of Biomedical Sciences University of São Paulo](https://reader036.fdocuments.in/reader036/viewer/2022062323/5681662a550346895dd98ab5/html5/thumbnails/25.jpg)
An example of a real 454 data file
![Page 26: Institute of Biomedical Sciences University of São Paulo](https://reader036.fdocuments.in/reader036/viewer/2022062323/5681662a550346895dd98ab5/html5/thumbnails/26.jpg)
An example of a real 454 data file
![Page 27: Institute of Biomedical Sciences University of São Paulo](https://reader036.fdocuments.in/reader036/viewer/2022062323/5681662a550346895dd98ab5/html5/thumbnails/27.jpg)
An example of a real 454 data file
![Page 28: Institute of Biomedical Sciences University of São Paulo](https://reader036.fdocuments.in/reader036/viewer/2022062323/5681662a550346895dd98ab5/html5/thumbnails/28.jpg)
NGS platforms – performances and features
Source: Horner et al. (2009). Brief Bioinform 11 (2): 181-197.
![Page 29: Institute of Biomedical Sciences University of São Paulo](https://reader036.fdocuments.in/reader036/viewer/2022062323/5681662a550346895dd98ab5/html5/thumbnails/29.jpg)
Comparison of De Novo Genome Assemblers
Source: Zhang et al. (2011). PLoS ONE 6 (3): e17915.
![Page 30: Institute of Biomedical Sciences University of São Paulo](https://reader036.fdocuments.in/reader036/viewer/2022062323/5681662a550346895dd98ab5/html5/thumbnails/30.jpg)
Comparison of De Novo Genome Assemblers
Source: Zhang et al. (2011). PLoS ONE 6 (3): e17915.
Accuracy and integrity for 36-mer dataset assembly. The quality of the resulting contigs is shown by:
(A) the accuracy of assembled contigs
(B) the genome coverage of the assembled contigs. No data is shown when the RAM is insufficient or the assembly tool is not suitable for the dataset.
![Page 31: Institute of Biomedical Sciences University of São Paulo](https://reader036.fdocuments.in/reader036/viewer/2022062323/5681662a550346895dd98ab5/html5/thumbnails/31.jpg)
Comparison of De Novo Genome Assemblers
Source: Zhang et al. (2011). PLoS ONE 6 (3): e17915.
Accuracy and integrity for 75-mer dataset assembly. The quality of the resulting contigs is shown by:
(A) the accuracy of assembled contigs
(B) the genome coverage of the assembled contigs. No data is shown when the RAM is insufficient or the assembly tool is not suitable for the dataset.
![Page 32: Institute of Biomedical Sciences University of São Paulo](https://reader036.fdocuments.in/reader036/viewer/2022062323/5681662a550346895dd98ab5/html5/thumbnails/32.jpg)
Comparison of De Novo Genome Assemblers
Source: Zhang et al. (2011). PLoS ONE 6 (3): e17915.
Statistics for assembled contigs of 36-mer short reads. Indicators that describe the contig size distribution are used for the analysis. "#" denotes that the machine's RAM was insufficient, and "N/A" means the data are not available. The N50 and N80 sizes are the largest contig lengths for which all contigs greater than or equal to that length cover 50% or 80% of the reference genome.
![Page 33: Institute of Biomedical Sciences University of São Paulo](https://reader036.fdocuments.in/reader036/viewer/2022062323/5681662a550346895dd98ab5/html5/thumbnails/33.jpg)
Comparison of De Novo Genome Assemblers
Source: Zhang et al. (2011). PLoS ONE 6 (3): e17915.
Statistics for assembled contigs of 75-mer short reads. Indicators that describe the contig size distribution are used for the analysis. "#" denotes that the machine's RAM was insufficient, and "N/A" means the data are not available. The N50 and N80 sizes are the largest contig lengths for which all contigs greater than or equal to that length cover 50% or 80% of the reference genome.
![Page 34: Institute of Biomedical Sciences University of São Paulo](https://reader036.fdocuments.in/reader036/viewer/2022062323/5681662a550346895dd98ab5/html5/thumbnails/34.jpg)
Genomes assembled de novo exclusively from Illumina short sequence reads
Organisms:
• Turkey (Meleagris gallopavo)
• Giant panda (Ailuropoda melanoleuca)
• Bacillus subtilis 168
• Bacillus subtilis natto
• Pseudomonas syringae pv. tabaci 11528
• Pseudomonas syringae pv. syringae Psy642
• Pseudomonas syringae pv. tomato T1
• Pseudomonas syringae pv. aesculi
• Apple scab (Venturia inaequalis)
• Pine (Pinus species) chloroplast
Paszkiewicz & Studholme (2010). Brief Bioinform 11 (5): 457-472.
![Page 35: Institute of Biomedical Sciences University of São Paulo](https://reader036.fdocuments.in/reader036/viewer/2022062323/5681662a550346895dd98ab5/html5/thumbnails/35.jpg)
Assembly results using real Illumina single-end and paired-end reads from SRA
Source: Bao et al. (2011). Journal of Human Genetics 56: 406–414.
![Page 36: Institute of Biomedical Sciences University of São Paulo](https://reader036.fdocuments.in/reader036/viewer/2022062323/5681662a550346895dd98ab5/html5/thumbnails/36.jpg)
Biases in real short-read sequence data
Source: Paszkiewicz & Studholme (2010). Brief Bioinform 11 (5): 457-472.
(A) Illustrates the depth of coverage by aligned reads over the 6 Mb circular chromosome. Coverage is shallower around the 3 Mb region than it is near the origin of replication (position 0)
(B) Illustrates the expected frequency distribution of alignment depth, assuming random sampling of the genome
(C) Illustrates the observed frequency distribution of alignment depth, which is broader than the expected distribution, indicating greater variance due to biased sampling.
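Under random sampling, alignment depth is commonly modelled as a Poisson distribution whose mean is the average coverage c = N × L / G (number of reads × read length / genome size). Whether panel (B) was drawn with exactly this model is an assumption here. The sketch below uses the 6 Mb genome size from the caption; the read count and the 36 bp read length are hypothetical, as are the function names.

```python
import math
import statistics

def mean_depth(num_reads, read_length, genome_size):
    """Average depth of coverage c = N * L / G."""
    return num_reads * read_length / genome_size

def poisson_pmf(depth, mean):
    """Probability of a given depth under a Poisson model (mean > 0).
    Computed in log space so large depths do not overflow."""
    return math.exp(depth * math.log(mean) - mean - math.lgamma(depth + 1))

def dispersion_index(depths):
    """Variance-to-mean ratio: about 1 for Poisson-like data,
    larger when sampling is biased and the distribution is broader."""
    return statistics.pvariance(depths) / statistics.mean(depths)

# 6 Mb genome (from the caption); 5 million 36 bp reads (hypothetical).
c = mean_depth(num_reads=5_000_000, read_length=36, genome_size=6_000_000)
expected = [poisson_pmf(d, c) for d in range(61)]
print(c, max(expected))
```

Comparing `dispersion_index` of the observed per-base depths against 1 gives a single number for the broadening seen in panel (C).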
![Page 37: Institute of Biomedical Sciences University of São Paulo](https://reader036.fdocuments.in/reader036/viewer/2022062323/5681662a550346895dd98ab5/html5/thumbnails/37.jpg)
Limitations of next-generation genome sequence assembly
Source: Alkan et al. (2010). Nat Methods 8(1): 61-65.
Limitations:
• NGS technologies typically generate shorter sequences with higher error rates from relatively short insert libraries
• Assembly of longer repeats and duplications will suffer from this short read length
• Assembly methods for short reads are based on de Bruijn graph and Eulerian path approaches, which have difficulty in assembling complex regions of the genome.
• DNA contamination or insertion polymorphism?
![Page 38: Institute of Biomedical Sciences University of São Paulo](https://reader036.fdocuments.in/reader036/viewer/2022062323/5681662a550346895dd98ab5/html5/thumbnails/38.jpg)
Limitations of next-generation genome sequence assembly
Source: Alkan et al. (2010). Nat Methods 8(1): 61-65.
Limitations:
• Repeat content
• WGS-based de novo sequence assembly algorithms will collapse identical repeats, resulting in reduced or lost genomic complexity.
• Missing and fragmented genes
![Page 39: Institute of Biomedical Sciences University of São Paulo](https://reader036.fdocuments.in/reader036/viewer/2022062323/5681662a550346895dd98ab5/html5/thumbnails/39.jpg)
Data generation and analysis steps of a typical RNA-seq experiment.
Source: Martin & Wang. (2011). Nature Reviews Genetics 12, 671-682.
![Page 40: Institute of Biomedical Sciences University of São Paulo](https://reader036.fdocuments.in/reader036/viewer/2022062323/5681662a550346895dd98ab5/html5/thumbnails/40.jpg)
Reference-based transcriptome assembly strategy
Source: Martin & Wang. (2011). Nature Reviews Genetics 12, 671-682.
![Page 41: Institute of Biomedical Sciences University of São Paulo](https://reader036.fdocuments.in/reader036/viewer/2022062323/5681662a550346895dd98ab5/html5/thumbnails/41.jpg)
Overview of the de novo transcriptome assembly strategy
Source: Martin & Wang. (2011). Nature Reviews Genetics 12, 671-682.
![Page 42: Institute of Biomedical Sciences University of São Paulo](https://reader036.fdocuments.in/reader036/viewer/2022062323/5681662a550346895dd98ab5/html5/thumbnails/42.jpg)
Alternative approaches for combined transcriptome assembly
Source: Martin & Wang. (2011). Nature Reviews Genetics 12, 671-682.
![Page 43: Institute of Biomedical Sciences University of São Paulo](https://reader036.fdocuments.in/reader036/viewer/2022062323/5681662a550346895dd98ab5/html5/thumbnails/43.jpg)
Software for transcriptome assembly
Source: Martin & Wang. (2011). Nature Reviews Genetics 12, 671-682.
![Page 44: Institute of Biomedical Sciences University of São Paulo](https://reader036.fdocuments.in/reader036/viewer/2022062323/5681662a550346895dd98ab5/html5/thumbnails/44.jpg)
Splice-aware short-read aligners
Source: Martin & Wang. (2011). Nature Reviews Genetics 12, 671-682.
![Page 45: Institute of Biomedical Sciences University of São Paulo](https://reader036.fdocuments.in/reader036/viewer/2022062323/5681662a550346895dd98ab5/html5/thumbnails/45.jpg)
Mapping reads onto a reference sequence
Programs:
• Bowtie is an ultrafast, memory-efficient short read aligner. It aligns short DNA sequences (reads) to the human genome at a rate of over 25 million 35-bp reads per hour.
• Available at http://bowtie-bio.sourceforge.net/index.shtml
• SHRiMP is a software package for aligning genomic reads against a target genome.
• Available at http://compbio.cs.toronto.edu/shrimp/
• BarraCUDA - an ultrafast short read sequence alignment software using GPUs.
• Available at http://www.many-core.group.cam.ac.uk/projects/lam.shtml
• Burrows-Wheeler Aligner (BWA) is an efficient program that aligns relatively short nucleotide sequences against a long reference sequence such as the human genome.
• Available at http://bio-bwa.sourceforge.net/
![Page 46: Institute of Biomedical Sciences University of São Paulo](https://reader036.fdocuments.in/reader036/viewer/2022062323/5681662a550346895dd98ab5/html5/thumbnails/46.jpg)
Mapping reads onto a reference sequence
Programs:
• BLAT is a bioinformatics tool that performs rapid mRNA/DNA and cross-species protein alignments.
• Available at http://www.kentinformatics.com/products.html
• BFAST facilitates the fast and accurate mapping of short reads to reference sequences. Some advantages of BFAST include:
• Speed: enables billions of short reads to be mapped quickly.
• Accuracy: a priori probabilities for mapping reads with a defined set of variants.
• An easy way to measurably tune accuracy at the expense of speed.
• Available at http://sourceforge.net/apps/mediawiki/bfast/index.php?title=Main_Page
![Page 47: Institute of Biomedical Sciences University of São Paulo](https://reader036.fdocuments.in/reader036/viewer/2022062323/5681662a550346895dd98ab5/html5/thumbnails/47.jpg)
Visualizing reads mapped onto a reference sequence
Programs:
• TABLET - lightweight, high-performance graphical viewer for next-generation sequence assemblies and alignments.
• Available at http://bioinf.scri.ac.uk/tablet/index.shtml
• IGV - Integrative Genomics Viewer - a high-performance visualization tool for interactive exploration of large, integrated genomic datasets.
• Available at http://www.broadinstitute.org/igv/
![Page 48: Institute of Biomedical Sciences University of São Paulo](https://reader036.fdocuments.in/reader036/viewer/2022062323/5681662a550346895dd98ab5/html5/thumbnails/48.jpg)
TABLET - graphical viewer
![Page 49: Institute of Biomedical Sciences University of São Paulo](https://reader036.fdocuments.in/reader036/viewer/2022062323/5681662a550346895dd98ab5/html5/thumbnails/49.jpg)
Integrative Genomics Viewer (IGV)
![Page 50: Institute of Biomedical Sciences University of São Paulo](https://reader036.fdocuments.in/reader036/viewer/2022062323/5681662a550346895dd98ab5/html5/thumbnails/50.jpg)
Data formats - SOLiD
Color Space:
• Also known as 2-base (Di-Base) encoding, is based on ligation sequencing rather than sequencing by synthesis.
• Each base in this sequencing method is read twice. A true SNP therefore changes two adjacent color space calls, whereas a single sequencing error changes only one; in order to miscall a SNP, two adjacent colors must be miscalled.
• Requires specific software to manipulate the data. Most assemblers are not designed to deal with color space.
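To make the 2-base encoding concrete, the sketch below decodes a color-space read into bases. It uses the standard SOLiD color code (0 = same base, 1 = A↔C or G↔T, 2 = A↔G or C↔T, 3 = A↔T or C↔G) and assumes the leading letter of a csfasta read is the last base of the sequencing adaptor, so it is not returned as part of the read. The function name and the short example read are illustrative only.

```python
# NEXT_BASE[previous_base][color] gives the following base.
NEXT_BASE = {
    "A": "ACGT",
    "C": "CATG",
    "G": "GTAC",
    "T": "TGCA",
}

def decode_color_space(cs_read):
    """Translate a color-space read (known first base + colors) to bases."""
    previous = cs_read[0]          # known adaptor base, e.g. 'T'
    bases = []
    for color in cs_read[1:]:
        previous = NEXT_BASE[previous][int(color)]
        bases.append(previous)
    return "".join(bases)

good = decode_color_space("T2230033")
# Same read with a single miscalled color (third color 3 -> 0).
bad = decode_color_space("T2200033")
print(good, bad)
```

A single wrong color shifts every base decoded after it, which is one reason color-space data is usually processed by color-space-aware software rather than translated read by read.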
![Page 51: Institute of Biomedical Sciences University of São Paulo](https://reader036.fdocuments.in/reader036/viewer/2022062323/5681662a550346895dd98ab5/html5/thumbnails/51.jpg)
Data formats - SOLiD
SOLiD 4 – data is provided as *.csfasta and *.qual
csfasta:
>1_7_80_F3
T223003300123201021020110010200020002200000000300000000001000020000110002200
>1_7_157_F3
T120030200320003020020010020100300003100031000300001000000000010000000000000
>1_7_202_F3
T230020100031001030000230000000200003100000000000003000000000010000000000000
qual:
>1_7_80_F3
40 42 16 4 42 4 7 32 4 42 4 27 36 4 42 4 16 42 4 42 4 27 35 4 4 4 27 35 4 4 7 27 4 4 4 4 27 4 4 4 4 22 4 4 4 4 4 4 4 4 4 4 4 4 4 4 16 4 4 4 4 16 4 4 4 4 16 11 4 4 4 22 7 4 4
>1_7_157_F3
40 42 4 4 42 42 40 4 4 40 42 32 4 4 42 4 7 4 4 7 4 4 36 4 4 40 4 16 4 4 36 4 4 4 4 42 4 4 4 4 42 4 4 4 4 36 4 4 4 4 36 4 4 4 4 4 7 4 4 4 4 4 4 4 4 4 7 4 4 4 4 4 4 4 4
>1_7_202_F3
42 42 4 4 42 42 42 4 4 42 40 35 4 4 42 4 27 4 4 40 4 36 42 4 4 36 4 42 4 4 27 4 4 4 4 16 4 4 4 4 7 7 4 4 4 16 4 4 4 4 4 4 4 4 4 4 7 4 4 4 4 16 4 4 4 4 4 4 4 4 4 11 4 4 4
![Page 52: Institute of Biomedical Sciences University of São Paulo](https://reader036.fdocuments.in/reader036/viewer/2022062323/5681662a550346895dd98ab5/html5/thumbnails/52.jpg)
Data formats - SOLiD
Color space can be converted into DE (Double encoding)
Life Technologies provides a set of scripts (SOLiD™ de novo accessory tools 2.0) for conversion and data usage with Velvet assembler.
• The program prepares reads in the format accepted by Velvet assembly engine.
• The program removes the first base and first color, then double-encodes the reads (i.e., 0 for A, 1 for C, 2 for G, 3 for T).
• After running the assembler, the DE contigs must be converted into base space.
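The double-encoding step can be sketched as below. This is not the Life Technologies script, only an illustration of the mapping stated above; the function name is invented, and translating a missing color call (`.`) to `N` is an assumption.

```python
# Colors are rewritten as letters so a base-space assembler can read them.
# The resulting "sequence" is still color space, not real nucleotides.
DOUBLE_ENCODING = str.maketrans("0123.", "ACGTN")

def double_encode(cs_read):
    """Drop the first base and the first color, then map colors to letters."""
    colors = cs_read[2:]
    return colors.translate(DOUBLE_ENCODING)

encoded = double_encode("T2230033")
print(encoded)
```

Contigs assembled from such reads are themselves double-encoded, which is why they must be converted back into base space afterwards.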
![Page 53: Institute of Biomedical Sciences University of São Paulo](https://reader036.fdocuments.in/reader036/viewer/2022062323/5681662a550346895dd98ab5/html5/thumbnails/53.jpg)
Data formats - SOLiD
XSQ:
• Extensible Sequence (XSQ) format introduced with the 5500 series SOLiD Sequencer.
• Developed to store each call and quality value in a single byte, which results in file sizes that are up to 75% smaller.
• Binary format – can be converted into *.csfasta/*.qual and *.fastq formats using the SOLiD™ System XSQ Tools (available from Life Technologies)
![Page 54: Institute of Biomedical Sciences University of São Paulo](https://reader036.fdocuments.in/reader036/viewer/2022062323/5681662a550346895dd98ab5/html5/thumbnails/54.jpg)
Data formats – 454 platform
FASTA and QUAL:
• Files can be provided in FASTA (*.fna) and QUAL (*.qual) formats.
SFF:
• Standard Flowgram Format - equivalent of the scf/ab1/trace file for Sanger sequencing, contains information on the signal strength for each flow.
• Binary format – can be converted into FASTA/QUAL using a Python script (sff_extract) or using the sff2fastq script.
![Page 55: Institute of Biomedical Sciences University of São Paulo](https://reader036.fdocuments.in/reader036/viewer/2022062323/5681662a550346895dd98ab5/html5/thumbnails/55.jpg)
Data formats – Illumina
FASTQ
• Originally developed at the Wellcome Trust Sanger Institute to bundle a FASTA sequence and its quality data.
• FASTQ format is a text-based format for storing both a biological sequence (usually nucleotide sequence) and its corresponding quality scores.
• Adopted by the Illumina Genome Analyzer.
• FASTQ has become an almost universal format. It is accepted by many assemblers (e.g. Edena, Euler, Velvet, ABySS, etc.) and sequence mapping programs (e.g. Bowtie, BFAST, SHRiMP, MOSAIK, etc.)
• FASTQ can be converted into FASTA using the FASTX-Toolkit.
![Page 56: Institute of Biomedical Sciences University of São Paulo](https://reader036.fdocuments.in/reader036/viewer/2022062323/5681662a550346895dd98ab5/html5/thumbnails/56.jpg)
Data formats – Illumina
FASTQ
• Both the sequence letter and quality score are encoded with a single ASCII character for brevity.
• Line 1 begins with a '@' character and is followed by a sequence identifier and an optional description (like a FASTA title line).
• Line 2 is the raw sequence letters.
• Line 3 begins with a '+' character and is optionally followed by the same sequence identifier (and any description) again.
• Line 4 encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence.
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
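The four-line layout is simple enough to parse by hand. The sketch below is a minimal reader that assumes each record occupies exactly four lines (no wrapped sequences); the function name `read_fastq` is invented for the example, and the record is the one shown above.

```python
import io

def read_fastq(handle):
    """Yield (identifier, sequence, quality) for each 4-line FASTQ record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:], sequence, quality

EXAMPLE = "\n".join([
    "@SEQ_ID",
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT",
    "+",
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65",
]) + "\n"

records = list(read_fastq(io.StringIO(EXAMPLE)))
name, sequence, quality = records[0]
print(name, len(sequence), len(quality))
```

For real data an established parser is usually the safer choice, since it also handles edge cases such as wrapped lines.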
![Page 57: Institute of Biomedical Sciences University of São Paulo](https://reader036.fdocuments.in/reader036/viewer/2022062323/5681662a550346895dd98ab5/html5/thumbnails/57.jpg)
Data formats – Illumina
FASTQ Encoding
• Sanger and Illumina use slightly different base quality calculations.
• Sanger: $Q_{\text{sanger}} = -10 \log_{10} p$
• Illumina (prior to version 1.3): $Q_{\text{illumina}} = -10 \log_{10} [p / (1 - p)]$
• Solexa/Illumina 1.0 format can encode a quality score from -5 to 62 using ASCII 59 to 126 (Solexa+64).
• Sanger format uses Phred quality from 0 to 93 using ASCII characters 33 to 126 (Phred+33):
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
ASCII codes marked on the scale: 33 (!), 59 (;), 64 (@), 73 (I), 104 (h), 126 (~)
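The two quality formulas and the two ASCII offsets can be tied together in a short sketch. The function names are invented for the example; the conversion from Solexa to Phred scores follows algebraically from the two definitions above (solve the Solexa formula for p and substitute it into the Sanger formula).

```python
import math

def phred_from_error(p):
    """Sanger/Phred quality: Q = -10 log10(p)."""
    return -10 * math.log10(p)

def solexa_from_error(p):
    """Solexa quality: Q = -10 log10(p / (1 - p))."""
    return -10 * math.log10(p / (1 - p))

def solexa_to_phred(q_solexa):
    """Convert a Solexa score to the equivalent Phred score."""
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)

def decode_quality(quality_string, offset=33):
    """Turn an ASCII quality string into integer scores.
    offset=33 for Sanger (Phred+33), offset=64 for Solexa+64."""
    return [ord(char) - offset for char in quality_string]

print(phred_from_error(0.001))             # error probability of 1 in 1000
print(decode_quality("!I"))                # Sanger encoding, ASCII 33 and 73
print(decode_quality(";h", offset=64))     # Solexa+64, ASCII 59 and 104
```

The two scales nearly coincide for high-quality bases and diverge only at low quality, where the Solexa score can become negative.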