Next-generation sequencing: informatics & software aspects
![Page 1: Next-generation sequencing: informatics & software aspects](https://reader034.fdocuments.in/reader034/viewer/2022051315/56813424550346895d9b1093/html5/thumbnails/1.jpg)
Next-generation sequencing: informatics & software aspects
Gabor T. Marth
Boston College Biology Department
![Page 2: Next-generation sequencing: informatics & software aspects](https://reader034.fdocuments.in/reader034/viewer/2022051315/56813424550346895d9b1093/html5/thumbnails/2.jpg)
Next-gen data
![Page 3: Next-generation sequencing: informatics & software aspects](https://reader034.fdocuments.in/reader034/viewer/2022051315/56813424550346895d9b1093/html5/thumbnails/3.jpg)
Read length
[Figure: read length ranges on a read length [bp] axis with ticks from 0 to 400: ~200-450 (variable); 25-70 (fixed); 25-50 (fixed); 20-60 (variable)]
![Page 4: Next-generation sequencing: informatics & software aspects](https://reader034.fdocuments.in/reader034/viewer/2022051315/56813424550346895d9b1093/html5/thumbnails/4.jpg)
Paired fragment-end reads
• fragment amplification: fragment length 100 - 600 bp
• fragment length limited by amplification efficiency
• circularization: 500 bp - 10 kb (sweet spot ~3 kb)
• fragment length limited by library complexity
Korbel et al. Science 2007
• paired-end reads can improve read mapping accuracy (if unique map positions are required for both ends) or efficiency (if a fragment length constraint is used to rescue non-uniquely mapping ends)
• instrumental for structural variation discovery
![Page 5: Next-generation sequencing: informatics & software aspects](https://reader034.fdocuments.in/reader034/viewer/2022051315/56813424550346895d9b1093/html5/thumbnails/5.jpg)
Representational biases
• this affects genome resequencing (deeper starting read coverage is needed)
• will have a major impact on counting applications
“dispersed” coverage distribution
![Page 6: Next-generation sequencing: informatics & software aspects](https://reader034.fdocuments.in/reader034/viewer/2022051315/56813424550346895d9b1093/html5/thumbnails/6.jpg)
Amplification errors
many reads from clonal copies of a single fragment
• early PCR errors in “clonal” read copies lead to false positive allele calls
early amplification error gets propagated into every clonal copy
![Page 7: Next-generation sequencing: informatics & software aspects](https://reader034.fdocuments.in/reader034/viewer/2022051315/56813424550346895d9b1093/html5/thumbnails/7.jpg)
Read quality
![Page 8: Next-generation sequencing: informatics & software aspects](https://reader034.fdocuments.in/reader034/viewer/2022051315/56813424550346895d9b1093/html5/thumbnails/8.jpg)
Error rate (Solexa)
![Page 9: Next-generation sequencing: informatics & software aspects](https://reader034.fdocuments.in/reader034/viewer/2022051315/56813424550346895d9b1093/html5/thumbnails/9.jpg)
Error rate (454)
![Page 10: Next-generation sequencing: informatics & software aspects](https://reader034.fdocuments.in/reader034/viewer/2022051315/56813424550346895d9b1093/html5/thumbnails/10.jpg)
Per-read errors (Solexa)
![Page 11: Next-generation sequencing: informatics & software aspects](https://reader034.fdocuments.in/reader034/viewer/2022051315/56813424550346895d9b1093/html5/thumbnails/11.jpg)
Per-read errors (454)
![Page 12: Next-generation sequencing: informatics & software aspects](https://reader034.fdocuments.in/reader034/viewer/2022051315/56813424550346895d9b1093/html5/thumbnails/12.jpg)
Applications
![Page 13: Next-generation sequencing: informatics & software aspects](https://reader034.fdocuments.in/reader034/viewer/2022051315/56813424550346895d9b1093/html5/thumbnails/13.jpg)
Genome resequencing for variation discovery
SNPs
short INDELs
structural variations
• the most immediate application area
![Page 14: Next-generation sequencing: informatics & software aspects](https://reader034.fdocuments.in/reader034/viewer/2022051315/56813424550346895d9b1093/html5/thumbnails/14.jpg)
Genome resequencing for mutational profiling
Organismal reference sequence
• likely to change “classical genetics” and mutational analysis
![Page 15: Next-generation sequencing: informatics & software aspects](https://reader034.fdocuments.in/reader034/viewer/2022051315/56813424550346895d9b1093/html5/thumbnails/15.jpg)
De novo genome sequencing
Lander et al. Nature 2001
• difficult problem with short reads
• promising, especially as reads get longer
![Page 16: Next-generation sequencing: informatics & software aspects](https://reader034.fdocuments.in/reader034/viewer/2022051315/56813424550346895d9b1093/html5/thumbnails/16.jpg)
Identification of protein-bound DNA
Chromatin structure (ChIP-seq) (Mikkelsen et al. Nature 2007)
Transcription factor binding sites (Robertson et al. Nature Methods 2007)
DNA methylation (Meissner et al. Nature 2008)
• natural applications for next-gen. sequencers
![Page 17: Next-generation sequencing: informatics & software aspects](https://reader034.fdocuments.in/reader034/viewer/2022051315/56813424550346895d9b1093/html5/thumbnails/17.jpg)
Transcriptome sequencing: transcript discovery
Mortazavi et al. Nature Methods 2008
Ruby et al. Cell, 2006
• high-throughput, but short reads pose challenges
![Page 18: Next-generation sequencing: informatics & software aspects](https://reader034.fdocuments.in/reader034/viewer/2022051315/56813424550346895d9b1093/html5/thumbnails/18.jpg)
Transcriptome sequencing: expression profiling
Jones-Rhoades et al. PLoS Genetics, 2007
Cloonan et al. Nature Methods, 2008
• high-throughput, short-read sequencing should make a major impact, and potentially replace expression microarrays
![Page 19: Next-generation sequencing: informatics & software aspects](https://reader034.fdocuments.in/reader034/viewer/2022051315/56813424550346895d9b1093/html5/thumbnails/19.jpg)
Analysis software (resequencing)
![Page 20: Next-generation sequencing: informatics & software aspects](https://reader034.fdocuments.in/reader034/viewer/2022051315/56813424550346895d9b1093/html5/thumbnails/20.jpg)
Individual resequencing
(i) base calling
(ii) read mapping
(iii) read assembly
(iv) SNP and short INDEL calling
(v) SV calling
(vi) data validation, hypothesis generation
[Figure labels: REF, IND]
![Page 21: Next-generation sequencing: informatics & software aspects](https://reader034.fdocuments.in/reader034/viewer/2022051315/56813424550346895d9b1093/html5/thumbnails/21.jpg)
The variation discovery “toolbox”
• base callers
• read mappers
• SNP callers
• SV callers
• assembly viewers
GigaBayes
![Page 22: Next-generation sequencing: informatics & software aspects](https://reader034.fdocuments.in/reader034/viewer/2022051315/56813424550346895d9b1093/html5/thumbnails/22.jpg)
1. Base calling
base sequence
base quality (Q-value) sequence
diverse chemistry & sequencing error profiles
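The relationship between a base quality value and an error probability can be sketched as follows, assuming the Q-values are Phred-scaled (Q = -10 log10 of the error probability), as is conventional for base callers. The function names are illustrative only and do not come from any of the tools named in these slides.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred-scaled quality value Q to an error probability."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert an error probability (0 < p <= 1) to a Phred-scaled quality."""
    return -10 * math.log10(p)


# Under this scale Q20 means a 1-in-100 chance of a wrong base call,
# and Q30 means a 1-in-1000 chance.
for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

A well calibrated base caller is one where, among all bases assigned a given Q, the measured error rate matches the probability this conversion gives.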
![Page 23: Next-generation sequencing: informatics & software aspects](https://reader034.fdocuments.in/reader034/viewer/2022051315/56813424550346895d9b1093/html5/thumbnails/23.jpg)
454 pyrosequencer error profile
• multiple bases in a homopolymeric run are incorporated in a single incorporation test
• the number of bases must be determined from a single scalar signal
• as a result, the majority of errors are INDELs
![Page 24: Next-generation sequencing: informatics & software aspects](https://reader034.fdocuments.in/reader034/viewer/2022051315/56813424550346895d9b1093/html5/thumbnails/24.jpg)
454 base quality values
• the native 454 base caller assigns base quality values that are too low
![Page 25: Next-generation sequencing: informatics & software aspects](https://reader034.fdocuments.in/reader034/viewer/2022051315/56813424550346895d9b1093/html5/thumbnails/25.jpg)
PYROBAYES: determine base number
![Page 26: Next-generation sequencing: informatics & software aspects](https://reader034.fdocuments.in/reader034/viewer/2022051315/56813424550346895d9b1093/html5/thumbnails/26.jpg)
PYROBAYES: Performance
• better correlation between assigned and measured quality values
• higher fraction of high-quality bases
![Page 27: Next-generation sequencing: informatics & software aspects](https://reader034.fdocuments.in/reader034/viewer/2022051315/56813424550346895d9b1093/html5/thumbnails/27.jpg)
Base quality value calibration
Raw Illumina reads (1000G data)
![Page 28: Next-generation sequencing: informatics & software aspects](https://reader034.fdocuments.in/reader034/viewer/2022051315/56813424550346895d9b1093/html5/thumbnails/28.jpg)
Recalibrated base quality values (Illumina)
Recalibrated Illumina reads (1000G data)
![Page 29: Next-generation sequencing: informatics & software aspects](https://reader034.fdocuments.in/reader034/viewer/2022051315/56813424550346895d9b1093/html5/thumbnails/29.jpg)
2. Read mapping
Read mapping is like doing a jigsaw puzzle…
…you get the pieces…
…and they give you the picture on the box
Unique pieces are easier to place than others…
![Page 30: Next-generation sequencing: informatics & software aspects](https://reader034.fdocuments.in/reader034/viewer/2022051315/56813424550346895d9b1093/html5/thumbnails/30.jpg)
Non-uniqueness of reads confounds mapping
• Reads from repeats cannot be uniquely mapped back to their true region of origin
• RepeatMasker does not capture all micro-repeats, i.e. repeats at the scale of the read length
![Page 31: Next-generation sequencing: informatics & software aspects](https://reader034.fdocuments.in/reader034/viewer/2022051315/56813424550346895d9b1093/html5/thumbnails/31.jpg)
Strategies to deal with non-unique mapping
• Non-unique read mapping: optionally either only report uniquely mapped reads or report all map locations for each read (mapping quality values for all mapped reads are being implemented)
[Figure: one read with alignment probabilities 0.8, 0.19, and 0.01 at three map locations]
• mapping to multiple loci requires the assignment of alignment probabilities (mapping qualities)
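One way to turn per-locus alignment probabilities into mapping qualities is sketched below, using the slide's example values 0.8, 0.19 and 0.01. It assumes a Phred-style scale (quality = -10 log10 of the probability that the placement is wrong) and a cap of 60. Both are assumptions for illustration, not a description of how any mapper named in these slides computes mapping quality.

```python
import math


def mapping_qualities(alignment_scores, cap=60.0):
    """Normalize per-locus alignment likelihoods and Phred-scale them.

    Returns one (posterior probability, mapping quality) tuple per locus.
    """
    total = sum(alignment_scores)
    results = []
    for score in alignment_scores:
        posterior = score / total
        p_wrong = 1.0 - posterior
        # Cap the quality so a probability of (almost) zero does not blow up.
        if p_wrong <= 10 ** (-cap / 10):
            quality = cap
        else:
            quality = -10 * math.log10(p_wrong)
        results.append((posterior, quality))
    return results


for posterior, quality in mapping_qualities([0.8, 0.19, 0.01]):
    print(f"P = {posterior:.2f}  mapping quality = {quality:.1f}")
```

Even the best of the three placements gets a mapping quality of only about 7 here, which is why a downstream SNP caller would treat such a read with caution.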
![Page 32: Next-generation sequencing: informatics & software aspects](https://reader034.fdocuments.in/reader034/viewer/2022051315/56813424550346895d9b1093/html5/thumbnails/32.jpg)
Longer reads are easier to map
454 FLX (1000G data)
![Page 33: Next-generation sequencing: informatics & software aspects](https://reader034.fdocuments.in/reader034/viewer/2022051315/56813424550346895d9b1093/html5/thumbnails/33.jpg)
Paired-end reads help unique read placement
PE
• fragment amplification: fragment length 100 - 600 bp
• fragment length limited by amplification efficiency
MP
• circularization: 500 bp - 10 kb (sweet spot ~3 kb)
• fragment length limited by library complexity
Korbel et al. Science 2007
• PE reads are now the standard for genome resequencing
![Page 34: Next-generation sequencing: informatics & software aspects](https://reader034.fdocuments.in/reader034/viewer/2022051315/56813424550346895d9b1093/html5/thumbnails/34.jpg)
MOSAIK
![Page 35: Next-generation sequencing: informatics & software aspects](https://reader034.fdocuments.in/reader034/viewer/2022051315/56813424550346895d9b1093/html5/thumbnails/35.jpg)
INDEL alleles/errors – gapped alignments
454
![Page 36: Next-generation sequencing: informatics & software aspects](https://reader034.fdocuments.in/reader034/viewer/2022051315/56813424550346895d9b1093/html5/thumbnails/36.jpg)
Aligning multiple read types together
ABI/capillary
454 FLX
454 GS20
Illumina
• Alignment and co-assembly of multiple read types permits simultaneous analysis of data from multiple sources with different error characteristics
![Page 37: Next-generation sequencing: informatics & software aspects](https://reader034.fdocuments.in/reader034/viewer/2022051315/56813424550346895d9b1093/html5/thumbnails/37.jpg)
Aligner speed
![Page 38: Next-generation sequencing: informatics & software aspects](https://reader034.fdocuments.in/reader034/viewer/2022051315/56813424550346895d9b1093/html5/thumbnails/38.jpg)
3. Polymorphism / mutation detection
sequencing error
polymorphism
![Page 39: Next-generation sequencing: informatics & software aspects](https://reader034.fdocuments.in/reader034/viewer/2022051315/56813424550346895d9b1093/html5/thumbnails/39.jpg)
Allele calling in “trad” sequences
capillary sequences:
• either clonal
• or diploid traces
![Page 40: Next-generation sequencing: informatics & software aspects](https://reader034.fdocuments.in/reader034/viewer/2022051315/56813424550346895d9b1093/html5/thumbnails/40.jpg)
Allele calling in next-gen data
SNP
INS
New technologies are well suited to accurate SNP calling, and some also to short-INDEL detection
![Page 41: Next-generation sequencing: informatics & software aspects](https://reader034.fdocuments.in/reader034/viewer/2022051315/56813424550346895d9b1093/html5/thumbnails/41.jpg)
Human genome polymorphism projects
common SNPs
![Page 42: Next-generation sequencing: informatics & software aspects](https://reader034.fdocuments.in/reader034/viewer/2022051315/56813424550346895d9b1093/html5/thumbnails/42.jpg)
Human genome polymorphism discovery
![Page 43: Next-generation sequencing: informatics & software aspects](https://reader034.fdocuments.in/reader034/viewer/2022051315/56813424550346895d9b1093/html5/thumbnails/43.jpg)
The 1000 Genomes Project
![Page 44: Next-generation sequencing: informatics & software aspects](https://reader034.fdocuments.in/reader034/viewer/2022051315/56813424550346895d9b1093/html5/thumbnails/44.jpg)
New challenges for SNP calling
• deep alignments of 100s / 1000s of individuals
• trio sequences
![Page 45: Next-generation sequencing: informatics & software aspects](https://reader034.fdocuments.in/reader034/viewer/2022051315/56813424550346895d9b1093/html5/thumbnails/45.jpg)
Rare alleles in 100s / 1,000s of samples
![Page 46: Next-generation sequencing: informatics & software aspects](https://reader034.fdocuments.in/reader034/viewer/2022051315/56813424550346895d9b1093/html5/thumbnails/46.jpg)
Allele discovery is a multi-step sampling process
Population → Samples → Reads → Allele detection
![Page 47: Next-generation sequencing: informatics & software aspects](https://reader034.fdocuments.in/reader034/viewer/2022051315/56813424550346895d9b1093/html5/thumbnails/47.jpg)
Capturing the allele in the sample
[Figure: Prob(allele captured in sample), from 0 to 1, plotted against population AF from 1E-04 to 0.5, with one curve each for n=100, n=200, n=400, n=800, n=1600]
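Curves of this kind can be sketched under a simple sampling model. The model is an assumption here, not something stated on the slide: n is taken to be the number of diploid individuals, so 2n chromosomes are drawn independently from the population, and an allele at population frequency f is captured if at least one drawn chromosome carries it.

```python
def prob_allele_captured(allele_freq: float, n_individuals: int, ploidy: int = 2) -> float:
    """Probability that at least one sampled chromosome carries the allele.

    Assumes independent draws of ploidy * n_individuals chromosomes.
    """
    n_chromosomes = ploidy * n_individuals
    return 1.0 - (1.0 - allele_freq) ** n_chromosomes


# Tabulate the capture probability for the sample sizes shown in the figure.
frequencies = (1e-4, 1e-3, 1e-2, 5e-2)
for n in (100, 200, 400, 800, 1600):
    row = [f"{prob_allele_captured(f, n):.3f}" for f in frequencies]
    print(f"n={n:5d}", row)
```

Under this model an allele at 0.1% frequency is captured in fewer than one in five samples of 100 individuals, so rare-allele discovery depends strongly on sample size.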
![Page 48: Next-generation sequencing: informatics & software aspects](https://reader034.fdocuments.in/reader034/viewer/2022051315/56813424550346895d9b1093/html5/thumbnails/48.jpg)
Allele calling in deep sequence data
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaCgtacctac
aatgtagtaCgtacctac

aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac

aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac

| | Q30 | Q40 | Q50 | Q60 |
|---|---|---|---|---|
| 1 | 0.01 | 0.01 | 0.1 | 0.5 |
| 2 | 0.82 | 1.0 | 1.0 | 1.0 |
| 3 | 1.0 | 1.0 | 1.0 | 1.0 |
![Page 49: Next-generation sequencing: informatics & software aspects](https://reader034.fdocuments.in/reader034/viewer/2022051315/56813424550346895d9b1093/html5/thumbnails/49.jpg)
Allele calling in the reads
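$$\Pr(G_{i_1},\ldots,G_{i_n} \mid B) = \frac{\prod_{k=1}^{n}\left[\sum_{T_k}\Pr(B_k \mid T_k)\,\Pr(T_k \mid G_{i_k})\right]\Pr(G_{i_1},\ldots,G_{i_n})}{\sum_{l_1,\ldots,l_n}\prod_{k=1}^{n}\left[\sum_{T_k}\Pr(B_k \mid T_k)\,\Pr(T_k \mid G_{l_k})\right]\Pr(G_{l_1},\ldots,G_{l_n})}$$
[Annotations on the equation: base call, base quality, individual read coverage, sample size]
GigaBayes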
![Page 50: Next-generation sequencing: informatics & software aspects](https://reader034.fdocuments.in/reader034/viewer/2022051315/56813424550346895d9b1093/html5/thumbnails/50.jpg)
More samples or deeper coverage / sample?
Shallower read coverage from more individuals …
…or deeper coverage from fewer samples?
simulation analysis by Aaron Quinlan
![Page 51: Next-generation sequencing: informatics & software aspects](https://reader034.fdocuments.in/reader034/viewer/2022051315/56813424550346895d9b1093/html5/thumbnails/51.jpg)
Analysis indicates a balance
![Page 52: Next-generation sequencing: informatics & software aspects](https://reader034.fdocuments.in/reader034/viewer/2022051315/56813424550346895d9b1093/html5/thumbnails/52.jpg)
SNP calling in trios
[Equation: Pr(G_C | G_M, G_F), the probability of each child genotype (11, 12, 22) for every combination of mother (G_M) and father (G_F) genotypes 11, 12, 22]
• the child inherits one chromosome from each parent
• there is a small probability for a mutation in the child
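The two bullets above can be turned into a small transmission model. The sketch below is an illustration under stated assumptions, not a transcription of the slide's formula: the site is biallelic with alleles 1 and 2, each parent passes on one of its two chromosomes with equal probability, and the transmitted allele flips to the other allele with a small probability `mu` (a hypothetical parameter name).

```python
from itertools import product

GENOTYPES = ("11", "12", "22")


def transmit_prob(parent_genotype: str, allele: str, mu: float) -> float:
    """Probability that a parent passes on `allele`, allowing mutation."""
    # Fraction of the parent's two chromosomes that carry the allele.
    frac = parent_genotype.count(allele) / 2
    # Either the allele is transmitted unchanged, or the other allele mutates into it.
    return frac * (1 - mu) + (1 - frac) * mu


def child_genotype_probs(mother: str, father: str, mu: float = 1e-8) -> dict:
    """Pr(child genotype | mother genotype, father genotype)."""
    probs = {g: 0.0 for g in GENOTYPES}
    for from_mother, from_father in product("12", repeat=2):
        p = transmit_prob(mother, from_mother, mu) * transmit_prob(father, from_father, mu)
        child = "".join(sorted(from_mother + from_father))
        probs[child] += p
    return probs


print(child_genotype_probs("12", "12", mu=0.0))
print(child_genotype_probs("11", "11", mu=1e-8))
```

With `mu` set to zero this reduces to plain Mendelian inheritance. With a small positive `mu`, a child genotype that neither parent could transmit gets a tiny but non-zero prior, which is what lets a trio-aware caller distinguish a candidate de novo mutation from a sequencing error.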
![Page 53: Next-generation sequencing: informatics & software aspects](https://reader034.fdocuments.in/reader034/viewer/2022051315/56813424550346895d9b1093/html5/thumbnails/53.jpg)
SNP calling in trios
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaCgtacctac

aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac

aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaCgtacctac

[Figure labels: mother, father, child; P=0.79, P=0.86]
![Page 54: Next-generation sequencing: informatics & software aspects](https://reader034.fdocuments.in/reader034/viewer/2022051315/56813424550346895d9b1093/html5/thumbnails/54.jpg)
Determining genotype directly from sequence
AACGTTAGCATA
AACGTTAGCATA
AACGTTCGCATA
AACGTTCGCATA

AACGTTCGCATA
AACGTTCGCATA
AACGTTCGCATA
AACGTTCGCATA

AACGTTAGCATA
AACGTTAGCATA

individual 1
individual 3
individual 2
A/C
C/C
A/A
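A minimal sketch of calling a genotype from the 12-base reads on this slide follows. It assumes a simple binomial model with a single fixed per-base error rate (`error_rate`, a hypothetical parameter) and ignores base qualities and priors, so it is far simpler than a real caller such as the Bayesian approach outlined earlier in the deck.

```python
import math


def call_genotype(bases, ref="A", alt="C", error_rate=0.01):
    """Return the most likely diploid genotype for the bases seen at one site."""
    # Probability of observing the alt allele in a read under each genotype.
    p_alt = {
        f"{ref}/{ref}": error_rate,
        f"{ref}/{alt}": 0.5,
        f"{alt}/{alt}": 1.0 - error_rate,
    }
    n_alt = sum(1 for b in bases if b == alt)
    n_ref = sum(1 for b in bases if b == ref)
    log_likelihood = {
        genotype: n_alt * math.log(p) + n_ref * math.log(1.0 - p)
        for genotype, p in p_alt.items()
    }
    return max(log_likelihood, key=log_likelihood.get)


SITE = 6  # 0-based position of the variable base within each read
read_groups = [
    ["AACGTTAGCATA", "AACGTTAGCATA", "AACGTTCGCATA", "AACGTTCGCATA"],
    ["AACGTTCGCATA"] * 4,
    ["AACGTTAGCATA"] * 2,
]
for reads in read_groups:
    bases = [read[SITE] for read in reads]
    print(bases, "->", call_genotype(bases))
```

With only two reads, as in the last group, the homozygous call is the most likely one but a heterozygote that happened to be sampled twice from the same chromosome cannot be ruled out, which is one reason coverage per individual matters.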
![Page 55: Next-generation sequencing: informatics & software aspects](https://reader034.fdocuments.in/reader034/viewer/2022051315/56813424550346895d9b1093/html5/thumbnails/55.jpg)
4. Structural variation discovery
![Page 56: Next-generation sequencing: informatics & software aspects](https://reader034.fdocuments.in/reader034/viewer/2022051315/56813424550346895d9b1093/html5/thumbnails/56.jpg)
SV events from PE read mapping patterns
• Deletion: LM ~ LF+Ldel & depth: low
• Tandem duplication: LM ~ LF-Ldup & depth: high
• Inversion: LM ~ +Linv & ends flipped, LM ~ -Linv; depth: normal
• Translocation: LM ~ LF+LT1, LM ~ LF+LT2, LM ~ LF-LT1-LT2 & depth: normal
• Insertion: un-paired read clusters & depth: normal
• Chromosomal translocation: LM ~ LF+LT & depth: normal & cross-paired read clusters
[Figure: DNA reference diagram and read-pair pattern for each event, labeled LM, LF, Ldel, Ldup, Linv, LT1, LT2, Lins, LT]
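The patterns above suggest a first-pass filter that flags individual read pairs as candidate evidence for an event. The sketch below assumes that LM is the distance the pair spans on the reference and LF is the expected library fragment length, and it uses a simple mean plus or minus a few standard deviations as the cutoff. The `ReadPair` type and all thresholds are hypothetical. A real caller would also cluster many pairs and check depth of coverage, as the patterns require.

```python
from dataclasses import dataclass


@dataclass
class ReadPair:
    chrom1: str          # chromosome of the first end
    chrom2: str          # chromosome of the second end
    mapped_length: int   # LM: distance spanned on the reference
    ends_flipped: bool   # True if the relative orientation of the ends is abnormal


def classify_pair(pair: ReadPair, frag_mean: float, frag_sd: float, n_sd: float = 3.0) -> str:
    """Label one read pair with the SV pattern it is consistent with."""
    if pair.chrom1 != pair.chrom2:
        return "chromosomal translocation candidate"
    if pair.ends_flipped:
        return "inversion candidate"
    if pair.mapped_length > frag_mean + n_sd * frag_sd:
        return "deletion candidate"       # LM ~ LF + Ldel
    if pair.mapped_length < frag_mean - n_sd * frag_sd:
        return "tandem duplication candidate"  # LM ~ LF - Ldup
    return "concordant"


pairs = [
    ReadPair("chr1", "chr1", 310, False),
    ReadPair("chr1", "chr1", 1500, False),
    ReadPair("chr1", "chr1", -200, False),
    ReadPair("chr1", "chr1", 300, True),
    ReadPair("chr1", "chr7", 0, False),
]
for pair in pairs:
    print(pair.mapped_length, classify_pair(pair, frag_mean=300, frag_sd=30))
```

Insertions are not covered by this per-pair test because, per the list above, they show up as clusters of un-paired reads rather than as an aberrant mapped length.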
![Page 57: Next-generation sequencing: informatics & software aspects](https://reader034.fdocuments.in/reader034/viewer/2022051315/56813424550346895d9b1093/html5/thumbnails/57.jpg)
Deletion: Aberrant positive mapping distance
![Page 58: Next-generation sequencing: informatics & software aspects](https://reader034.fdocuments.in/reader034/viewer/2022051315/56813424550346895d9b1093/html5/thumbnails/58.jpg)
Copy number estimation from depth of coverage
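A naive version of depth-based copy number estimation is sketched here. It assumes read depth has already been summarized in fixed-size windows, that most windows are at the normal diploid copy number, and that depth scales linearly with copy number. The representational biases discussed earlier in the deck break that last assumption in practice, so real tools normalize depth first.

```python
from statistics import median


def estimate_copy_number(window_depths, baseline_ploidy=2):
    """Estimate an integer copy number per window from read depth."""
    baseline = median(window_depths)
    if baseline <= 0:
        raise ValueError("median depth must be positive")
    # Depth relative to the typical window, scaled to the baseline ploidy.
    return [round(baseline_ploidy * depth / baseline) for depth in window_depths]


depths = [30, 31, 29, 15, 14, 30, 61, 59, 30]
print(estimate_copy_number(depths))  # [2, 2, 2, 1, 1, 2, 4, 4, 2]
```

In this toy input the two low-depth windows look like a heterozygous deletion and the two high-depth windows like a duplication.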
![Page 59: Next-generation sequencing: informatics & software aspects](https://reader034.fdocuments.in/reader034/viewer/2022051315/56813424550346895d9b1093/html5/thumbnails/59.jpg)
Spanner – a hybrid SV/CNV detection tool
Navigation bar
Fragment lengths in selected region
Depth of coverage in selected region
![Page 60: Next-generation sequencing: informatics & software aspects](https://reader034.fdocuments.in/reader034/viewer/2022051315/56813424550346895d9b1093/html5/thumbnails/60.jpg)
5. Data visualization
1. aid software development: integration of trace data viewing, fast navigation, zooming/panning
2. facilitate data validation (e.g. SNP validation): simultaneous viewing of multiple read types, quality value displays
3. promote hypothesis generation: integration of annotation tracks
![Page 61: Next-generation sequencing: informatics & software aspects](https://reader034.fdocuments.in/reader034/viewer/2022051315/56813424550346895d9b1093/html5/thumbnails/61.jpg)
Data visualization
![Page 62: Next-generation sequencing: informatics & software aspects](https://reader034.fdocuments.in/reader034/viewer/2022051315/56813424550346895d9b1093/html5/thumbnails/62.jpg)
New analysis tools are needed
1. Tailoring existing tools for specialized applications (e.g. read mappers for transcriptome sequencing)
2. Analysis pipelines and viewers that focus on the essential results, e.g. the few mutations in a mutant, or that compare 1000 genome sequences (but hide most details)
3. Work-bench style tools to support downstream analysis
![Page 63: Next-generation sequencing: informatics & software aspects](https://reader034.fdocuments.in/reader034/viewer/2022051315/56813424550346895d9b1093/html5/thumbnails/63.jpg)
Data storage and data standards
![Page 64: Next-generation sequencing: informatics & software aspects](https://reader034.fdocuments.in/reader034/viewer/2022051315/56813424550346895d9b1093/html5/thumbnails/64.jpg)
What level of data to store?
images
traces
base quality values
base-called reads
![Page 65: Next-generation sequencing: informatics & software aspects](https://reader034.fdocuments.in/reader034/viewer/2022051315/56813424550346895d9b1093/html5/thumbnails/65.jpg)
Data standards
• Sequence Read Format, SRF (Asim Siddiqui, UBC)
• Assembly format working group http://assembly.bc.edu
• Genotype Likelihood Format (Richard Durbin, Sanger)
![Page 66: Next-generation sequencing: informatics & software aspects](https://reader034.fdocuments.in/reader034/viewer/2022051315/56813424550346895d9b1093/html5/thumbnails/66.jpg)
Summary
![Page 67: Next-generation sequencing: informatics & software aspects](https://reader034.fdocuments.in/reader034/viewer/2022051315/56813424550346895d9b1093/html5/thumbnails/67.jpg)
Conclusions: next-gen sequencing software
• Next-generation sequencing is a boon for mass-scale human resequencing, whole-genome mutational profiling, expression analysis and epigenetic studies
• Informatics tools are already effective for basic applications
• There is a need both for “generic” analysis tools e.g. flexible read aligners and for specialized tools tailored to specific applications (e.g. expression profiling)
• Move toward tools that focus on biological analysis
• Most challenges are technical in nature (e.g. data storage, useful data formats, fast read mapping)
![Page 68: Next-generation sequencing: informatics & software aspects](https://reader034.fdocuments.in/reader034/viewer/2022051315/56813424550346895d9b1093/html5/thumbnails/68.jpg)
Software tools for next-gen data
http://bioinformatics.bc.edu/marthlab/Beta_Release
![Page 69: Next-generation sequencing: informatics & software aspects](https://reader034.fdocuments.in/reader034/viewer/2022051315/56813424550346895d9b1093/html5/thumbnails/69.jpg)
Roche / 454 system
• pyrosequencing technology
• variable read-length
• the only new technology with >100bp reads
![Page 70: Next-generation sequencing: informatics & software aspects](https://reader034.fdocuments.in/reader034/viewer/2022051315/56813424550346895d9b1093/html5/thumbnails/70.jpg)
Illumina / Solexa Genome Analyzer
• fixed-length short-read sequencer
• very high throughput
• read properties are very close to traditional capillary sequences
• low INDEL error rate
![Page 71: Next-generation sequencing: informatics & software aspects](https://reader034.fdocuments.in/reader034/viewer/2022051315/56813424550346895d9b1093/html5/thumbnails/71.jpg)
AB / SOLiD system
| 1st base \ 2nd base | A | C | G | T |
|---|---|---|---|---|
| A | 0 | 1 | 2 | 3 |
| C | 1 | 0 | 3 | 2 |
| G | 2 | 3 | 0 | 1 |
| T | 3 | 2 | 1 | 0 |
• fixed-length short-reads
• very high throughput
• 2-base encoding system
• color-space informatics
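The 2-base encoding can be sketched compactly: if the bases are numbered A=0, C=1, G=2, T=3, the color for a pair of adjacent bases is the bitwise XOR of their codes, which reproduces the standard SOLiD color matrix. The function names below are illustrative and are not taken from any vendor software.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_BASE = "ACGT"


def encode_colors(seq: str):
    """Return the first base plus one color (0-3) per adjacent base pair."""
    colors = [BASE_CODE[a] ^ BASE_CODE[b] for a, b in zip(seq, seq[1:])]
    return seq[0], colors


def decode_colors(first_base: str, colors) -> str:
    """Rebuild the base sequence from the first base and the color calls."""
    bases = [first_base]
    for color in colors:
        # XOR is its own inverse, so the previous base and the color give the next base.
        bases.append(CODE_BASE[BASE_CODE[bases[-1]] ^ color])
    return "".join(bases)


first, colors = encode_colors("ATGGC")
print(first, colors)                  # A [3, 1, 0, 3]
print(decode_colors(first, colors))   # ATGGC
```

Because each base is decoded from the one before it, a single wrong color call corrupts every downstream base in a naive decode. That is one reason color-space reads are usually aligned in color space rather than converted to bases first.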
![Page 72: Next-generation sequencing: informatics & software aspects](https://reader034.fdocuments.in/reader034/viewer/2022051315/56813424550346895d9b1093/html5/thumbnails/72.jpg)
Helicos / Heliscope system
• short-read sequencer
• single molecule sequencing
• no amplification
• variable read-length
• error rate reduced with 2-pass template sequencing
![Page 73: Next-generation sequencing: informatics & software aspects](https://reader034.fdocuments.in/reader034/viewer/2022051315/56813424550346895d9b1093/html5/thumbnails/73.jpg)
Data characteristics
![Page 74: Next-generation sequencing: informatics & software aspects](https://reader034.fdocuments.in/reader034/viewer/2022051315/56813424550346895d9b1093/html5/thumbnails/74.jpg)
Data standards
• different data storage needs (archival, transfer, processing) often pose contradictory requirements (e.g. normalized vs. non-normalized storage of assembly, alignment, read, image data)
• even different analysis goals often call for different optimal storage / data access strategies (e.g. paired-end read analysis for SV detection vs. SNP calling)
• requirements include binary formats, fast sequential and / or random access, and flexible indexing (e.g. an entire genome assembly can no longer reside in RAM)
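To make the binary format and random access requirements concrete, here is a minimal sketch using fixed-width records, so that any record can be reached with a single seek instead of a scan. The record layout (reference id, position, mapping quality, read length) is purely hypothetical and does not correspond to any of the formats named in these slides.

```python
import io
import struct

# Hypothetical fixed-width record: reference id (uint16), position (uint32),
# mapping quality (uint8), read length (uint8). Little-endian, no padding.
RECORD = struct.Struct("<HIBB")


def write_alignments(stream, alignments):
    """Append alignment records to a binary stream in order."""
    for record in alignments:
        stream.write(RECORD.pack(*record))


def read_alignment(stream, index):
    """Random access: jump straight to record `index` without scanning."""
    stream.seek(index * RECORD.size)
    return RECORD.unpack(stream.read(RECORD.size))


buffer = io.BytesIO()  # stands in for a file opened in binary mode
write_alignments(buffer, [(1, 1000, 60, 36), (1, 1050, 7, 36), (2, 500, 0, 36)])
print(RECORD.size)                # 8 bytes per record
print(read_alignment(buffer, 2))  # (2, 500, 0, 36)
```

Fixed-width records trade space for simplicity. Variable-length data such as the read bases themselves would need a separate index of byte offsets, which is the flexible indexing the last bullet refers to.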