Opera: Reconstructing optimal genomic scaffolds with high-throughput paired-end sequences
Song Gao, Niranjan Nagarajan, Wing-Kin Sung
National University of Singapore
Genome Institute of Singapore
Outline
• Overview
• Methods
- 1. Pre-Processing
- 2. A Special Case
- 3. Full Algorithm
- 4. Graph Contraction
- 5. Gap Estimation
• Results
• Ongoing Work
[Figure: from biological entities to data entities via a sequencing machine, reads, assembly and analysis]
Biological Entity: Genome, Transcripts, Microbial Community
Data Entity: Genomic Sequence, Transcript, Metagenome
Example reads: ACGTTTAACAGG…, TTACGATTCGATGA…, GCCATAATGCAAG…, CTTAGAATCGGATAGAC…, AGGCATAGACTAGAG…
Sequence Assembly
Reads -> Contigs -> Scaffolds
Paired-end Reads
Related Research Works
(I) Contig Level
- OLC Framework: Celera Assembler [Myers et al, 2000], Edena [Hernandez et al, 2008], Arachne [Batzoglou et al, 2002], PE Assembler [Ariyaratne et al, 2011]
- De Bruijn Graph: EULER [Pevzner et al, 2001], Velvet [Zerbino et al, 2008], ALLPATHS [Butler et al, 2008], SOAPdenovo [Li et al, 2010]
(II) Scaffold Level
- Comparative Assembly: AMOScmp [Pop, 2004], ABBA [Salzberg, 2008]
- Embedded Module: EULER [Pevzner et al, 2001], Arachne [Batzoglou et al, 2002], Celera Assembler [Myers et al, 2000], Velvet [Zerbino, 2008]
- Standalone Module: Bambus [Pop et al, 2004], SOPRA [Dayarian et al, 2010]
Scaffolding Problem [Huson et al, 2002]
Value Addition
- Gap Filling: GapCloser Module of SOAPdenovo
- Repeat Resolution
- Long-Range Genomic Structure
[Figure: contigs linked into a scaffold by paired-end reads, with distances labelled 1k, 3k and 2.5k; a discordant read is marked]
* Huson, D. H., Reinert, K., Myers E.W.: The greedy path-merging algorithm for contig scaffolding. Journal of the ACM 49(5), 603–615 (2002)
Data
- Sequencing Errors
- Read Length
- Coverage
Analysis
- Long Insert vs. Long Read [Chaisson, 2009; Zerbino, 2009]
- Statistics of Assembled Genomes [Schatz et al, 2010]

| Organism | Genome Size | # of Contigs | Contig N50 | # of Scaffolds | Scaffold N50 |
|---|---|---|---|---|---|
| Grapevine | 500Mb | 58,611 | 18.2kb | 2,093 | 1.33Mb |
| Panda | 2.4Gb | 200,604 | 36.7kb | 81,469 | 1.22Mb |
| Strawberry | 220Mb | 16,487 | 28.1kb | 3,263 | 1.44Mb |
| Turkey | 1.1Gb | 128,271 | 12.6kb | 26,917 | 1.5Mb |

* Zerbino, D.R.: Pebble and rock band: heuristic resolution of repeats and scaffolding in the velvet short-read de novo assembler. PLoS ONE 4(12) (2009)
* Chaisson, M.J., Brinza, D., Pevzner, P.A.: De novo fragment assembly with short mate-paired reads: does the read length matter? Genome Research 19, 336-346 (2009)
* Schatz, M.C., Delcher, A.L., Salzberg, S.L.: Assembly of large genomes using second-generation sequencing. Genome Research 20(9), 1165-1173 (2010)
* N50: Given a set of sequences of varying lengths, the N50 length is defined as the length N for which 50% of all bases in the sequences are in a sequence of length L >= N.
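The N50 definition above can be made concrete with a few lines of Python. This is a minimal sketch; the function name `n50` and the example lengths are illustrative and do not come from the slides.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    if not lengths:
        raise ValueError("need at least one sequence length")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down and stop as soon as the
    # sequences seen so far contain at least half of all bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Five contigs with 20 bases in total: the two longest (6 + 5) already
# hold more than half of the bases, so the N50 is 5.
print(n50([2, 3, 4, 5, 6]))
```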
NP-Complete [Huson et al, 2002]
* Huson, D. H., Reinert, K., Myers E.W.: The greedy path-merging algorithm for contig scaffolding. Journal of the ACM 49(5), 603–615 (2002)
Heuristic Methods
- Celera Assembler [Myers et al, 2000]
- Euler [Pevzner et al, 2001]
- Jazz [Chapman et al, 2002]
- Arachne [Batzoglou et al, 2002]
- Velvet [Zerbino et al, 2008]
- Bambus [Pop et al, 2004]
“True Complexity”
- Phase transition based on parameters [Hayes, 1996]: 3-SAT Problem
- Parametric Complexity [Downey et al, 1999]: Vertex Cover Problem, fixed-parameter tractability
* Hayes, B.: Can't get no satisfaction. American Scientist 85, 108-112 (1996)
* Downey, R. G., et al.: Parameterized Complexity: A Framework for Systematically Confronting Computational Intractability. DIMACS, Vol. 49 (1999)
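Vertex Cover is the textbook illustration of fixed-parameter tractability: the problem is NP-hard in general, yet a bounded search tree decides whether a cover of size at most k exists in time exponential only in k. The sketch below shows that standard technique; it is not taken from the slides, and the function name `vertex_cover` is illustrative.

```python
def vertex_cover(edges, k):
    """Return a vertex cover of size <= k as a set, or None if none exists."""
    if not edges:
        return set()          # nothing left to cover
    if k == 0:
        return None           # edges remain but the budget is exhausted
    u, v = edges[0]
    # Every cover must contain u or v, so branch on the two choices.
    # The recursion depth is at most k, giving at most 2**k branches.
    for chosen in (u, v):
        remaining = [e for e in edges if chosen not in e]
        cover = vertex_cover(remaining, k - 1)
        if cover is not None:
            return cover | {chosen}
    return None


path = [("a", "b"), ("b", "c"), ("c", "d")]
print(vertex_cover(path, 1))  # no single vertex touches all three edges
print(vertex_cover(path, 2))
```

The running time grows as 2^k times the number of edges, so the exponential part depends on the parameter and not on the size of the graph.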
Outline
• Overview
• Methods
- 1. Pre-Processing
- 2. A Special Case
- 3. Full Algorithm
- 4. Graph Contraction
- 5. Gap Estimation
• Results
• Ongoing Work
1. Pre-Processing
Paired-end Reads -> Clusters [Huson et al, 2002]
Chimeric Noise
- Filtered by simulation
- Upper Bound of Paired-end Reads
[Figure labels: Chimera, 3]
* Huson, D. H., Reinert, K., Myers E.W.: The greedy path-merging algorithm for contig scaffolding. Journal of the ACM 49(5), 603–615 (2002)
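The clustering step can be pictured as bundling all read pairs that link the same two contigs in the same relative orientation into one edge, then discarding weakly supported bundles as likely chimeric noise. The sketch below is a simplified illustration under that assumption: the tuple layout, the function name `cluster_links` and the `min_support` threshold are hypothetical, and the bundling in [Huson et al, 2002] also checks that the implied distances agree.

```python
from collections import defaultdict
from statistics import mean


def cluster_links(links, min_support=2):
    """Bundle read-pair links into one edge per (contig pair, orientation).

    links: iterable of (contig_a, contig_b, orientation, distance) tuples.
    Returns {(contig_a, contig_b, orientation): (support, mean_distance)}.
    """
    bundles = defaultdict(list)
    for a, b, orientation, distance in links:
        bundles[(a, b, orientation)].append(distance)

    edges = {}
    for key, distances in bundles.items():
        # Bundles with too few supporting pairs are treated as noise.
        # min_support is a placeholder, not the threshold used by Opera.
        if len(distances) >= min_support:
            edges[key] = (len(distances), mean(distances))
    return edges


links = [
    ("A", "B", "++", 1000),
    ("A", "B", "++", 1200),
    ("A", "C", "+-", 500),   # a single pair: dropped as unsupported
]
print(cluster_links(links))
```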
2. A Special Case
No discordant clusters in final scaffold
Naïve Solution
- +A
- +A+B, +A-B, +A+C, +A-C
- +A+B+C, +A+B-C, +A-C+B, +A-C-B, …
- ……
Exponential Time
[Figure: contigs A, B, C, D]
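To see why the naïve search is exponential, the sketch below enumerates every complete ordering of a contig set with every choice of orientation, using the same +A-C+B notation as the slide. It lists complete scaffolds only, whereas the slide grows partial scaffolds one contig at a time; the function name `signed_orderings` is illustrative.

```python
from itertools import permutations, product


def signed_orderings(contigs):
    """Yield every ordering of the contigs with every choice of orientation."""
    for order in permutations(contigs):
        for signs in product("+-", repeat=len(order)):
            yield "".join(sign + name for sign, name in zip(signs, order))


candidates = list(signed_orderings("ABC"))
print(len(candidates))            # 3! orderings times 2**3 orientations
print("+A-C+B" in candidates)
```

For n contigs there are n! · 2^n candidates, so already a few dozen contigs put exhaustive enumeration out of reach.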
Dynamic Programming
- Scaffold Tail is Sufficient
- Analogous to Bandwidth Problem [Saxe, 1980]
- Orientation of Nodes
- Direction of Edges
- Discordant Edges …
[Figure labels: width (w), Upper Bound]
* J. Saxe: Dynamic programming algorithms for recognizing small-bandwidth graphs in polynomial time. SIAM J. on Algebraic and Discrete Methods 1(4), 363-369 (1980)
Equivalence class of scaffolds
S1 and S2 have the same tail -> They are in the same class
Feature of equivalence class:
- Use of the same set of contigs;
- All or none of them can be extended to a solution
[Figure: example scaffolds and their tails: +A-B+C, +D+E, -A+C, +D+E+F, …]
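The practical consequence of the equivalence classes is that a dynamic program only needs one representative per class. The sketch below groups partial scaffolds by a class key. It rests on a simplifying assumption that is not from the slides: the tail is taken to be the shortest suffix whose contigs span at least `w` bases (gaps ignored), where `w` stands for the library's upper bound, and the set of used contigs is stored explicitly in the key. The names `tail` and `class_key` and the contig lengths are hypothetical.

```python
def tail(scaffold, lengths, w):
    """Shortest suffix of the scaffold whose contigs span at least w bases."""
    covered = 0
    for start in range(len(scaffold) - 1, -1, -1):
        covered += lengths[scaffold[start][1:]]   # strip the +/- sign
        if covered >= w:
            return tuple(scaffold[start:])
    return tuple(scaffold)                        # scaffold shorter than w


def class_key(scaffold, lengths, w):
    """Key under which equivalent partial scaffolds collide."""
    used = frozenset(contig[1:] for contig in scaffold)
    return used, tail(scaffold, lengths, w)


lengths = {"A": 5000, "B": 4000, "C": 6000, "D": 7000}
s1 = ["+A", "-B", "+C", "+D"]
s2 = ["-B", "+A", "+C", "+D"]   # differs only before the tail
s3 = ["+A", "-B", "+D", "+C"]   # different tail

print(class_key(s1, lengths, 10000) == class_key(s2, lengths, 10000))
print(class_key(s1, lengths, 10000) == class_key(s3, lengths, 10000))
```

Storing visited keys in a set lets a search skip any partial scaffold whose class has already been explored.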
3. Full Algorithm
Consider discordant clusters
Equivalence Class
- Number of Discordant Edges (p)
Sources of discordant edges shown on the slide:
- Chimeric Reads
- Sequencing Errors: ACCAAAATTT vs. ACCAAGAATTT
- Mapping Errors: CTAGAA vs. CAAGAA?
4. Graph Contraction
5. Gap Estimation
Utility
- Genome finishing (Genome Size Estimation)
- Scaffold Correctness
Calculate Gap Sizes
- Maximum Likelihood
- Quadratic Function
- Solved through quadratic programming [Goldfarb et al, 1983]
- Polynomial Time
[Figure labels: g1, g2, g3; μ, σ]
* Goldfarb, D., Idnani, A.: A numerically stable dual method for solving strictly convex quadratic programs. Mathematical Programming, 27 (1983)
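One way to see why maximum likelihood leads to a quadratic function is the following sketch. It assumes that insert sizes are normally distributed with mean μ and standard deviation σ, that read pair i spans the set of gaps S_i, and that c_i is the known number of contig bases between its two ends. The symbols c_i and S_i are introduced here for illustration and do not appear on the slide.

```latex
% Observed insert size of pair i: known bases plus the gaps it spans.
\ell_i(g) = c_i + \sum_{j \in S_i} g_j

% Gaussian log-likelihood of all pairs, up to an additive constant.
\log L(g) = -\sum_i \frac{\bigl(\ell_i(g) - \mu\bigr)^2}{2\sigma^2} + \text{const}

% Maximizing the likelihood is therefore a convex quadratic minimization.
\hat{g} = \arg\min_{g} \sum_i \Bigl(c_i + \sum_{j \in S_i} g_j - \mu\Bigr)^2
```

The objective is a sum of squares of linear functions of the gap sizes g_1, g_2, g_3, …, which is the kind of convex quadratic program that quadratic programming solvers handle in polynomial time. Any additional constraints the authors place on the gap sizes are not shown here.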
Outline
• Overview
• Methods
- 1. Pre-Processing
- 2. A Special Case
- 3. Full Algorithm
- 4. Graph Contraction
- 5. Gap Estimation
• Results
• Ongoing Work
Runtime Comparison

| | ◆ E. coli | ★ B. pseudomallei | ◆ S. cerevisiae | ◆ D. melanogaster |
|---|---|---|---|---|
| Bambus | 50s | 16m | 2m | 3m |
| SOPRA | 49m | - | 2h | 5h |
| Opera | 4s | 7m | 11s | 30s |

• Coverage of 300bp insert library: >20X
• Coverage of 10kbp insert library: 2X
• Contigs assembled using Velvet
◆ Simulated data set using MetaSim
★ In-house data
Scaffold Contiguity
[Figure: two bar charts, N50 and Max Length, comparing Velvet, Bambus, SOPRA and Opera on E. coli, B. pseudomallei, S. cerevisiae and D. melanogaster]
Scaffold Correctness
[Figure: bar chart of # of Breakpoints for Velvet, Bambus, SOPRA and Opera on E. coli, S. cerevisiae and D. melanogaster]
Scaffold Correctness
[Figure: bar chart of # of Discordant Edges for Velvet and Opera on E. coli, S. cerevisiae and D. melanogaster]

| | E. coli | S. cerevisiae | D. melanogaster |
|---|---|---|---|
| Opera | 1 | 3 | 4 |
| Bambus | 19 | 55 | 423 |
Ongoing Work
A Rodent Genome

| | Genome Size | N50 |
|---|---|---|
| Opera | ~2Gbp | 765.5Kbp |
| SSpace | | 281.7Kbp |

A Tree Genome

| | Genome Size | N50 | Max Length |
|---|---|---|---|
| Opera | ~300Mbp | 209.9Kbp | 921.8Kbp |
Ongoing Work
Repeats
Lower bounds and better scaffold
Multiple Libraries
Other applications
Metagenomics
Cancer Genomics
Link: https://sourceforge.net/projects/operasf/
Acknowledgement
Questions?
Wing-Kin Sung Niranjan Nagarajan
Pramila N. Ariyaratne
Funding:
A*STAR of Singapore
Ministry of Education, Singapore
NUS Graduate School for Integrative Sciences and Engineering (NGS)