Genome sequence assembly
Assembly concepts and methods
Mihai Pop
Center for Bioinformatics and Computational Biology
University of Maryland
Building a library
• Break DNA into random fragments (8-10x coverage)
Actual situation
Building a library
• Break DNA into random fragments (8-10x coverage)
• Sequence the ends of the fragments
  – Amplify the fragments in a vector
  – Sequence 800-1000 (500-700) bases at each end of the fragment
Assembling the fragments
Forward-reverse constraints
• The sequenced ends are facing towards each other
• The distance between the two fragments is known (within certain experimental error)
[Figure: a clone and its insert with forward (F) and reverse (R) end reads; F/R pairs placed on regions I and II]
Building Scaffolds
• Break DNA into random fragments (8-10x coverage)
• Sequence the ends of the fragments
• Assemble the sequenced ends
• Build scaffolds
Assembly gaps
sequencing gap - we know the order and orientation of the contigs and have at least one clone spanning the gap
physical gap - no information known about the adjacent contigs, nor about the DNA spanning the gap
[Figure: sequencing gaps and physical gaps]
Unifying view of assembly
[Figure: assembly and scaffolding]
Shotgun sequencing statistics
Typical contig coverage
[Figure: reads tiled along a contig, with per-position coverage from 1 to 6]
Imagine raindrops on a sidewalk
Lander-Waterman statistics
L = read length
T = minimum detectable overlap
G = genome size
N = number of reads
c = coverage (NL / G)
σ = 1 – T/L
$E(\#\text{islands}) = N e^{-c\sigma}$
$E(\text{island size}) = L\left(\frac{e^{c\sigma} - 1}{c} + 1 - \sigma\right)$
contig = island with 2 or more reads
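As a rough sketch, the two expectations above can be evaluated directly. The function name `lander_waterman` and the example parameters (1 Mbp genome, 600 bp reads, 40 bp detectable overlap) are illustrative choices, not part of the formulas themselves.

```python
import math

def lander_waterman(genome_size, read_len, min_overlap, n_reads):
    """Expected island count and island size from the formulas above."""
    c = n_reads * read_len / genome_size      # coverage, c = NL / G
    sigma = 1 - min_overlap / read_len        # sigma = 1 - T / L
    islands = n_reads * math.exp(-c * sigma)  # E(#islands)
    island_size = read_len * ((math.exp(c * sigma) - 1) / c + 1 - sigma)
    return c, islands, island_size

for n in (1667, 5000, 8334, 13334):
    c, islands, size = lander_waterman(1_000_000, 600, 40, n)
    print(f"c={c:.1f}  N={n}  E(#islands)={islands:.1f}  "
          f"E(island size)={size:,.0f} bp")
```

The expected number of islands drops roughly exponentially with coverage, which is why going from 5x to 8x closes most of the remaining gaps.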
Example
| c | N | #islands | #contigs | bases not in any read | bases not in contigs |
|---|---|---|---|---|---|
| 1 | 1,667 | 655 | 614 | 698 | 367,806 |
| 3 | 5,000 | 304 | 250 | 121 | 49,787 |
| 5 | 8,334 | 78 | 57 | 20 | 6,735 |
| 8 | 13,334 | 7 | 5 | 1 | 335 |

Genome size: 1 Mbp; read length: 600; detectable overlap: 40
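The N column above appears consistent with rounding cG/L up to a whole number of reads. A small sketch of that arithmetic follows; `reads_needed` is a hypothetical helper name.

```python
import math

def reads_needed(coverage, genome_size, read_len):
    """Number of reads N (rounded up) so that N * L / G reaches the coverage."""
    return math.ceil(coverage * genome_size / read_len)

for c in (1, 3, 5, 8):
    print(c, reads_needed(c, 1_000_000, 600))
```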
Experimental data
| X coverage | # ctgs | % > 2X | avg ctg size (L-W) | max ctg size | # ORFs |
|---|---|---|---|---|---|
| 1 | 284 | 54 | 1,234 (1,138) | 3,337 | 526 |
| 3 | 597 | 67 | 1,794 (4,429) | 9,589 | 1,092 |
| 5 | 548 | 79 | 2,495 (21,791) | 17,977 | 1,398 |
| 8 | 495 | 85 | 3,294 (302,545) | 64,307 | 1,762 |
| complete | 1 | 100 | 1.26 M | 1.26 M | 1,329 |

Caveat: numbers based on artificially chopping up the genome of Wolbachia pipientis dMel
Read coverage vs. Clone coverage
[Figure: 4 kbp insert with 1 kbp reads]
Read coverage = 8X
Clone (insert) coverage = 16X
2X coverage in BAC-ends implies 100X coverage by BACs (1 BAC clone = approx. 100 kbp)
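The relationship between the two kinds of coverage can be sketched as follows, assuming every clone contributes exactly two end reads. The helper name `clone_coverage` is hypothetical, and the 1 kbp read length used for the BAC case is an assumption carried over from the small-insert example.

```python
def clone_coverage(read_coverage, insert_len, read_len):
    """Clone (insert) coverage implied by a given read coverage.

    Each clone contributes two end reads, so it adds 2 * read_len bases of
    read coverage but insert_len bases of clone coverage.
    """
    return read_coverage * insert_len / (2 * read_len)

print(clone_coverage(8, 4_000, 1_000))    # 4 kbp inserts, 1 kbp reads -> 16.0
print(clone_coverage(2, 100_000, 1_000))  # BAC ends, assumed 1 kbp reads -> 100.0
```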
Assembly paradigms
• Overlap-layout-consensus
  – greedy (TIGR Assembler, phrap, CAP3...)
  – graph-based (Celera Assembler, Arachne)
• Eulerian path (especially useful for short read sequencing)
TIGR Assembler/phrap
Greedy
• Build a rough map of fragment overlaps
• Pick the highest-scoring overlap
• Merge the two fragments
• Repeat until no more merges can be done
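The greedy loop above can be sketched in a few lines. This is a toy illustration, not how TIGR Assembler or phrap are implemented: it scores an overlap by the length of an exact suffix-prefix match, ignores reverse complements and sequencing errors, and the names `overlap_len` and `greedy_assemble` are hypothetical.

```python
def overlap_len(a, b, min_len=3):
    """Length of the longest suffix of a that exactly matches a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of fragments with the longest overlap."""
    frags = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(frags) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap_len(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # no more merges can be done
        merged = frags[best_i] + frags[best_j][best_len:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)
    return frags

print(greedy_assemble(["ATGGCGT", "GCGTGCAA", "GCAATCC"]))
```

Because each merge is chosen locally, a repeat longer than a read can pull together fragments from different genome locations, which is the main weakness of the greedy approach.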
Overlap-layout-consensus
Main entity: read
Relationship between reads: overlap
[Figure: overlap graph and layout of reads 1-9; sequence example ACCTGAACCTGAAGCTGAACCAGA]
Paths through graphs and assembly
• Hamiltonian circuit: visit each node (city) exactly once, returning to the start
[Figure: graph on nodes A through I, and the same nodes laid out along the genome]
Implementation details
Overlap between two sequences
…AGCCTAGACCTACAGGATGCGCGGACACGTAGCCAGGAC
CAGTACTTGGATGCGCTGACACGTAGCTTATCCGGT…
overlap (19 bases), overhang (6 bases)
overlap - region of similarity between sequences
overhang - un-aligned ends of the sequences
The assembler screens merges based on:
• length of overlap
• % identity in overlap region
• maximum overhang size
% identity = 18/19 = 94.7%
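The identity figure can be reproduced from the two 19-base overlap regions of the example sequences. In the sketch below, `percent_identity` and `passes_screen` are hypothetical helpers, and the three thresholds are made-up defaults rather than values used by any particular assembler.

```python
def percent_identity(a, b):
    """Percent of matching positions between two gap-free aligned strings."""
    if len(a) != len(b):
        raise ValueError("aligned regions must have equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)

def passes_screen(region_a, region_b, overhang,
                  min_overlap=15, min_identity=90.0, max_overhang=10):
    """Apply the three screening criteria with hypothetical thresholds."""
    return (len(region_a) >= min_overlap
            and percent_identity(region_a, region_b) >= min_identity
            and overhang <= max_overhang)

top = "GGATGCGCGGACACGTAGC"     # overlap region of the first sequence
bottom = "GGATGCGCTGACACGTAGC"  # overlap region of the second sequence
print(round(percent_identity(top, bottom), 1))  # 94.7
print(passes_screen(top, bottom, overhang=6))   # True with these thresholds
```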
All pairs alignment
• Needed by the assembler
• Try all pairs – must consider ~ n² pairs
• Smarter solution: only n x coverage (e.g. 8) pairs are possible
  – Build a table of k-mers contained in sequences (single pass through the genome)
  – Generate the pairs from k-mer table (single pass through k-mer table)
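A minimal sketch of the k-mer table idea, assuming exact k-mer matches and reads on a single strand. The function name `candidate_pairs` is hypothetical; the pairs it returns are only candidates that would still need a real alignment.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(reads, k):
    """Return pairs of read indices that share at least one k-mer."""
    index = defaultdict(set)          # k-mer -> indices of reads containing it
    for i, read in enumerate(reads):  # single pass over the reads
        for p in range(len(read) - k + 1):
            index[read[p:p + k]].add(i)
    pairs = set()
    for ids in index.values():        # single pass over the k-mer table
        pairs.update(combinations(sorted(ids), 2))
    return pairs

reads = ["ATGGCGT", "GCGTGCAA", "GCAATCC", "TTTTTTT"]
print(sorted(candidate_pairs(reads, k=4)))  # [(0, 1), (1, 2)]
```

Only reads that share a k-mer are ever compared, so the read with no k-mer in common with the others never enters the alignment step.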
[Figure: a k-mer shared among reads A through I]
REPEATS
[Figure: reads 1-13 tiled across a genome with two repeat copies, RptA and RptB; below, the same reads drawn as a graph]
Non-repetitive overlap graph
[Figure: overlap graph with nodes 1, 2, 3, 4, {5,9}, {6,10}, 7, 8, 11, 12, 13]
Handling repeats
1. Repeat detection
  – pre-assembly: find fragments that belong to repeats
    • statistically (most existing assemblers)
    • repeat database (RepeatMasker)
  – during assembly: detect "tangles" indicative of repeats (Pevzner, Tang, Waterman 2001)
  – post-assembly: find repetitive regions and potential mis-assemblies
    • Reputer, RepeatMasker
    • "unhappy" mate-pairs (too close, too far, mis-oriented)
2. Repeat resolution
  – find DNA fragments belonging to the repeat
  – determine correct tiling across the repeat
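One way to picture the "unhappy" mate-pair test is the sketch below. It assumes both reads of a pair are placed in the same contig, that positions are the outer coordinates of the reads, and that the library is described by a mean insert size plus an absolute tolerance. The function name and its parameters are hypothetical.

```python
def classify_mate_pair(fwd_pos, fwd_strand, rev_pos, rev_strand,
                       mean_insert, tolerance):
    """Label a mate pair placed in one contig as happy or unhappy.

    Positions are outer read coordinates; strands are '+' or '-'.
    """
    # The reads must face each other: '+' read on the left, '-' on the right.
    left, right = sorted([(fwd_pos, fwd_strand), (rev_pos, rev_strand)])
    if (left[1], right[1]) != ("+", "-"):
        return "mis-oriented"
    distance = right[0] - left[0]
    if distance < mean_insert - tolerance:
        return "too close"
    if distance > mean_insert + tolerance:
        return "too far"
    return "happy"

print(classify_mate_pair(100, "+", 4100, "-", 4000, 400))  # happy
print(classify_mate_pair(100, "+", 1100, "-", 4000, 400))  # too close
print(classify_mate_pair(100, "-", 4100, "+", 4000, 400))  # mis-oriented
```

A cluster of unhappy pairs around one region is more informative than a single one, since individual clones can have unusual sizes.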
Statistical repeat detection
Significant deviations from average coverage flagged as repeats.
- frequent k-mers are ignored
- “arrival” rate of reads in contigs compared with theoretical value (e.g., 800 bp reads & 8x coverage - reads "arrive" every 100 bp)
Problem 1: assumption of uniform distribution of fragments - leads to false positives
  non-random libraries
  poor clonability regions
Problem 2: repeats with low copy number are missed - leads to false negatives
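The arrival-rate comparison can be illustrated with a deliberately crude sketch. It assumes reads start uniformly at random along the genome and flags a contig when it holds more than `factor` times the expected number of reads. Both helper names and the `factor` threshold are hypothetical, and real assemblers use more careful statistics.

```python
def expected_read_spacing(read_len, coverage):
    """Average distance between consecutive read starts."""
    return read_len / coverage

def looks_repetitive(contig_len, n_reads, read_len, coverage, factor=2.0):
    """Flag a contig holding far more reads than uniform sampling predicts.

    factor is a hypothetical threshold, not a value from the slides.
    """
    expected_reads = contig_len / expected_read_spacing(read_len, coverage)
    return n_reads > factor * expected_reads

print(expected_read_spacing(800, 8))         # 100.0
print(looks_repetitive(5_000, 50, 800, 8))   # False: about as many as expected
print(looks_repetitive(5_000, 150, 800, 8))  # True: three times the expectation
```

With `factor=2.0`, a two-copy repeat collapsed into one contig sits right at the threshold, which is one way to see why low-copy repeats are missed.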
Mis-assembled repeats
[Figure: three kinds of repeat mis-assembly: collapsed tandem, excision, rearrangement]
MTETVEDKVSHSITGLDILKGIVAAGAVISGTVATQTKVFTNESAVLEKTVEKTDALATNDTVVLGTISTSNSASSTSLSASESASTSASESASTSASTSASTSASESASTSASTSISASSTVVGSQTAAATEATAKKVEEDRKKPASDYVASVTNVNLQSYAKRRKRSVDSIEQLLASIKNAAVFSGNTIVNGAPAINASLNIAKSETKVYTGEGVDSVYRVPIYYKLKVTNDGSKLTFTYTVTYVNPKTNDLGNISSMRPGYSIYNSGTSTQTMLTLGSDLGKPSGVKNYITDKNGRQVLSYNTSTMTTQGSGYTWGNGAQMNGFFAKKGYGLTSSWTVPITGTDTSFTFTPYAARTDRIGINYFNGGGKVVESSTTSQSLSQSKSLSVSASQSASASASTSASASASTSASASASTSASASASTSASVSASTSASASASTSASASASTSASESASTSASASASTSASASASTSASASASTSASESASTSASASASTSASESASTSASASASTSASASASTSASGSASTSTSASASTSASASASTSASASASISASESASTSASESASTSTSASASTSASESASTSASASASTSASASASTSASASASTSASASTSASESASTSASASASTSASASASTSASASASTSASASASTSASVSASTSASASASTSASASASTSASESASTSASASASTSASASASTSASASASTSASASASTSASASASTSASESASTSASASASTSASASASTSASASASTSASASASTSASASASISASESASTSASASASTSASASASTSASASASTSASESASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASESASTSASASASTSASESASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASTSASESASTSASASASTSASASASTSASASASTSASESASTSASASASTSASASASTSASASASTSASASASTSASASASISASESASTSASASASTSASVSASTSASASASTSASESASTSASASASTSASESASTSASASASTSASASASISASESASTSASASASTSASASASTSASASASTSASESASTSTSASASTSASESASTSASASASTSASASASTSASASASTSASASASTSASASTSASESASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASESASTSASASASTSASASASTSASASASTSASASASTSASVSASTSASESASTSASASASTSASASASTSASESASTSASASASTSASESASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASISASESASTSASASASTSASASASTSASVSASTSASASASTSASASASISASESASTSASASASTSASASASTSASASASTSASASASISASESASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASESASTSASASASTSASASASTSASASASTSASVSASTSASESASTSASASASTSASASASTSASASASTSASESASTSASASASTSASASASTSASESASTSASASASTSASASASTSASASASTSASASASASTSASASASTSASASASTSASASASISASESASTSASESASTSTSASASTSASESASTSASASASTSASASASTSASASASTSASASTSASESASTSASASASTSASASASTSASASASTSASASASTSASASASTSASVSASTSASASASTSASASASTSASESASTSASASASTSASASASTSASASASTSASASASTSASASASTSASESASTSASASASTSASASASTSAS
ASASTSASASASTSASASASISASESASTSASASASTSASASASTSASASASTSASESASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASESASTSASASASTSASESASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASTSASESASTSASASASTSASASASTSASASASTSASESASTSASASASTSASASASTSASASASTSASASASTSASASASISASESASTSASASASTSASVSASTSASASASTSASESASTSASASASTSASESASTSASASASTSASASASISASESASTSASASASTSASASASTSASASASTSASESASTSTSASASTSASESASTSASASASTSASASASTSASASASTSASASASTSASASTSASESASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASESASTSASASASTSASASASTSASASASTSASVSASTSASESASTSASASASTSASASASTSASASASTSASESASTSASASASTSASASASTSASESASTSASASASTSASASASTSASASASTSASASASASTSASASASTSASASASTSASASASISASESASTSASASASASTSASASASTSASASASTSASASASISASESASTSASESASTSTSASASTSASESASTSASASASTSASASASTSASASASTSASASTSASESASTSASASASTSASASASTSASASASTSASASASTSASASASTSASVSASTSASASASTSASASASTSASESASTSASASTSASESASTSASASASTSASASASTSASASASTSASESASTSASASASTSASASASTSASESASTSASASASTSASASASTSASASASTSASESASTSASASASTSASESASTSASASASTSASASASTSASGSASTSTSASASTSASASASTSASASASISASESASTSASESASTSTSASASTSASESASTSASASASTSASASASTSASASASTSASASTSASESASTSASASASTSASASASTSASASASTSASASASTSASVSASTSASASASTSASASASTSASESASTSASASASTSASASASTSASASASTSASASASTSASASASTSASESASTSASASASTSASASASTSASASASTSASASASTSASASASISASESASTSASASASTSASASASTSASASASTSASESASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASESASTSASASASTSASESASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASTSASESASTSASASASTSASASASTSASASASTSASESASTSASASASTSASASASTSASASASTSASASASTSASASASISASESASTSASASASTSASVSASTSASASASTSASESASTSASASASTSASESASTSASASASTSASASASISASESASTSASASASTSASASASTSASASASTSASESASTSTSASASTSASESASTSASASASTSASASASTSASASASTSASASASTSASASTSASESASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASESASTSASASASTSASASASTSASASASTSASASASTSASVSASTSASESASTSASASASTSASASASTSASESASTSASASASTSASESASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASISASESASTSASASASTSASASASTSASVSASTSASASASTSASASASISASES
ASTSASASASTSASASASTSASASASTSASASASISASESASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASESASTSASASASTSASASASISASESASTSASASASTSASASASTSASASASTSASESASTSTSASASTSASESASTSASASASTSASASASTSASASASTSASASASTSASASTSASESASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASESASTSASASASTSASASASTSASASASTSASASASTSASVSASTSASESASTSASASASTSASASASTSASESASTSASASASTSASESASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASISASESASTSASASASTSASASASTSASVSASTSASASASTSASASASISASESASTSASASASTSASASASTSASASASTSASASASISASESASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSASASASTSVSNSANHSNSQVGNTSGSTGKSQKELPNTGTESSIGSVLLGVLAAVTGIGLVAKRRKRDEEE
SASA repeat (4776 AA, 14 Kb) from Streptococcus pneumoniae