De-Novo Genome Assembly and its Current State · 2016-11-24 · De-Novo Genome Assembly and its...
De-Novo Genome Assembly and its Current State
Anne-Katrin Emde April 17, 2013
Freie Universität Berlin, Algorithmische Bioinformatik Max Planck Institut für Molekulare Genetik, Computational Molecular Biology
What is genome assembly and why is it hard?
Original genome sequence (in multiple copies) → sequence short fragments → reconstruct original sequence from reads (assembler)
Difficulties:
- Genomes are very long and repeat-rich
- Reads are very short and may contain errors and biases
Assembly algorithms - introduction
Assemblers use overlaps between reads to first produce contigs, which are then used to build scaffolds.
Read pairs contribute long-range linking information, especially in the scaffolding phase.
[Figure: reads, the contigs built from them, and a scaffold of contigs separated by NNNN...NNNN gaps]
Two paradigms
1) Overlap – Layout – Consensus: Focus is on sequence reads
2) De Bruijn graph: Focus is on k-tuples occurring in sequence reads
NGS algorithms, March 20 2012
Algorithms – Overlap Graph
Reads: ACGTAATT GTAATTCA ATTCAGTC GTCCATGT CATGTTGA TGTTGACT
Contig: ACGTAATTCAGTCCATGTTGACT
(Kececioglu and Myers, 1995)
1) Overlap phase: pairwise overlap alignments
2) Layout phase: overlap graph construction and finding relative placement of reads
3) Consensus phase: Produce multiple read alignment and compute contig consensus sequences
Algorithms – Overlap Graph
1) Overlap phase: pairwise overlap alignments for all pairs of reads
Computationally expensive!
[Figure: overlap alignment between read1 and read2]
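The all-pairs comparison can be sketched as follows. This is a minimal illustration, not the method of any particular assembler: it assumes exact suffix-prefix matches only (real overlap alignments tolerate errors), and the names `overlap`, `all_overlaps` and the threshold `min_len` are made up for this example. The reads are the six example reads from the overlap-graph slide.

```python
from itertools import permutations

def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if the longest exact match is shorter than `min_len`.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def all_overlaps(reads, min_len=3):
    """Compare every ordered pair of reads: quadratic in the number of reads."""
    found = {}
    for i, j in permutations(range(len(reads)), 2):
        n = overlap(reads[i], reads[j], min_len)
        if n:
            found[(i, j)] = n
    return found

reads = ["ACGTAATT", "GTAATTCA", "ATTCAGTC", "GTCCATGT", "CATGTTGA", "TGTTGACT"]
overlaps = all_overlaps(reads)  # keys are (from_read, to_read) index pairs
```

The loop over all ordered pairs is what makes this phase expensive: the number of comparisons grows with the square of the number of reads.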
Algorithms – Overlap Graph
1) Overlap phase:
2) Layout phase: nodes = reads, edges = overlaps
[Figure: staggered layout of reads r1 to r6 and the corresponding overlap graph, with edges labeled -6, -5, -3, -3, -3, -5, -6]
A path through the graph corresponds to a read layout.
In theory: Hamiltonian path (NP-complete); in practice: heuristics
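One simple heuristic is to walk the graph greedily, always following the largest remaining overlap. The sketch below is only an illustration of that idea, not the heuristic of a specific assembler. The function name `greedy_layout` is hypothetical, and the edge weights are the exact suffix-prefix overlap lengths (at least 3 bases) between the six example reads r1 to r6, indexed 0 to 5.

```python
def greedy_layout(num_reads, overlaps):
    """Order reads by repeatedly following the largest overlap to an unvisited read.

    `overlaps` maps (from_read, to_read) to an overlap length.
    This is a heuristic: it is not guaranteed to find a Hamiltonian path.
    """
    targets = {j for (_, j) in overlaps}
    starts = [i for i in range(num_reads) if i not in targets]
    if not starts:
        raise ValueError("every read has an incoming overlap; no obvious start")
    layout = [starts[0]]
    visited = {starts[0]}
    while True:
        current = layout[-1]
        candidates = [(n, j) for (i, j), n in overlaps.items()
                      if i == current and j not in visited]
        if not candidates:
            break
        _, best = max(candidates)  # largest overlap wins
        layout.append(best)
        visited.add(best)
    return layout

# Overlap lengths between the example reads (assumed exact matches).
overlaps = {(0, 1): 6, (0, 2): 3, (1, 2): 5, (2, 3): 3,
            (3, 4): 5, (3, 5): 3, (4, 5): 6}
layout = greedy_layout(6, overlaps)
```

On this small example the greedy walk visits every read once, in the order r1 to r6. On repeat-rich data a greedy choice can take a wrong branch, which is why real assemblers add further checks.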
Algorithms – Overlap Graph
2) Layout phase: nodes = reads, edges = overlaps
[Figure: overlap graph over reads r1 to r6]
3) Consensus phase: multiple read alignment
Reads (r1 to r6): ACGTAATT GTAATTCA ATTCAGTC GTCCATGT CATGTTGA TGTTGACT
Contig: ACGTAATTCAGTCCATGTTGACT
Algorithms – Overlap Graph
2) Layout phase: nodes = reads, edges = overlaps
[Figure: overlap graph over reads r1 to r6]
3) Consensus phase: multiple read alignment
Reads (r1 to r6): ACGTAATT GTAATTCA AT-CAGTC GTCCATGT CATATTGA TGTTGACT
Contig: ACGTAATTCAGTCCATGTTGACT
Robust to errors in the aligned reads
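A column-wise majority vote is one simple way to compute such a consensus. The sketch below assumes the multiple alignment is already given as (offset, aligned read) pairs, with `-` marking a gap; the offsets are derived from the exact overlap lengths of the error-free reads, and the function name `consensus` is hypothetical. Real consensus callers typically also weigh base qualities.

```python
from collections import Counter

def consensus(aligned_reads):
    """Majority vote per alignment column.

    `aligned_reads` is a list of (offset, sequence) pairs; '-' marks a gap.
    Columns where a gap wins the vote contribute no base; uncovered columns
    are skipped.
    """
    length = max(offset + len(seq) for offset, seq in aligned_reads)
    columns = [Counter() for _ in range(length)]
    for offset, seq in aligned_reads:
        for i, base in enumerate(seq):
            columns[offset + i][base] += 1
    calls = [col.most_common(1)[0][0] for col in columns if col]
    return "".join(base for base in calls if base != "-")

# Example reads with errors: r3 has a deletion, r5 has a substitution.
aligned = [
    (0, "ACGTAATT"),
    (2, "GTAATTCA"),
    (5, "AT-CAGTC"),
    (10, "GTCCATGT"),
    (13, "CATATTGA"),
    (15, "TGTTGACT"),
]
contig = consensus(aligned)
```

Both erroneous columns are covered by three reads here, so the two correct reads outvote the erroneous one.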
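Overlap Graph
There are not only simple paths…
[Figure: genome with regions A, B, C and two copies of a repeat R]
- Coverage is high
- Path branches
So R is a repeat!
Approximate ordering in the overlap graph:
[Figure: overlap graph over the nodes R, A, B, C]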
Overlap Graph
There are not only simple paths…
Approximate ordering in the overlap graph:
[Figure: genome with regions A, B, C and two copies of a repeat R; overlap graph over the nodes A, R, B, C]
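Overlap Graph
Now: R1 and R2 are nearly identical
Approximate ordering in the overlap graph:
[Figure: genome with regions A, B, C and repeat copies R1 and R2; reads from the two copies differ only at single positions (A vs. T); overlap graph over the nodes A, Rx, B, C]
Overlap strictness: Tradeoff between error tolerance and “natural” repeat separation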
Overlap Graph
Now: R1 and R2 are nearly identical
Approximate ordering in the overlap graph:
[Figure: genome with regions A, B, C and repeat copies R1 and R2; overlap graph with separate nodes R1 and R2 alongside A, B, C]
Overlap strictness: Tradeoff between error tolerance and “natural” repeat separation
Algorithms - de Bruijn graph (Euler assembler)
- No overlap phase, no consensus phase, basically just a layout phase
Nodes = k-mers, Edges = (k+1)-mers
de Bruijn, 1946; Pevzner, 2001; Medvedev 2007
Given k = 4 and three read sequences: r1 = CGTAATTC, r2 = GTAATTCA, r3 = TACGTAAT
k-mer nodes along the path: TACG ACGT CGTA GTAA TAAT AATT ATTC TTCA
contig: TACGTAATTCA
[Figure: each of the reads r1, r2, r3 traces a walk through the graph]
In theory “de Bruijn Superwalk Problem” (NP-hard), in practice heuristics
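The construction for the three reads above can be sketched as follows. The function names `de_bruijn` and `walk_linear` are made up for this example, and the walk only handles the simple case of a graph that is a single unbranched path; it does not address the superwalk problem.

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Nodes are the k-mers of the reads; edges are their (k+1)-mers."""
    nodes = set()
    edges = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            nodes.add(read[i:i + k])
        for i in range(len(read) - k):
            edge = read[i:i + k + 1]
            edges[edge[:k]].add(edge[1:])  # prefix k-mer -> suffix k-mer
    return nodes, edges

def walk_linear(nodes, edges):
    """Follow the unique path from the node without incoming edges.

    Assumes the graph is one unbranched path; stops at a branch or cycle.
    """
    has_incoming = {b for targets in edges.values() for b in targets}
    starts = sorted(nodes - has_incoming)
    if len(starts) != 1:
        raise ValueError("expected exactly one start node")
    path = [starts[0]]
    seen = {starts[0]}
    while True:
        successors = edges.get(path[-1], set())
        if len(successors) != 1:
            break
        (node,) = successors
        if node in seen:
            break
        path.append(node)
        seen.add(node)
    return path

reads = ["CGTAATTC", "GTAATTCA", "TACGTAAT"]
nodes, edges = de_bruijn(reads, k=4)
path = walk_linear(nodes, edges)
# Spell the contig: first k-mer plus the last base of every later k-mer.
contig = path[0] + "".join(node[-1] for node in path[1:])
```

Note that no read-to-read comparison happens anywhere: overlaps are implicit in shared k-mers.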
de Bruijn graph
Again, there are not only simple paths…
Genome: ...GACGTACGTCA... (the 4-mer ACGT occurs twice)
Reads: GACGTACG CGTACGTC GTACGTCA
Nodes (k = 4): GACG ACGT CGTA GTAC TACG CGTC GTCA
GACGTCA is a read-incoherent path
Repetitive k-mers introduce cycles
de Bruijn graph
What if we increase k to 5?
Genome: ...GACGTACGTCA...
Reads: GACGTACG CGTACGTC GTACGTCA
Nodes (k = 5): GACGT ACGTA CGTAC GTACG TACGT ACGTC CGTCA
Back to a linear graph structure
Increasing k leads to better repeat resolution
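The effect of k can be checked on the three example reads by listing the nodes with more than one incoming or outgoing edge. This is a small sketch with a hypothetical helper name, `branching_nodes`; as on the earlier slide, edges are taken to be the (k+1)-mers observed in the reads.

```python
from collections import defaultdict

def branching_nodes(reads, k):
    """Return the k-mer nodes with in-degree or out-degree greater than one."""
    successors = defaultdict(set)
    predecessors = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            a, b = read[i:i + k], read[i + 1:i + k + 1]
            successors[a].add(b)
            predecessors[b].add(a)
    touched = set(successors) | set(predecessors)
    return {n for n in touched
            if len(successors.get(n, ())) > 1 or len(predecessors.get(n, ())) > 1}

reads = ["GACGTACG", "CGTACGTC", "GTACGTCA"]
branching_k4 = branching_nodes(reads, 4)  # the repeated 4-mer branches
branching_k5 = branching_nodes(reads, 5)  # no branching node remains
```

With k = 4 the repeated k-mer ACGT is the only branching node; with k = 5 every k-mer of the genome fragment is unique and the graph is a single path.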
de Bruijn graph
What if we have sequencing errors?
Genome: ...GACGTACGTCA...
Reads: GACGTACG CGTACGTC GTACGGCA
Nodes: GACGT ACGTA CGTAC GTACG TACGT ACGTC CGTCA
Additional nodes: TACGG ACGGC CGGCA
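The additional nodes can be reproduced by comparing the k-mers seen in the reads with the k-mers of the true sequence. In practice the true sequence is unknown; it is used here only to make the effect of the error visible. The helper names `kmers` and `extra_nodes` are assumptions of this sketch.

```python
def kmers(seq, k):
    """Set of all k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def extra_nodes(reads, genome, k):
    """k-mers that occur in the reads but not in the genome (error k-mers)."""
    observed = set()
    for read in reads:
        observed |= kmers(read, k)
    return observed - kmers(genome, k)

genome = "GACGTACGTCA"
reads = ["GACGTACG", "CGTACGTC", "GTACGGCA"]  # last read: T -> G error
extra = extra_nodes(reads, genome, k=5)
```

A single substitution can create up to k erroneous k-mers, depending on how close the error lies to the ends of the read.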
A quick example
One read: GTCGAGG
k-mers: GTCG (1x), TCGA (1x), CGAG (1x), GAGG (1x)
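Counting k-mers is the first step of this example and can be sketched in a few lines. The function name `count_kmers` is hypothetical, and k = 4 is taken from the k-mers shown for the read.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count how often each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

counts = count_kmers(["GTCGAGG"], k=4)
```

Adding more reads to the list raises the counts of k-mers that are seen repeatedly, which is what the multiplicities such as (8x) or (16x) express.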
A quick example
AGAT (8x)
ATCC (7x)
TCCG (7x)
CCGA (7x)
CGAT (6x)
GATG (5x)
ATGA (8x)
TGAG (9x)
GATC (8x)
GATT (1x)
TAGT (3x)
AGTC (7x)
GTCG (9x)
TCGA (10x)
GGCT (11x)
TAGA (16x)
AGAG (9x)
GAGA (12x)
GACA (8x)
ACAG (5x)
GCTT (8x)
GCTC (2x)
CTTT (8x)
CTCT (1x)
TTTA (8x)
TCTA (2x)
TTAG (12x)
CTAG (2x)
AGAC (9x)
AGAA (1x)
CGAG (8x)
CGAC (1x)
GAGG (16x)
GACG (1x)
AGGC (16x)
ACGC (1x)
All the others…
A quick example
TAGTCGA
AGAGA TAGA
AGAT
GCTTTAG
GCTCTAG
AGACAG
AGAA
CGAG
CGACGC
GAGGCT
GATCCGATGAG
GATT
After simplification…
Error removal
TAGTCGA
AGAGA TAGA
AGAT
GCTTTAG
GCTCTAG
AGACAG
CGAG
GAGGCT
GATCCGATGAG
Tips removed…
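Tip removal can be sketched as clipping short, low-coverage dead ends. The rule used below (chain shorter than 2k and less covered than every competing branch) is an assumption in the spirit of common de Bruijn assemblers, not necessarily the rule behind the figure. Because the full edge set of the quick example is not recoverable from the slide, the sketch uses the earlier error example (genome fragment GACGTACGTCA, k = 5) with an invented coverage of three copies per correct read. Only forward tips (nodes without successors) are handled, and the function names are hypothetical.

```python
from collections import Counter, defaultdict

def build_graph(reads, k):
    """k-mer counts plus successor and predecessor maps from (k+1)-mers."""
    counts = Counter()
    succ, pred = defaultdict(set), defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
        for i in range(len(read) - k):
            a, b = read[i:i + k], read[i + 1:i + k + 1]
            succ[a].add(b)
            pred[b].add(a)
    return counts, succ, pred

def remove_tips(counts, succ, pred, k):
    """Return the nodes left after clipping short, low-coverage dead ends."""
    alive = set(counts)

    def out(n):
        return [m for m in succ.get(n, ()) if m in alive]

    def inc(n):
        return [m for m in pred.get(n, ()) if m in alive]

    changed = True
    while changed:
        changed = False
        for end in sorted(n for n in alive if not out(n)):
            chain, branch = [end], None
            # Walk backwards until a branching node or the length limit.
            while len(chain) < 2 * k:
                parents = inc(chain[-1])
                if len(parents) != 1:
                    break
                if len(out(parents[0])) > 1:
                    branch = parents[0]
                    break
                chain.append(parents[0])
            if branch is None:
                continue
            rivals = [m for m in out(branch) if m != chain[-1]]
            if all(counts[chain[-1]] < counts[m] for m in rivals):
                alive.difference_update(chain)
                changed = True
                break
    return alive

k = 5
reads = ["GACGTACG", "CGTACGTC", "GTACGTCA"] * 3 + ["GTACGGCA"]
counts, succ, pred = build_graph(reads, k)
remaining = remove_tips(counts, succ, pred, k)
```

The coverage comparison matters: the correct path also ends in a dead end here, and only the lower-covered branch is clipped.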
Error removal
TAGTCGA
AGAGA TAGA
AGAT
GCTTTAG AGACAG
CGAG
GAGGCT
GATCCGATGAG
Bubbles removed …
Error removal
TAGTCGAG AGAGACAG
AGATCCGATGAG
GAGGCTTTAGA
Final simplification…
de Bruijn graph
What if we have sequencing errors?
Genome: ...GACGTACGTCA...
Reads: GACGTACG CGTACGTC GTACCTCA
Nodes: GACGT ACGTA CGTAC GTACG TACGT ACGTC CGTCA
Additional nodes: GTACC TACCT ACCTC CCTCA
& graph disconnected
Choice of k: Tradeoff between error tolerance and repeat resolution
What is a good assembly?
Some assembly evaluation metrics:
- High N50 contig size (most commonly used metric)
  [Figure: contigs sorted by size, with the N50 contig size marked]
- Low number of assembly errors (sequence errors, structural misjoins)
  Only if a reference sequence is known!
  [Figure: original sequence and what it was assembled as]
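The N50 contig size can be computed as sketched below, using the usual definition: the length of the contig at which the cumulative length of the contigs, sorted from largest to smallest, first reaches half of the total assembly length. The contig lengths in the example are invented for illustration.

```python
def n50(contig_lengths):
    """Length of the contig at which the size-sorted cumulative sum
    first reaches half of the total assembly length."""
    if not contig_lengths:
        raise ValueError("need at least one contig")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length

# Hypothetical contig lengths; total is 290, half of that is 145.
value = n50([80, 70, 50, 40, 30, 20])
```

N50 rewards contiguity only. An assembler that wrongly joins contigs gets a higher N50, which is why it is reported together with error counts whenever a reference is available.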
Conclusions
- Most assemblers use either the OLC or the de Bruijn graph paradigm; both lead to NP-hard assembly models.
- However, assembler performance is independent of the underlying paradigm and mainly depends on heuristics for repeat resolution and handling noise.
- How to measure assembly accuracy is another important aspect; in general, there is a tradeoff between assembly contiguity and correctness.
- Evaluations show that assembly is far from solved; assembler performance is still quite inconsistent.
“For large genomes, the choice of assemblers is often limited to those that will run without crashing” (GAGE paper, 2012)
References
- GAGE: a critical evaluation of genome assemblies and assembly algorithms. Steven L. Salzberg et al., Genome Res. 2012
- Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species, Bradnam et al., not yet published
- Assembly algorithms for next-generation sequencing data. Jason R. Miller, Sergey Koren, Granger Sutton, Genomics, 2010.
- Fragment assembly string graph, Myers, 2005
Figures:
- geneed.nlm.nih.gov
- http://paper-shredding-services-review.toptenreviews.com/
- www.illumina.com
- A Practical Comparison of De Novo Genome Assembly Software Tools for Next-Generation Sequencing Technologies, Zhang et al., PLOS one, 2011