De novo repeat classification and fragment assembly: from de Bruijn to A-Bruijn graphs


Page 1: De novo repeat classification and fragment assembly: from de Bruijn to A-Bruijn graphs · 2012-03-06 · De novo repeat classification and fragment assembly with A-Bruijn graph approach

De novo repeat classification and fragment assembly: from de Bruijn to A-Bruijn graphs

Page 2

IS30 repeat family

Repeats in Bacterial Genome: IS30 repeat family in N. meningitidis

Edge label: 130(2) means length 130bp and multiplicity 2

Page 3

De novo repeat classification and fragment assembly with A-Bruijn graph approach

Pavel Pevzner (1), Haixu Tang (2), Glenn Tesler (3)

(1) Department of Computer Science and Engineering, UCSD

(2) Department of Computer Science, Indiana University

(3) Department of Mathematics, UCSD

Page 4

Adapted from http://www.hhmi.org/research/investigators/eichler.html

Duplication landscape of 2p11

Mosaic Arrangements in Segmental Duplications in Human Genome

Page 5

Two-step model of segmental duplications (Evan Eichler, 1997)

Adapted from Horvath et al 2005

Ancestral duplication units are called duplicons

Page 6

Mosaic repeats structure: an imaginary example

A B C D E F G H I J

A B C D E F G H I J C

A B C D E F G H I J C B C D

A B C D E F G H I J C B C D F G C

• The mosaic structure of segmental duplications in the human genome is revealed using the A-Bruijn graph approach: Jiang, Tang, She, Pevzner, Eichler. Evolutionary reconstruction of segmental duplications reveals punctuated cores of human gene innovations (Nature Genetics, 2007)

Page 7

The Segmental Duplication in Human Genome: The Ancestral Duplicon Problem

• The duplicon problem: can we identify duplicons?

• Case-by-case studies:

* Jackson et al 1999; …, Stankiewicz et al 2004; Horvath et al 2005

• All duplicons: hard

* The Eichler group had only limited success with chr22 (Bailey et al 2002)

• The mosaic structure of segmental duplications in the human genome is revealed using the A-Bruijn graph approach: Zhaoshi Jiang, Haixu Tang, Xinwei She, Pavel Pevzner, Evan Eichler. Evolutionary reconstruction of segmental duplications reveals punctuated cores of human gene innovations (Nature Genetics, 2007)

Page 8

The Duplicon Problem

• Tang et al 2007 proposed a computational method for identifying all duplicons.

Adapted from Tang et al 2005

Page 9

Repeat Classification

• Repeat representation as a mosaic of sub-repeats

• Detailed study of each sub-repeat and its further classification into repeat sub-families

Page 10

Mosaic Structure of Segmental Duplication in Human Chromosome 22

Bailey et al. 2002, Am. J. Hum. Genet.

Page 11

Algorithmic Challenge

• Problem: find all repeat elements and reveal the sub-repeat mosaic structure.

– Perfect repeats: de Bruijn graph, suffix tree.

– Imperfect repeats: OPEN PROBLEM.

• Goal: Generalize the de Bruijn graph for imperfect repeats.

Page 12

De Novo Repeat Classification: messy and ill-defined problem

“The problem of automated repeat sequence family classification is inherently messy and ill-defined and does not appear to be amenable to a clean algorithmic attack.”

Bao & Eddy 2002, Genome Research


Page 14

De Novo Repeat Classification

Library of repeat elements

Element 1 AGCCTACG … …

Element 2 TGCATTTT … …

Element 3 GAACTCAC … …

… …

De novo compilation

?

REPuter

Pairwise similarity

Page 15

Similarity matrix

A B C D E F G H I J C B C D F G C

Page 16

De novo Repeat Classification: Previous Studies

1. RepeatFinder: Volfovsky et al., 2001

2. RECON: Bao & Eddy, 2002

– Heuristic algorithms that work well for real data

– Do not reveal the mosaic structure

Page 17

Annotating Known Repeat Elements

Library of (known) repeat elements

Element 1 AGCCTACG … …

Element 2 TGCATTTT … …

Element 3 GAACTCAC … …

… …

De novo Repeat Classification Problem: Given a newly sequenced genome, find all repeat elements (sub-repeats) in this genome.

Genome

RepeatMasker

Page 18

Mosaic Structure of Repeats: A Real Example (Clone G12 from Human Chromosome Y)

8328 140 628 1185 2905 381 161442 628 1185 140 628 1185 381 140 628

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

RECON (Bao and Eddy, 2002) does not reveal the mosaic:

A-Bruijn representation 2 copies

3 copies

2 copies

4 copies

?

Page 19

De Bruijn Graph: Applications in Bioinformatics

• Sequencing by hybridization (Pevzner, 1989)

• Re-sequencing with DNA arrays (Shamir and Tsur, 2001; Peer et al., 2002)

• Fragment assembly (Idury and Waterman, 1995; Pevzner et al., 2001)

• EST analysis (Heber et al., 2002)

• Computational mass spectrometry (Bocker, 2003; Bandeira et al., 2007)

Page 20

De Bruijn Graph: Classification of Perfect Repeats

ABCDEFCGHBCDIFCGJ

Vertices: (k-1)-mers from the sequence

AB BC CD DE EF FC CG GH HB DI IF GJ

Edges: k-mers from the sequence

Every sub-repeat is represented as a repeat edge in the graph: BCD, FCG
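As a minimal sketch of the construction on this slide (not the authors' implementation), the Python below builds the de Bruijn graph of the example sequence with k = 3, treating each letter as one symbol. The function name `de_bruijn_edges` is hypothetical. An edge with multiplicity greater than 1 is a repeat edge.

```python
from collections import Counter

def de_bruijn_edges(seq, k):
    """Count every k-mer of seq; each k-mer is an edge from its
    (k-1)-mer prefix to its (k-1)-mer suffix."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

seq = "ABCDEFCGHBCDIFCGJ"   # example sequence from the slide
edges = de_bruijn_edges(seq, 3)

# Vertices are the (k-1)-mers that start or end some edge.
vertices = sorted({e[:-1] for e in edges} | {e[1:] for e in edges})

# Repeat edges are the k-mers that occur more than once.
repeat_edges = sorted(e for e, mult in edges.items() if mult > 1)

print(len(vertices), repeat_edges)
```

On this sequence the sketch finds the 12 vertices listed on the slide and the two repeat edges BCD and FCG.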

Page 21

Repeat Gluing

[Figure: repeat gluing diagram with segments labeled x and y]

Page 22

Repeat Gluing

[Figure: repeat gluing diagram with segments labeled x and y, annotated "gluing instruction"]

Page 23

Similarity matrix

A B C D E F G H I J C B C D F G C

Page 24

A B C D E F G H I J C B C D F G C

Sub-repeats: edges in the repeat graph

[Figure: repeat graph with edges A to J; repeat edges B, D, F and G are labeled "2 copies" and edge C is labeled "4 copies"]

Page 25

In reality, repeats are usually imperfect

8328 140 628 1185 2905 381 161442 628 1185 140 628 1185 381 140 628

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

… … AG-CCATCGACGTCACC … …
… … AGTGCCTCG-CGTCTCC … …

Page 26

Similarity matrix

A B C D E F G H I J C B C D F G C

Page 27

Inconsistent Gluing

[Figure: gluing of segments labeled x and y, inconsistent case]

Consistent Gluing

[Figure: gluing of segments labeled x and y, consistent case]

Page 28

Challenge: Generalize the Notion of De Bruijn Graph for Imperfect Repeats

• Input

– a genomic sequence

– all significant local pairwise alignments

• Output

– repeat graph representing all repeats as a mosaic of sub-repeats

Page 29

A-Bruijn Graph Construction

Genomic sequence: … acat … acgt … ccat …

Pairwise local alignments:

1: acat / acgt

2: acgt / ccat

3: acat / ccat

[Figure: similarity matrix of the sequence against itself, the A-graph on the letters of the three copies, and the resulting A-Bruijn graph with glued vertices labeled (a,c), (c), (a,g), (t)]
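The gluing step on this slide can be sketched with a union-find structure: every pair of positions aligned to each other is merged into one vertex. This is only an illustration under simplifying assumptions: the three copies are concatenated without the "…" spacers, the alignments are gapless, and the names `glue_positions`, `copies` and `alignments` are hypothetical.

```python
def glue_positions(n, aligned_pairs):
    """Union-find gluing: positions joined by an alignment column share a vertex."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in aligned_pairs:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)  # smallest position represents the class
    return [find(i) for i in range(n)]

copies = ["acat", "acgt", "ccat"]      # the three repeat copies from the slide
seq = "".join(copies)                  # assumption: spacers between copies are dropped
# (start of copy 1, start of copy 2, length) for the three pairwise alignments
alignments = [(0, 4, 4), (4, 8, 4), (0, 8, 4)]
pairs = [(s + d, t + d) for s, t, length in alignments for d in range(length)]
roots = glue_positions(len(seq), pairs)

# Label each glued vertex with the set of letters it contains.
letters = {}
for pos, root in enumerate(roots):
    letters.setdefault(root, set()).add(seq[pos])
labels = [",".join(sorted(letters[r])) for r in sorted(letters)]
print(labels)
```

The twelve positions collapse into four vertices whose labels are `a,c`, `c`, `a,g` and `t`, matching the vertex labels in the slide's A-Bruijn graph.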

Page 30

A-Bruijn Graph Construction: Bulges

Genomic sequence: … at … act … acat …

Consistent pairwise local alignments:

1: a-t / act

2: ac-t / acat

3: a--t / acat

[Figure: similarity matrix of the sequence against itself, the A-graph, and the resulting A-Bruijn graph with vertices a, c, a, t]

Page 31

A-Bruijn Graph Construction: Whirls

Genomic sequence: … at … act … acat …

Inconsistent pairwise local alignments:

1: a-t / act

2: ac-t / acat

3: --at / acat

[Figure: similarity matrix of the sequence against itself, the A-graph, and the resulting A-Bruijn graph with vertices a, c, t]

Page 32

Repeat Graph

8328 140 628 1185 2905 381 161442 628 1185 140 628 1185 381 140 628

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

[Figure: A-Bruijn graph and the corresponding repeat graph, with segments labeled x and y]

Page 33

A-Bruijn graph

repeat graph

Problem: Simplifying A-Bruijn Graph

Page 34

Removing Bulges and Whirls: Solving MSLG Problem

Maximum Subgraph with Large Girth (MSLG) Problem:

Input: a graph;

Output: a maximum-weight subgraph that does not contain short cycles, i.e. cycles of length less than a parameter girth.

Solution known only when the girth is infinite: the Maximum Spanning Tree Problem (maximum-weight acyclic subgraph).

NP-hard problem (Skiena, 2002).

Page 35

Minimum (or Maximum) Spanning Trees

• The first algorithm for finding an MST was developed by Borůvka in 1926 to minimize the cost of electrical coverage in Moravia.

• The MST Problem

– Connect all of the cities using the least amount of wire possible

Page 36

Solution: Maximum Spanning Tree Approximation to MSLG Problem
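For the infinite-girth case named here, a maximum spanning tree (a forest, if the graph is disconnected) can be found with Kruskal's algorithm run on edges sorted by decreasing weight. The sketch below shows only that special case; it is not the RepeatGluer heuristic itself, and the function name and the toy graph are hypothetical.

```python
def maximum_spanning_forest(num_vertices, weighted_edges):
    """Kruskal's algorithm on edges sorted by decreasing weight.
    weighted_edges holds (weight, u, v) tuples with 0 <= u, v < num_vertices."""
    parent = list(range(num_vertices))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    kept = []
    for weight, u, v in sorted(weighted_edges, reverse=True):
        ru, rv = find(u), find(v)
        if ru != rv:              # the edge joins two components, so it closes no cycle
            parent[ru] = rv
            kept.append((weight, u, v))
    return kept

# Toy graph: the light edge (1, 0, 2) would close a cycle and is dropped.
toy_edges = [(5, 0, 1), (3, 1, 2), (1, 0, 2), (4, 2, 3)]
tree = maximum_spanning_forest(4, toy_edges)
print(tree, sum(w for w, _, _ in tree))
```

On the toy graph the three heaviest edges are kept, for a total weight of 12.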

Page 37

RepeatGluer Algorithm

A-Bruijn graph

Bulge and whirl removal

Zigzag path straightening

Erosion

[Figure: graph on vertices a to j before simplification; after simplification the vertices are a, c, b, (d,h), (e,g,i), (f,j)]

Page 38

IS30 repeat family

Repeats in Bacterial Genome: IS30 repeat family in N. meningitidis

Edge label: 130(2) means length 130bp and multiplicity 2

Page 39

The Repeats Graph of N. meningitidis

Page 40

Repeats in Human Genome (ALU)

Edge label: 17(80-127) means length 17bp and multiplicity between 80 and 127

Page 41

Multiple Alignment = Finding Repeats in a Concatenate of all Sequences

Raphael et al., 2004 (Genome Research)

Page 42

Fragment Assembly in Genome Sequencing

• Genomes are very long (the human genome has 3 billion base pairs)

• Current technology can only reliably “read” a short DNA fragment (a read, typically 500 to 1000 base pairs)

• Whole-genome shotgun sequencing: break the genome into millions of overlapping pieces (reads) and sequence each read

• Fragment assembly: assembly of the genome from millions of overlapping reads

• Celera Genomics has assembled the human genome (2001)

• Over 100 large genomes are waiting for assembly

• Current assembly programs may produce misassemblies

– Recent mammalian genome assemblies (e.g. mouse, rat) are estimated to have thousands of misassemblies

• Repeats in the genomic sequence are the main cause of misassembly

Page 43

Fragment Assembly Using Repeat Graph

Reads

Genome: A B C D E F G H I J C B C D F G C

A B C D I F G H E J C B C D F G C

[Figure: repeat graph with edges A, B, C, D, E, F, G, H, I, J]

Every possible genome reconstruction corresponds to an Eulerian path in the repeat graph.
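To illustrate why a repeat graph can admit more than one reconstruction, the sketch below enumerates all Eulerian paths of a small, hypothetical repeat graph by backtracking. The graph is not the one on the slide: it has a repeat edge R with three copies between vertices u and v, and unique edges A, B, C, D. The function name is an assumption, and the exhaustive search is only practical for tiny graphs.

```python
from collections import Counter

def eulerian_label_paths(edges, start):
    """Enumerate the edge-label sequences of all Eulerian paths from start.
    edges is a list of (tail, head, label); parallel copies are listed repeatedly."""
    remaining = Counter(edges)      # multiset of unused edges
    total = len(edges)
    results = []

    def extend(node, labels):
        if len(labels) == total:    # every edge copy used exactly once
            results.append(" ".join(labels))
            return
        for edge in sorted(remaining):
            tail, head, label = edge
            if tail == node and remaining[edge] > 0:
                remaining[edge] -= 1
                extend(head, labels + [label])
                remaining[edge] += 1

    extend(start, [])
    return results

# Hypothetical repeat graph: R is a repeat with 3 copies.
toy_graph = [("x", "u", "A"),
             ("u", "v", "R"), ("u", "v", "R"), ("u", "v", "R"),
             ("v", "u", "B"), ("v", "u", "C"),
             ("v", "y", "D")]
reconstructions = eulerian_label_paths(toy_graph, "x")
print(reconstructions)
```

The toy graph has two Eulerian paths, A R B R C R D and A R C R B R D, which differ only in the order of the unique segments B and C between the repeat copies. This mirrors the two genomes on the slide, which differ by swapping E and I.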

Page 44

Fragment Assembly = Building Repeat Graph from Concatenated Reads

Key idea: The repeat graph built from concatenated reads is identical to the repeat graph built from genomic sequence if the reads “cover” the genomic sequence.

Page 45

Fragment Assembly: Building Repeat Graph from Reads

Similarity matrix

… 1 0 … 1 0 0 … 0 0 1 0 …
… … … … … … … … … … … … …
… 0 1 … 0 0 1 … 0 0 0 1 …
… … … … … … … … … … … … …
… 0 0 … 0 1 0 … 0 1 0 0 …
… 1 0 … 1 0 0 … 1 0 0 0 …
… 0 1 … 0 0 1 … 0 0 0 1 …
… … … … … … … … … … … … …
… 1 0 … 1 0 0 … 1 0 0 0 …
… 0 0 … 0 0 0 … 0 0 1 0 …
… 0 0 … 0 1 0 … 0 1 0 0 …
… 0 1 … 0 0 1 … 0 0 0 1 …
… … … … … … … … … … … … …

[Figure: reads labeled a, b and c]

Page 46

Fragment Assembly: Building Repeat Graph from Reads

Snapshots of similarity matrix

[Figure: snapshots of the similarity matrix for reads a, b and c, and the resulting repeat graph with edges x and y]

Page 47

FragmentGluer Algorithm (outline)

• Concatenate reads (in an arbitrary order!) into a single sequence

• Compute the similarity matrix for this concatenated sequence (overlap detection between reads)

• Use this matrix as a “glue” to build the repeat graph with the RepeatGluer algorithm
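The outline above can be illustrated with a toy experiment. The sketch below concatenates reads, uses exact shared k-mers as a stand-in for the similarity matrix (an assumption: the real method uses pairwise alignments that tolerate errors), glues matching positions with union-find, and counts the glued vertices. The reads, the value of k and the function name are hypothetical.

```python
from itertools import permutations

def glued_class_count(reads, k):
    """Concatenate reads, glue positions covered by identical k-mers,
    and return the number of glued vertices."""
    offsets, total = [], 0
    for read in reads:
        offsets.append(total)
        total += len(read)
    parent = list(range(total))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    first_seen = {}                        # k-mer -> first position in the concatenation
    for read, offset in zip(reads, offsets):
        for i in range(len(read) - k + 1):  # k-mers never span a read boundary
            kmer = read[i:i + k]
            if kmer in first_seen:
                for d in range(k):
                    parent[find(offset + i + d)] = find(first_seen[kmer] + d)
            else:
                first_seen[kmer] = offset + i
    return len({find(x) for x in range(total)})

# Hypothetical error-free reads covering the repeat-free genome ACGTTGCA.
reads = ["ACGTT", "GTTGC", "TGCA"]
counts = {glued_class_count(list(order), 3) for order in permutations(reads)}
print(counts)
```

Every ordering of the reads gives the same 8 glued vertices, one per position of the 8-letter genome, which is consistent with the remark that the concatenation order is arbitrary.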

Page 48

FragmentGluer algorithm

• Identify and remove chimeric reads;

• Concatenate the remaining reads into a sequence and compose similarity matrix A from the pairwise alignments of reads;

• Construct the A-Bruijn graph from the similarity matrix A;

• Remove bulges and whirls;

• Thread each read through the resulting graph and form the consensus sequence from reads;

• Define the coverage of a vertex in the graph as the number of reads that are threaded through this vertex;

• Define coverage of simple paths as average coverage of their vertices;
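The two coverage definitions on this slide translate directly into code. In the sketch below each read is represented by the list of graph vertices it is threaded through; that representation and the function names are assumptions made for illustration.

```python
from collections import Counter

def vertex_coverage(read_paths):
    """Coverage of a vertex = number of reads threaded through it."""
    coverage = Counter()
    for path in read_paths:
        coverage.update(set(path))   # a read counts once per vertex
    return coverage

def path_coverage(simple_path, coverage):
    """Coverage of a simple path = average coverage of its vertices."""
    return sum(coverage[v] for v in simple_path) / len(simple_path)

# Hypothetical threading of three reads through vertices 1..5.
read_paths = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
coverage = vertex_coverage(read_paths)
average = path_coverage([2, 3, 4], coverage)
print(dict(coverage), average)
```

Here vertex 3 is covered by all three reads, and the simple path 2, 3, 4 has average coverage (2 + 3 + 2) / 3.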

Page 49

FragmentGluer algorithm (cont’d)

• Form the repeat graph by collapsing simple paths in the graph;

• The consensus sequence of an edge in the repeat graph is defined as the consensus sequence of the corresponding simple path;

• Output repeat families as tangles in the repeat graph;

• Every tangle is a collection of edges (sub-repeats) with corresponding consensus sequences;

• Transform mate-pairs into mate-paths in the graph obtained and perform equivalent transformations on the resulting set of mate-paths;

• Use mate-pairs to resolve differences between nearly identical copies of repeats;

• Define contigs as consensus sequences of simple paths in the resulting graph;

• Assemble the resulting contigs into scaffolds by the EULER Scaffolding algorithm
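The first step in the list above, collapsing simple paths, can be sketched as a search for maximal non-branching paths in a directed graph. This is a generic illustration rather than the FragmentGluer code; the function name and the toy graph are hypothetical, and isolated cycles are ignored for brevity.

```python
from collections import defaultdict

def collapse_simple_paths(edges):
    """Return maximal non-branching paths; each becomes one repeat-graph edge.
    edges is a list of directed (tail, head) pairs."""
    out_edges = defaultdict(list)
    in_degree = defaultdict(int)
    for tail, head in edges:
        out_edges[tail].append(head)
        in_degree[head] += 1
    nodes = set(out_edges) | set(in_degree)

    def is_inner(node):
        # An inner vertex of a simple path has exactly one way in and one way out.
        return in_degree[node] == 1 and len(out_edges[node]) == 1

    paths = []
    for node in sorted(nodes):
        if is_inner(node):
            continue                    # paths start only at branching or end vertices
        for nxt in out_edges[node]:
            path = [node, nxt]
            while is_inner(nxt):        # walk until the next branching or end vertex
                nxt = out_edges[nxt][0]
                path.append(nxt)
            paths.append(path)
    return paths

# Toy graph: 1 -> 2 -> 3, then 3 branches to 4 and 5.
simple_paths = collapse_simple_paths([(1, 2), (2, 3), (3, 4), (3, 5)])
print(simple_paths)
```

The chain 1, 2, 3 collapses into a single path, while the two branches out of vertex 3 remain separate edges.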

Page 50

Benchmarking EULER+, Phrap and ARACHNE on all BACs from Human Chromosome 20

• EULER+ produced the fewest misassembled contigs

Misassembled contigs by Phrap: 37

Misassembled contigs by ARACHNE: 17

Misassembled contigs by EULER+: 7

• EULER+ also had the fewest collapsed repeat copies (4), ahead of Phrap (5) and ARACHNE (9).

• The average number of contigs per BAC was lowest for EULER+ (6.2), followed by Phrap (6.8) and ARACHNE (13.8).

Pevzner et al., 2004 (Genome Research)

Page 51

Comparison of Euler and Newbler Assemblies

Genome         Assembler     No. contigs  N50     Misassembled  Coverage  Net size
E. coli        Euler         199          46887   3             94.7      4277
E. coli        Newbler       141          60757   0             99.1      4531
E. coli        Repeat graph  94           125693  -             92.1      4560
S. pneumoniae  Euler         127          32619   1             96.8      2001
S. pneumoniae  Newbler       253          11905   0             95.0      2000
S. pneumoniae  Repeat graph  136          36004   -             97.5      2091
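The N50 column can be read with the help of a small sketch. It uses the common definition of N50: the largest length L such that contigs of length at least L contain at least half of the assembled bases. The slides do not state which variant was used (for example, relative to assembly size or to genome size), so this is an assumption, and the contig lengths below are made up.

```python
def n50(contig_lengths):
    """Largest length L such that contigs of length >= L hold at least
    half of the total assembled bases."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("contig_lengths must not be empty")

# Hypothetical contig lengths: total 54, and 10 + 9 + 8 = 27 reaches half.
example_n50 = n50([2, 3, 4, 5, 6, 7, 8, 9, 10])
print(example_n50)
```

A higher N50 means that more of the assembly sits in long contigs, which is why it is reported next to the number of contigs.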

Page 52

Comparison of Euler and Newbler Assemblies (without flowgrams)

Genome         Assembler     No. contigs  N50     Misassembled  Coverage  Net size
E. coli        Euler         199          46887   3             94.7      4277
E. coli        Newbler       311          28475   0             99.1      4531
E. coli        Repeat graph  94           125693  -             92.1      4560
S. pneumoniae  Euler         127          32619   1             96.8      2001
S. pneumoniae  Newbler       253          11905   0             95.0      2000
S. pneumoniae  Repeat graph  136          36004   -             97.5      2091

Page 53

454 Reads+Sanger Reads

[Figure: plot with x-axis "Sanger coverage" from 1x to 5x and y-axis from 0 to 50,000; series: S. pneumoniae; S. pneumoniae, 3kb mates; S. pneumoniae 30x 454]

Page 54

Acknowledgements

• Evan Eichler, Genome Sciences, University of Washington

Page 55

A B C D E F G H I J C B C D F G C

Sub-repeats: tangle edges

[Figure: repeat graph with edges A to J; tangle edges B, D, F and G are labeled "2 copies" and edge C is labeled "4 copies"]

Page 56

8328 140 628 1185 2905 381 161442 628 1185 140 628 1185 381 140 628

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

[Figure: A-Bruijn graph on segments 1 to 15, and the repeat graph obtained from it, whose edges are the segment groups 1; 2,9,13; 3,6,10,14; 4,7,11; 5; 8,12; 15, with edge lengths 8328, 140, 628, 1185, 2905, 381 and 161442]

Sub-repeats: tangle edges

3 copies

3 copies

2 copies

4 copies

Page 57

Mosaic repeats structure: an imaginary example

A B C D E F G H I J

A B C D E F G H I J C

A B C D E F G H I J C B C D

A B C D E F G H I J C B C D F G C

Page 58

Mosaic structure of repeats: a real example (BAC G12 from human chromosome Y)

8328 140 628 1185 2905 381 161442 628 1185 140 628 1185 381 140 628

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

Page 59

Mosaic structure of repeats

A B C D E F G H I J C B C D F G C

Consensus repeat elements

Repeat boundary problem (Bao & Eddy, 2002)

Mosaic representation

2 copies

2 copies

2 copies

2 copies

4 copies

Page 60

A-Bruijn graph approach: an overview

All local alignments (genomic dot plot)

Gluing

A-Bruijn Graph

Removing bulges and whirls

Repeat Graph

Page 61

Bacterial genomes assembly

All 5 bacterial genomes were mis-assembled by Phrap (either repeat collapsing or joining non-contiguous regions or both). ARACHNE and EULER+ make no assembly errors except for one genome.