A hierarchical approach to building contig scaffolds Mihai Pop Dan Kosack Steven L. Salzberg Genome Research 14(1), pp. 149-159, 2004.

A hierarchical approach to building contig scaffolds

Mihai Pop, Dan Kosack, Steven L. Salzberg

Genome Research 14(1), pp. 149-159, 2004.

Sequencing pipeline

• Random sequencing
  • unrelated reads, 500-700 base pairs
• Assembly
  • unrelated contigs, 5,000-10,000 base pairs
• Scaffolding
  • unrelated scaffolds, 30,000-50,000 base pairs
• Finishing/gap closure
  • completed genomes, millions to billions of base pairs

Scaffolding

• Given a set of non-overlapping contigs, order and orient them along a chromosome

[Figure: contigs I, II, III, and IV]

Clone-mates

[Figure: a clone and its insert, with end reads F and R placed in contigs I and II]

Scaffolder output

Sequencing gaps

Physical gaps

Problems with the data

• Incorrect sizing of inserts
  • cut from gel – sizing is subjective
  • error increases with size
• Chimeras (ends belong to different inserts)
  • biological reasons (esp. for large inserts)
  • sample tracking (human error)

• Software must handle a certain error rate.

Theoretical abstraction

• Given a set of entities (reads/contigs) and constraints between them (overlaps/mate pairs) provide a linear/circular embedding that preserves most constraints.

Graph representation

• Nodes: contigs
• Directed edges: constraints on relative placement of contigs – relative order and relative orientation
• Embedding: order (coordinate along chromosome) and orientation (strand sampled)

Challenges

• Orientation – node coloring problem (forward/reverse)
  • feasibility – no cycles with an odd number of “reversal” edges (blue edges)
  • optimality – remove the minimum number of edges such that a solution exists (NP-hard)
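The orientation feasibility test can be sketched as a two-coloring of the contig graph. This is only an illustration, not the slides' implementation: the function name `orient_contigs` and the `(a, b, reversal)` link format are assumptions, and the sketch checks feasibility only (it does not attempt the NP-hard edge-removal step).

```python
from collections import deque


def orient_contigs(contigs, links):
    """Two-color the contig graph as forward ('F') or reverse ('R').

    links: iterable of (a, b, reversal); reversal=True means contigs a and b
    must end up in opposite orientations (a hypothetical input format).
    Returns a dict contig -> 'F'/'R', or None when some cycle contains an
    odd number of reversal edges.
    """
    adj = {c: [] for c in contigs}
    for a, b, reversal in links:
        adj[a].append((b, reversal))
        adj[b].append((a, reversal))

    flipped = {}
    for start in contigs:
        if start in flipped:
            continue
        flipped[start] = False  # arbitrarily call the first contig forward
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v, reversal in adj[u]:
                wanted = flipped[u] ^ reversal
                if v not in flipped:
                    flipped[v] = wanted
                    queue.append(v)
                elif flipped[v] != wanted:
                    return None  # odd number of reversal edges on a cycle
    return {c: "R" if f else "F" for c, f in flipped.items()}
```

A cycle with two reversal edges is feasible, while a cycle with one is not; in the second case a scaffolder would have to discard at least one edge.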

Challenges

• Ordering – generate a linear embedding
  • feasibility – lengths of parallel DAG paths are consistent
  • optimality – remove the minimum number of edges such that the DAG is feasible (NP-hard)
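The ordering feasibility condition can be sketched in the same spirit: propagate coordinates along the edges and verify that every parallel path agrees. The function name `place_contigs`, the `(a, b, dist)` edge format, and the `tolerance` parameter are assumptions made for illustration; the greedy propagation is a simplification and does not solve the NP-hard optimization.

```python
from collections import deque


def place_contigs(contigs, edges, tolerance=0):
    """Assign an x-coordinate to each contig origin.

    edges: iterable of (a, b, dist) meaning the origin of b lies dist bases
    after the origin of a (hypothetical format; contigs already oriented).
    Returns a dict contig -> coordinate, or None when two paths between the
    same contigs disagree by more than tolerance.
    """
    adj = {c: [] for c in contigs}
    for a, b, dist in edges:
        adj[a].append((b, dist))
        adj[b].append((a, -dist))  # walking the edge backwards

    pos = {}
    for start in contigs:
        if start in pos:
            continue
        pos[start] = 0  # each connected component is anchored at 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v, dist in adj[u]:
                expected = pos[u] + dist
                if v not in pos:
                    pos[v] = expected
                    queue.append(v)
                elif abs(pos[v] - expected) > tolerance:
                    return None  # parallel paths have inconsistent lengths
    return pos
```

With edges I→II (1000), II→III (2000), and I→III (3000) the two paths from I to III agree; changing the last distance to 5000 makes the layout infeasible unless the tolerance absorbs the 2000-base discrepancy.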

The real world

• Use of scaffolds
  • Analysis – longest unambiguous sub-graphs
  • Finishing – present all “reliable” relationships between contigs
• Sources of error
  • mis-assemblies
  • sizing errors (increase with library size)
  • chimeras

Ambiguous scaffold

[Figure: contigs I, II, and III with alternative layouts, one of which includes a copy I’]

Hierarchical scaffolding

1. For each contig pair, consolidate all linking data into a single relationship – 2 correct links required

Hierarchical scaffolding

2. Use most reliable links to build scaffolds

3. Repeatedly build super-scaffolds based on less reliable linking data
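The three steps above can be sketched as follows. This is a simplified illustration and not the Bambus implementation: it tracks only scaffold membership and relative orientation (distances and ordering are ignored), and the names `OrientedSets`, `bundle_links`, and `hierarchical_scaffold`, as well as the `(a, b, reversal)` link format, are assumptions.

```python
from collections import Counter


class OrientedSets:
    """Union-find that also tracks orientation relative to the scaffold root."""

    def __init__(self, contigs):
        self.parent = {c: c for c in contigs}
        self.flip = {c: False for c in contigs}  # flipped relative to parent

    def find(self, c):
        """Return (root, flipped-relative-to-root), compressing the path."""
        if self.parent[c] == c:
            return c, False
        root, parent_flip = self.find(self.parent[c])
        self.parent[c] = root
        self.flip[c] ^= parent_flip
        return root, self.flip[c]

    def add_link(self, a, b, reversal):
        """Merge the scaffolds of a and b; return False on a conflict."""
        ra, fa = self.find(a)
        rb, fb = self.find(b)
        if ra == rb:
            return (fa ^ fb) == reversal
        self.parent[rb] = ra
        self.flip[rb] = fa ^ fb ^ reversal
        return True


def bundle_links(links, min_links=2):
    """Step 1: keep a contig pair only if min_links links agree on it."""
    counts = Counter((min(a, b), max(a, b), rev) for a, b, rev in links)
    return sorted(key for key, n in counts.items() if n >= min_links)


def hierarchical_scaffold(contigs, link_groups):
    """Steps 2-3: apply link groups in order, most reliable group first."""
    sets = OrientedSets(contigs)
    rejected = []
    for level, links in enumerate(link_groups):
        for a, b, rev in bundle_links(links):
            if not sets.add_link(a, b, rev):
                rejected.append((level, a, b, rev))
    scaffolds = {}
    for c in contigs:
        root, flipped = sets.find(c)
        scaffolds.setdefault(root, []).append((c, "R" if flipped else "F"))
    return list(scaffolds.values()), rejected
```

Because reliable groups are applied first, a less reliable bundle that contradicts an existing scaffold is rejected instead of corrupting it, and single unconfirmed links never reach the scaffolding step.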

Rationale

[Figure: problem complexity versus hierarchical step; plotted quantities: problem size (#nodes) and error rate]

Linking information

• Overlaps

• Mate-pair links

• Similarity links

• Physical markers

• Gene synteny

[Figure labels: reference genome, physical map]

BAMBOO (BAMBUS)

Best effort Attempt
Multiple Branches allowed
Order, Orient

Inputs

• Set of contigs: names and lengths
• Groups of contig links:
  • groups correspond to “quality” of links
  • link: relative distance between contig origins, relative orientation of contigs

• Priorities for each group – specify order in which links are considered

[Figure: two contigs with ends labeled B and E; link constraint: min <= dist <= max]
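One way to model these inputs is shown below. This is a sketch under stated assumptions: the `Link` fields, the helper names `link_satisfied` and `links_by_priority`, and the convention that a lower priority number is considered first are all hypothetical and do not describe the actual Bambus input format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    a: str                  # first contig name
    b: str                  # second contig name
    min_dist: int           # lower bound on distance between contig origins
    max_dist: int           # upper bound on distance between contig origins
    same_orientation: bool  # required relative orientation


def link_satisfied(link, position, forward):
    """Check min <= dist <= max and the relative orientation of one link.

    position: contig -> x-coordinate of its origin
    forward:  contig -> True if placed on the forward strand
    """
    dist = position[link.b] - position[link.a]
    same = forward[link.a] == forward[link.b]
    return link.min_dist <= dist <= link.max_dist and same == link.same_orientation


def links_by_priority(groups, priorities):
    """Return the link groups in the order they should be considered.

    groups: group name -> list of Link; priorities: group name -> int
    (assumed convention: smaller number means considered earlier).
    """
    return [groups[name] for name in sorted(groups, key=lambda n: priorities[n])]
```

For example, a link requiring contigs I and II to lie 900 to 1,100 bases apart in the same orientation is satisfied by origins at 0 and 1,000 but not by origins at 0 and 2,000.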

Outputs

• XML representation of layout:
  • contig orientations
  • contig position (x-coordinate of contig origin)
  • links used to construct layout
• Graphical display of the layout
  • uses GraphViz package from AT&T
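To illustrate how a layout could be handed to GraphViz, the sketch below writes a layout as DOT text. The function name `layout_to_dot`, the input shapes, and the label format are assumptions for illustration; this is not the file format that Bambus itself produces.

```python
def layout_to_dot(placements, links_used):
    """Render a scaffold layout as GraphViz DOT text.

    placements: contig -> (x, forward), where x is the origin coordinate
    links_used: iterable of (a, b) contig pairs used to build the layout
    """
    lines = ["digraph layout {", "  rankdir=LR;"]
    # List contigs from left to right along the scaffold.
    for name, (x, forward) in sorted(placements.items(), key=lambda item: item[1][0]):
        strand = "+" if forward else "-"
        lines.append(f'  "{name}" [label="{name} ({strand}) x={x}"];')
    for a, b in links_used:
        lines.append(f'  "{a}" -> "{b}";')
    lines.append("}")
    return "\n".join(lines)
```

The resulting text can be saved to a file and rendered with the `dot` command from the GraphViz package.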

1.0 release

• XML input not yet supported

• All scaffolds placed in the same output file

• Only Linux executable released

• Hacker friendly

Current release: 2.33

• XML input and more general output module
• Collection of input modules from common assembly formats
• Better handling of priority data
• Repeat masking features
• More platforms supported and source code released as open source
• http://amos.sourceforge.net

Future enhancements

• Option to generate unambiguous (non-branching) scaffolds

• Better layout algorithms
• Specialized drawing tools
• Interactive browser

• Represent/handle multiple haplotypes

Acknowledgements

• Dan Kosack
• Steven Salzberg
• Martin Shumway
• Hean Koo
• Luke Tallon
• Jessica Vamathevan