A hierarchical approach to building contig scaffolds

23
A hierarchical approach to building contig scaffolds Mihai Pop Dan Kosack Steven L. Salzberg Genome Research 14(1), pp. 149-159, 2004.

description

A hierarchical approach to building contig scaffolds. Mihai Pop Dan Kosack Steven L. Salzberg Genome Research 14(1), pp. 149-159, 2004. Sequencing pipeline. Random sequencing un-related reads 500-700 base-pairs Assembly un-related contigs 5,000-10,000 base-pairs Scaffolding - PowerPoint PPT Presentation

Transcript of A hierarchical approach to building contig scaffolds

Page 1: A hierarchical approach to building contig scaffolds

A hierarchical approach to building contig scaffolds

Mihai PopDan Kosack

Steven L. SalzbergGenome Research 14(1), pp. 149-159, 2004.

Page 2: A hierarchical approach to building contig scaffolds

Sequencing pipeline

• Random sequencing• un-related reads 500-700 base-pairs

• Assembly• un-related contigs 5,000-10,000 base-pairs

• Scaffolding• un-related scaffolds 30,000-50,000 base-pairs

• Finishing/gap closure• completed genomes millions-billions of base-

pairs

Page 3: A hierarchical approach to building contig scaffolds

Scaffolding

• Given a set of non-overlapping contigsorder and orient them along a chromosome

III III IV

I

IIIII

IV

Page 4: A hierarchical approach to building contig scaffolds

Clone-mates

Clone

Insert

F R

FR

I II

R

I

F

II

F

II

R

I

Page 5: A hierarchical approach to building contig scaffolds

Scaffolder output

Sequencing gaps

Physical gaps

Page 6: A hierarchical approach to building contig scaffolds

Problems with the data

• Incorrect sizing of inserts• cut from gel – sizing is subjective• error increases with size

• Chimeras (ends belong to different inserts)• biological reasons (esp. for large sized inserts)• sample tracking (human error)

• Software must handle a certain error rate.

Page 7: A hierarchical approach to building contig scaffolds

Theoretical abstraction

• Given a set of entities (reads/contigs) and constraints between them (overlaps/mate pairs) provide a linear/circular embedding that preserves most constraints.

Page 8: A hierarchical approach to building contig scaffolds

Graph representation• Nodes: contigs• Directed edges: constraints on relative

placement of contigs – relative order and relative orientation

• Embedding: order (coordinate along chromosome) and orientation (strand sampled)

Page 9: A hierarchical approach to building contig scaffolds

Challenges

• Orientation – node coloring problem (forward/reverse)• feasibility – no cycles with odd number of

“reversal” edges (blue edges)• optimality – remove minimum number of edges

such that a solution exists (NP-hard)

Page 10: A hierarchical approach to building contig scaffolds

Challenges

• Ordering – generate a linear embedding• feasibility – lengths of parallel DAG paths are

consistent• optimality – remove minimum number of edges

such that DAG is feasible (NP-hard)

Page 11: A hierarchical approach to building contig scaffolds

The real world

• Use of scaffolds• Analysis – longest unambiguous sub-graphs• Finishing – present all “reliable” relationships

between contigs• Sources of error

• mis-assemblies• sizing errors (increases with library size)• chimeras

Page 12: A hierarchical approach to building contig scaffolds

Ambiguous scaffold

I

II

III

I II III

I IIIII

I III I’ II

Page 13: A hierarchical approach to building contig scaffolds

Hierarchical scaffolding

1. For each contig pair, consolidate all linking data into a single relationship – 2 correct links required

Page 14: A hierarchical approach to building contig scaffolds

Hierarchical scaffolding

2. Use most reliable links to build scaffolds

3. Repeatedly build super-scaffolds based on less reliable linking data

Page 15: A hierarchical approach to building contig scaffolds

Rationale

Hierarchical step

Problemcomplexity

problem size (#nodes)error rate

Page 16: A hierarchical approach to building contig scaffolds

Linking information

• Overlaps

• Mate-pair links

• Similarity links

• Physical markers

• Gene synteny

reference genome

physical map

Page 17: A hierarchical approach to building contig scaffolds

BAMBOO (BAMBUS)Best effort Attempt

Multiple Branches allowedOrder, Orient

Page 18: A hierarchical approach to building contig scaffolds

Inputs

• Set of contigs: names and lengths• Groups of contig links:

• groups correspond to “quality” of links• link: relative distance between contig origins

relative orientation of contigs

• Priorities for each group – specify order in which links are considered

BE EB

min <= dist <= max

Page 19: A hierarchical approach to building contig scaffolds

Outputs• XML representation of layout:

• contig orientations• contig position (x-coordinate of contig origin)• links used to construct layout

• Graphical display of the layout• uses GraphViz package from AT&T

Page 20: A hierarchical approach to building contig scaffolds

1.0 release

• XML input not yet supported

• All scaffold placed in the same output file

• Only Linux executable released

• Hacker friendly

Page 21: A hierarchical approach to building contig scaffolds

Current release: 2.33

• XML input and more general output module• Collection of input modules from common

assembly formats• Better handling of priority data• Repeat masking features• More platforms supported and source code

released as open source• http://amos.sourceforge.net

Page 22: A hierarchical approach to building contig scaffolds

Future enhancements

• Option to generate un-ambiguous (non-branching) scaffolds

• Better layout algorithms• Specialized drawing tools• Interactive browser

• Represent/handle multiple haplotypes

Page 23: A hierarchical approach to building contig scaffolds

Acknowledgements

• Dan Kosack• Steven Salzberg• Martin Shumway

• Hean Koo• Luke Tallon• Jessica Vamathevan