Post on 01-Feb-2022
Genome Assembly: The Orientation Problem
Paul M. Bodily, Dr. Mark Clement
Computational Sciences Laboratory, Department of Computer Science
Brigham Young University
Background: Genome Assembly
[Figure: overlapping sequencing reads tiled across the genome sequence]
Background: Genome Assembly
Read 1:    ACCCGGCGGCAGGAGAGGGGATGAAGATGGCGGACGCGAAGC
Read 2:     CCCGGCGGCAGGAGAGGGGATGAAGATGGCGGACGCGAAGCA
Read 3:      CCGGCGGCAGGAGAGGGGATGAAGATGGCGGACGCGAAGCAG
Read 4:       CGGCGGCAGGAGAGGGGATGAAGATGGCGGACGCGAAGCAGA
Read 5:        GGCGGCAGGAGAGGGGATGAAGATGGCGGACGCGAAGCAGAA
Read 6:         GCGGCAGGAGAGGGGATGAAGATGGCGGACGCGAAGCAGAAG
Read 7:          CGGCAGGAGAGGGGATGAAGATGGCGGACGCGAAGCAGAAGC
…
Consensus: ACCCGGCGGCAGGAGAGGGGATGAAGATGGCGGACGCGAAGCAGAAGC…

Reads
Read 1: TATAGTAGCTGATTGTATTATTGATTGTATTGTATACTATATTAA
Read 2: CATAGTAGCTGATTGTATTATTGATTGTATTGTATACTATATTAA
Read 3:  ATAGTAGCTGATTGTATTATTGATTGTATTGTATACTATATTAAA
Read 4:  ATAGTAGCTGATTGTATTATTGATTGTATTGTATACTATATTAAG
…
Possible assemblies:
Contig (reads 1 & 3): TATAGTAGCTGATTGTATTATTGATTGTATTGTATACTATATTAAA
Contig (reads 1 & 4): TATAGTAGCTGATTGTATTATTGATTGTATTGTATACTATATTAAG
Contig (reads 2 & 3): CATAGTAGCTGATTGTATTATTGATTGTATTGTATACTATATTAAA
Contig (reads 2 & 4): CATAGTAGCTGATTGTATTATTGATTGTATTGTATACTATATTAAG
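The four possible assemblies can be reproduced with a small overlap merge. This is a minimal sketch, not the assembler used in this work: the function `merge`, the parameter `min_overlap`, and the helper variable `core` are names introduced here for illustration. Reads 1 and 2 differ only in their first base and reads 3 and 4 only in their last base, so the sketch writes each read as the shared 44-base core plus one extra base.

```python
from itertools import product


def merge(left, right, min_overlap=1):
    """Join two reads on the longest suffix of `left` that is a prefix of `right`.

    Returns the merged sequence, or None if no overlap of at least
    `min_overlap` bases exists. (Hypothetical helper for illustration.)
    """
    for k in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:k]):
            return left + right[k:]
    return None


# Shared portion of reads 1-4 from the slide.
core = "ATAGTAGCTGATTGTATTATTGATTGTATTGTATACTATATTAA"
reads = {1: "T" + core, 2: "C" + core, 3: core + "A", 4: core + "G"}

# Every (left read, right read) pairing yields a different contig,
# which is why the reads alone do not determine a single assembly.
contigs = {
    (i, j): merge(reads[i], reads[j], min_overlap=len(core))
    for i, j in product((1, 2), (3, 4))
}

for (i, j), contig in contigs.items():
    print(f"Contig (reads {i} & {j}): {contig}")
```

Chaining the same merge over reads that are each shifted by one base gives a single consensus, because each step has only one consistent extension.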
Read 1: ACCCGGCGGCAGGAGAGGGGATGAAGATGGCGGACGCGAAGC
Read 2: CCCGGCGGCAGGAGAGGGGATGAAGATGGCGGACGCGAAGCA
Read 3: CCGGCGGCAGGAGAGGGGATGAAGATGGCGGACGCGAAGCAG
Read 4: CGGCGGCAGGAGAGGGGATGAAGATGGCGGACGCGAAGCAGA
Read 5: GGCGGCAGGAGAGGGGATGAAGATGGCGGACGCGAAGCAGAA
Read 6: GCGGCAGGAGAGGGGATGAAGATGGCGGACGCGAAGCAGAAG
Missing Read 7: ?????????????????????????????????????????
…
Consensus: ACCCGGCGGCAGGAGAGGGGATGAAGATGGCGGACGCGAAGCAGAAG?
Background: Genome Assembly
Maternal: TTTTTTAATGTTTACATTTATCTCTATGTTTACCTTTTTAGTCACATTGACCTGCTGGCTGAATACCTCAAATAGTCCAGTAGAGGGCAGTCCACCAGGCAGAAAAGGTTAGGCGTTTTGGTTTCACATCTTTGCTGGGGAATAATAGGGGAAATGGCTGTTTTTGCTAATTTTTAGCTAATATCTAGCCAGGAGAGCAAGCACATAATTCTGGACAAATAAGTCATATACCTGTT
Paternal: TTTTTTAATGTTTACATTTATCTCTATGTTTACCTTTTTAGTCACATTGACCTGCTGGCTCAATACCTCAAATAGTCCAGTAGAGGGCAGTCCACCAGGCAGAAAAGGTTAGGCGTTTTGGTTTCACATCTTTGCTGGGGAATAATAGGGGAAATGGCTGTTTTTGCTGATTTTTAGCTAATATCTAGCCAGGAGAGCAAGCACATAATTCTGGACAAATAAGTCATATACCTGTT
Background: Genome Assembly
Graph nodes (DNA sequences):
TTTTTTAA…ATTGACC
ATCTTTGC…TTTAGCTA
ATATCTAGC…TACCTGTT
TTTTTTAA…ACCTTTTTA
TGCTGGCT…GAGGTTAGG
CGTTT…GGTTTCAC
TTTTTTAATGTTTACATTTATCTCTATGTTTACCTTTTTAGTCACATTGACCTGCTGGCTGAATACCTCAAATAGTCCAGTAGAGGGCAGTCCACCAGGCAGAAAAGGTTAGGCGTTTTGGTTTCACATCTTTGCTGGGGAATAATAGGGGAAATGGCTGTTTT
TTTTTTAATGTTTACATTTATCTCTATGTTTACCTTTTTAGTCACATTGACCTGCTGGCTCAATACCTCAAATAGTCCAGTAGAGGGCAGTCCACCAGGCAGAAAAGGTTAGGCGTTTTGGTTTCACATCTTTGCTGGGGAATAATAGGGGAAATGGCTGTTTT
Problem Statement
Given a graph where nodes represent DNA sequences and edges represent possible scaffoldings of these sequences, find the subgraph that maximizes total edge weight such that (1) in any traversal of the subgraph, each node can be traversed in only a single direction, and (2) the subgraph is void of cycles.
Related Problem: Minimum Spanning Tree
Given a connected weighted graph, find a minimum spanning tree.
http://en.wikipedia.org/wiki/Minimum_spanning_tree
Review: Kruskal’s Algorithm (1956)
Create a forest F (a set of trees), where each vertex in the graph is a separate tree
Create a set S containing all the edges in the graph
While S is nonempty and F is not yet spanning:
    remove an edge with minimum weight from S
    if that edge connects two different trees, then add it to the forest, combining two trees into a single tree
    otherwise discard that edge
At the termination of the algorithm, the forest forms a minimum spanning forest of the graph. If the graph is connected, the forest has a single component and forms a minimum spanning tree.
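The review steps above can be sketched in a few lines of Python. This is an illustrative version only: the function name `kruskal`, the `(weight, u, v)` edge tuples, and the integer vertex labels are assumptions made for the example, and a union-find array stands in for the forest F.

```python
def kruskal(num_vertices, edges):
    """Return the edges of a minimum spanning forest.

    `edges` is an iterable of (weight, u, v) tuples with vertices
    numbered 0 .. num_vertices - 1 (a convention assumed for this sketch).
    """
    parent = list(range(num_vertices))  # each vertex starts as its own tree

    def find(v):
        # Follow parent links to the tree root, halving the path as we go.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    forest = []
    for weight, u, v in sorted(edges):  # minimum weight first
        if len(forest) == num_vertices - 1:
            break  # F is already spanning
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # edge connects two different trees
            parent[root_u] = root_v
            forest.append((weight, u, v))
        # otherwise the edge is discarded
    return forest


example_edges = [(1, 0, 1), (3, 1, 2), (2, 0, 2), (4, 2, 3)]
mst = kruskal(4, example_edges)
print(mst, sum(w for w, _, _ in mst))
```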
Orientation Problem: Proposed Solution
Create a forest F (a set of trees), where each vertex in the graph is a separate tree
Create a set S containing all the edges in the graph
While S is nonempty:
    remove an edge with maximum weight from S
    if that edge connects two different trees, then add it to the forest, combining two trees into a single tree (reconcile orientations)
    if that edge connects two nodes within the same tree and suggests a consistent orientation, then add it to the tree
    otherwise discard that edge
Orientation Problem: Proposed Solution (cont.)
(reconcile orientations): if two trees have consistent orientations within themselves, but the edge between them suggests an inconsistent orientation, then we can toggle the orientation of all nodes in one of the trees (the choice of tree is arbitrary). Each tree remains internally consistent, which then allows us to add the inter-tree edge.
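One way to sketch the proposed solution, including the reconcile step, is a union-find structure that also records each node's orientation relative to its tree root. This is an assumption-laden illustration rather than the authors' implementation: the function name `orient`, the edge format `(weight, u, v, same)`, and the boolean `same` (True when the edge suggests both nodes share an orientation) are all introduced here. Instead of literally toggling every node in a tree, the sketch flips a single bit at the root being merged, which has the same effect.

```python
def orient(num_nodes, edges):
    """Greedy maximum-weight edge selection with consistent orientations.

    `edges` holds (weight, u, v, same) tuples; `same` is True if the edge
    suggests u and v have the same orientation (hypothetical encoding).
    Returns (kept edges, discarded edges, orientation bit per node).
    """
    parent = list(range(num_nodes))
    flip = [0] * num_nodes  # orientation relative to parent: 0 same, 1 reversed

    def find(v):
        # Returns (root, orientation of v relative to root), compressing paths.
        if parent[v] == v:
            return v, 0
        root, parent_flip = find(parent[v])
        flip[v] ^= parent_flip
        parent[v] = root
        return root, flip[v]

    kept, discarded = [], []
    for edge in sorted(edges, key=lambda e: e[0], reverse=True):  # max weight first
        _, u, v, same = edge
        (root_u, flip_u), (root_v, flip_v) = find(u), find(v)
        wanted = 0 if same else 1
        if root_u != root_v:
            # Reconcile: orient u's whole tree so the new edge is satisfied.
            parent[root_u] = root_v
            flip[root_u] = flip_u ^ flip_v ^ wanted
            kept.append(edge)
        elif flip_u ^ flip_v == wanted:
            kept.append(edge)  # same tree, consistent orientation
        else:
            discarded.append(edge)  # same tree, inconsistent orientation
    orientation = [find(v)[1] for v in range(num_nodes)]
    return kept, discarded, orientation


example = [(9, 0, 1, True), (8, 1, 2, False), (7, 2, 3, True),
           (1, 0, 2, True), (5, 0, 3, False)]
kept, discarded, orientation = orient(4, example)
print(kept, discarded, orientation)
```

In this example the weight-1 edge is the only one discarded, because the heavier edges already imply that nodes 0 and 2 have opposite orientations.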
Cycles
In a bidirected graph, cycles become significantly more complex
Applying the same rule as Kruskal's algorithm (discard any edge whose endpoints are already in the same tree) would prevent cycles, but it would also remove edges that do not create cycles
What does the green subgraph mean?
TGCTGGCT…GAGGTTAGG
TTTTTTAA…ATTGACC
Results: Takeaways
In reality, most inconsistent orientations derive from spurious edges (errors)
Inconsistent orientations do not happen very often
Inversions and repeats: our biological assumption holds true most of the time, but not always
The algorithm works well for the problem it is designed to solve