Fragment Assembly

06/11/22 1 Fragment Assembly


Transcript of Fragment Assembly

Page 1: Fragment Assembly

04/22/23 1

Fragment Assembly

Page 2: Fragment Assembly

Introduction

Fragments are typically 200-700 bp long
The “target” string is about 30k-100k bp long
Problem: given a set of fragments, reconstruct the target

Page 3: Fragment Assembly

Introduction

The solution is a multiple alignment of the fragments, ignoring spaces at the ends
The alignment is called the “layout”
The output is called the “consensus sequence”
Finding it is an optimization problem

Page 4: Fragment Assembly

Complications

Base-call errors:
  Substitution errors [p 107]
  Insertion errors (possibly from the host sequence) [p 108, fig 4.3]
  Deletion errors [fig 4.4]
Majority voting (or some other form of optimization) resolves them
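The majority-voting idea can be sketched as follows. This is a minimal illustration, not the textbook's procedure: it assumes the layout is already given as equal-length rows, with '-' for a gap inside a fragment and ' ' for positions a fragment does not cover. The function name `majority_consensus` and the sample layout are hypothetical.

```python
from collections import Counter


def majority_consensus(layout):
    """Column-wise majority vote over an aligned layout (assumed format).

    Each row has the same length; '-' is a gap inside a fragment and
    ' ' marks columns that the fragment does not cover.
    """
    consensus = []
    for column in zip(*layout):
        votes = Counter(c for c in column if c != " ")
        if not votes:
            continue  # no fragment covers this column
        winner, _ = votes.most_common(1)[0]
        if winner != "-":  # a winning gap means the column is an insertion error
            consensus.append(winner)
    return "".join(consensus)


# Third row carries a substitution error (A instead of G) in column 3.
layout = [
    "ACCGT    ",
    " CCGTGC  ",
    "  CATGCTT",
]
print(majority_consensus(layout))
```

Ties are broken arbitrarily here; a real assembler would typically also weigh base-call quality.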

Page 5: Fragment Assembly

Complications

Chimeras: two non-contiguous fragments get joined into a single fragment [p 109, fig 4.5]
  They need to be weeded out in a preprocessing step
Similarly, contaminant fragments (possibly from the host) need to be filtered out

Page 6: Fragment Assembly

Complications

Unknown orientation:
  Fragments may come from either strand
  For a fragment from the opposite strand, its reverse complement must be in the target string
  Consequence: try both the forward and the reverse-complement form of each fragment (2^n combinations in the worst case, for n fragments) [p 109, fig 4.6]
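As a small illustration of the two orientations, the sketch below computes the reverse complement of a DNA fragment. The helper names are hypothetical, and only the four unambiguous bases are handled.

```python
# Translation table for the four unambiguous DNA bases (both cases).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(fragment):
    """Complement every base, then reverse the string."""
    return fragment.translate(_COMPLEMENT)[::-1]


def orientations(fragment):
    """Both forms of a fragment that an assembler has to consider."""
    return (fragment, reverse_complement(fragment))


print(orientations("AACG"))
```

With n fragments there are 2^n ways to pick one orientation per fragment, which is where the worst-case count on the slide comes from.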

Page 7: Fragment Assembly

Complications

Repeats: regions (superstrings of some fragments) may repeat in the target
  Problem 1: under approximate alignment, which copy does a fragment really come from? [p 110, fig 4.7]
  Problem 2: where should the inter-repeat fragments go? [p 111, fig 4.8, fig 4.9]
  Inverted repeats: a repeat of the reverse complement [fig 4.10]

Page 8: Fragment Assembly

Complications

Insufficient coverage:
  The chance of covering the whole target increases with redundancy (a heuristic: sample fragments totaling 8 times the target length)
  The chance of covering a gap decreases once it has remained uncovered after many fragments have been aligned; further random sampling is not a good solution here

Page 9: Fragment Assembly

Complications

Insufficient coverage:
  With insufficient coverage you get multiple “contigs,” not one contig
  A “t-contig” is a contig in which adjacent fragments are expected to overlap by at least t
  Expected number of contigs: [p 112, formula 4.1]
  A lower t means fewer contigs (more segments get aligned together) but a weaker consensus
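The slide cites formula 4.1 without reproducing it. A commonly used estimate of this kind is the Lander-Waterman approximation, n * exp(-n(l - t)/T), for n fragments of length l sampled from a target of length T with required overlap t. Whether this matches formula 4.1 term for term is an assumption here; the sketch only illustrates how the count moves with t.

```python
import math


def expected_contigs(n, frag_len, target_len, t):
    """Lander-Waterman style estimate (assumed form, see lead-in)."""
    return n * math.exp(-n * (frag_len - t) / target_len)


# Hypothetical numbers: 100 fragments of 500 bp on a 50k bp target.
for t in (50, 100, 200):
    print(t, round(expected_contigs(100, 500, 50_000, t), 1))
```

Under this estimate a smaller t gives a smaller expected number of contigs, in line with the last bullet of the slide.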

Page 10: Fragment Assembly

Reconstruction

Shortest common superstrings are not the best solution: compare Fig 4.12 and Fig 4.13 (p 115-116)

Page 11: Fragment Assembly

Reconstruction

A superstring is to be reconstructed out of the fragments
This is an alignment problem with no end penalty
d_s is the edit distance score without end penalty: the edit distance d minimized over the substrings of the longer string
Fig 4.14 (p 117) shows the best-aligned substring match
Note: a matched character is charged 0, a mismatch 1, and a gap 2, since this is a “distance” rather than a “similarity”
We will write d for d_s

Page 12: Fragment Assembly

Reconstruction

f is an approximate substring of S at error level e if d(f, S) <= e|f|
  e = 0 means no error is allowed
  e > 0 allows insertion, deletion, and substitution errors
Both f and its reverse complement f- must be considered for a match

Page 13: Fragment Assembly

Reconstruction: Problem

Input: a set F of fragments and an error level e
Output: a shortest possible string S such that for all f in F, min(d(f, S), d(f-, S)) <= e|f|
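The acceptance condition min(d(f, S), d(f-, S)) <= e|f| can be checked with a small dynamic program. The sketch below is an illustration under assumptions: it uses the charges quoted earlier in the deck (match 0, mismatch 1, gap 2), lets f start and end anywhere in S at no cost, and uses hypothetical function names.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(f):
    return f.translate(_COMPLEMENT)[::-1]


def substring_distance(f, s, mismatch=1, gap=2):
    """Smallest edit distance between f and any substring of s."""
    # Row 0 is all zeros: f may start anywhere in s at no cost.
    prev = [0] * (len(s) + 1)
    for i in range(1, len(f) + 1):
        curr = [i * gap] + [0] * len(s)
        for j in range(1, len(s) + 1):
            diag = prev[j - 1] + (0 if f[i - 1] == s[j - 1] else mismatch)
            curr[j] = min(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    # Taking the minimum over the last row lets f end anywhere in s.
    return min(prev)


def explains(f, s, e):
    """True if f or its reverse complement fits s at error level e."""
    d = min(substring_distance(f, s),
            substring_distance(reverse_complement(f), s))
    return d <= e * len(f)


S = "ACCGTGCTT"
print(explains("GTG", S, 0), explains("CAC", S, 0), explains("GAG", S, 0))
```

A candidate S would be accepted for a fragment set F when `explains(f, S, e)` holds for every f in F.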

Page 14: Fragment Assembly

Reconstruction: Multicontig

How much overlap do we require between strings?
Ideally, each column in the layout L should hold the same character, for all columns 1 through |L|
Fig 4.4 (p 118): t-contigs for t = 3, 2, 1
There is a trade-off between t and the number of t-contigs

Page 15: Fragment Assembly

Reconstruction: Multicontig

S is an e-consensus sequence for a contig, for 0 <= e <= 1, if the edit distance d(f, S) <= e|f| for every fragment f in it
Multicontig problem:
  Input: a set F, an integer t >= 0, and an error level 0 <= e <= 1
  Output: a minimum partition of F (fewest parts) in which each part Ci is a t-contig with an e-consensus

Page 16: Fragment Assembly

Reconstruction: Overlap Multigraph (OMG)

Nodes are the fragments
A directed arc labeled t runs from one node to another when the t-suffix of the first equals the t-prefix of the second (t is the overlap length)
Arcs exist between all pairs of nodes, but there are no self-loops
Fig 4.15 (p 121): example
Length of the superstring built along a path = total length of all fragments involved - total weight (overlap) along the path
A maximum-weight Hamiltonian path is what we are looking for in this graph: it gives the superstring with maximum overlap
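The relation between path weight and superstring length can be checked on a toy example. The sketch below assumes exact overlaps and keeps only the longest overlap per ordered pair, so it is a simplification of the multigraph; the function names and fragments are hypothetical.

```python
def overlap(a, b):
    """Length of the longest proper suffix of a that is a prefix of b."""
    for t in range(min(len(a), len(b)) - 1, 0, -1):
        if a.endswith(b[:t]):
            return t
    return 0


def superstring_along_path(path):
    """Merge the fragments in path order, sharing each overlap once."""
    result = path[0]
    for prev, nxt in zip(path, path[1:]):
        result += nxt[overlap(prev, nxt):]
    return result


path = ["ACCGT", "CGTGC", "TGCTT"]
weight = sum(overlap(a, b) for a, b in zip(path, path[1:]))
s = superstring_along_path(path)
# The superstring is shorter than the fragments' total by the path weight.
print(s, len(s), sum(map(len, path)) - weight)
```

Because every overlap is counted once instead of twice, a heavier path gives a shorter superstring.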

Page 17: Fragment Assembly

Reconstruction

Fragments that are substrings of other fragments in the set are noise: remove them
Draw the OMG of the substring-free set of fragments
A shortest common superstring always corresponds to a Hamiltonian path in this graph

Page 18: Fragment Assembly

Reconstruction: OMG

Thm 4.1 (p 123): if F is substring-free, then for every common superstring S there is a Ham. path P such that S(P) is contained in S
  Fragments are strictly ordered over S: the order of their left endpoints equals the order of their right endpoints (otherwise one fragment would be a substring of another)
  The path visits the fragments in the OMG in the same order as they occur in S
  S may contain extra material, so S(P) lies within S

Page 19: Fragment Assembly

Reconstruction: OMG

If S is a shortest common superstring, then S cannot be longer than S(P), which lies within S, so S = S(P)
In other words, a shortest common superstring of the fragment set F corresponds to a maximum-weight Ham. path in the OMG of the substring-free collection F’

Page 20: Fragment Assembly

Reconstruction: OMG

Exercise: think of an algorithm for weeding out substrings from F
Also weed out multi-edges by keeping only the largest-weight edge between any pair of nodes
If the weight on an edge is below a threshold t, treat the weight as 0
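One straightforward answer to the exercise is a quadratic scan from longest to shortest fragment; the thresholding step is a one-line filter. Both functions below are hypothetical sketches that assume exact substring containment and a dictionary of edge weights keyed by (source, destination).

```python
def remove_substrings(fragments):
    """Drop duplicates and fragments contained in another fragment."""
    # Longest first, so every possible container is seen before its substrings.
    ordered = sorted(set(fragments), key=lambda f: (-len(f), f))
    kept = []
    for f in ordered:
        if not any(f in g for g in kept):
            kept.append(f)
    return kept


def apply_threshold(weights, t):
    """Treat every edge weight below t as 0."""
    return {edge: (w if w >= t else 0) for edge, w in weights.items()}


print(remove_substrings(["CGT", "ACCGT", "CGTGC", "ACCGT"]))
print(apply_threshold({("a", "b"): 3, ("b", "a"): 1}, 2))
```

The scan costs O(n^2) substring tests; indexing the fragments (for example with a suffix structure) would be the usual way to speed this up.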

Page 21: Fragment Assembly

Reconstruction: OMG

Greedy algorithm to build a Ham. path (p 125): collect edges from largest to smallest weight, subject to
  (1) no cycle is created (union-find)
  (2) the indegree of each node stays <= 1 (the first node has 0)
  (3) the outdegree of each node stays <= 1 (the last node has 0)
[As given, it does not return a Ham. path. Can you modify it to return one?]
The algorithm is NOT optimal. Example (p 126): it returns weight 3, while the optimal weight is 4
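The three conditions can be sketched as below. This is an illustration, not the book's pseudocode: edges are (weight, source, destination) tuples, the function name is hypothetical, and the toy graph is made up for this note (it is not the example on p 126), although it shows the same 3-versus-4 effect.

```python
def greedy_edges(nodes, edges):
    """Pick edges by decreasing weight while keeping a set of simple paths."""
    parent = {v: v for v in nodes}

    def find(v):
        # Union-find root lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    has_in, has_out, chosen = set(), set(), []
    for w, u, v in sorted(edges, key=lambda e: e[0], reverse=True):
        if u in has_out or v in has_in:
            continue  # conditions (2) and (3): degree limits
        root_u, root_v = find(u), find(v)
        if root_u == root_v:
            continue  # condition (1): the edge would close a cycle
        parent[root_u] = root_v
        has_out.add(u)
        has_in.add(v)
        chosen.append((u, v, w))
    return chosen


edges = [(3, "a", "b"), (2, "c", "b"), (2, "a", "c")]
picked = greedy_edges(["a", "b", "c"], edges)
print(picked, sum(w for _, _, w in picked))
```

Here greedy takes a->b (weight 3), which blocks both weight-2 edges, while the path a->c->b has total weight 4.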

Page 22: Fragment Assembly

Reconstruction: OMG

Subinterval: a fragment that can be embedded within another one in the set
A subinterval-free and repeat-free graph that is connected at level t has a Ham. path that generates the target string

Page 23: Fragment Assembly

Reconstruction: OMG

If a repeat exists in the original string, then the graph will have a cycle
False positive: fragments from two different portions of the target have a t-overlap
If a cycle exists in the graph, then there must be a “false positive” (Thm 4.4, p 129)
  Proof by contradiction: otherwise the subinterval-free fragments could be totally ordered

Page 24: Fragment Assembly

Reconstruction: OMG

If there are no repeats in a subinterval-free graph, then there exists a unique Ham. path
If a cycle exists, it need not come from a repeat

Page 25: Fragment Assembly

Reconstruction: OMG

Example 4.6 (p 130): the greedy algorithm finds the wrong string, but the Ham. path finds the correct one
Greedy does not care about linkage: it optimizes total overlap, aiming at a shortest common superstring
The Ham. path accepts any connection with a t-overlap: it cares about linkage only

Page 26: Fragment Assembly

Parameters in aligning for fragment assembly

Score on a column: traditionally {0, -1, -2} in sum-of-pairs
Entropy: $-\sum_{c} p_c \log p_c$, summed over the alphabet plus the space character, where $p_c$ is the probability of c in the column
  All characters the same: $p_c = 1$, entropy = 0
  For {a, t, c, g, -}, all different: $p_c = 1/5$, entropy = log 5
Entropy measures uniformity alone, which makes it a better metric
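The two extreme cases on the slide can be reproduced numerically. The sketch below assumes p_c is estimated as the relative frequency of c in the column and uses the natural logarithm; the function name is hypothetical.

```python
import math
from collections import Counter


def column_entropy(column):
    """Entropy of one layout column, with p_c taken as relative frequency."""
    total = len(column)
    # sum of p * log(1/p) equals -sum of p * log(p)
    return sum((n / total) * math.log(total / n)
               for n in Counter(column).values())


print(column_entropy("AAAAA"))   # all characters the same
print(column_entropy("ATCG-"))   # all five symbols different
print(math.log(5))
```

A column with full agreement scores 0 and a maximally mixed column scores log 5, so lower entropy means a more uniform column.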

Page 27: Fragment Assembly

Parameters in aligning for fragment assembly

Coverage: by how many fragments is each column “covered”? (average, min, max)
This is different from the concept of t-overlap
If a column (of the target) has coverage 0, then the layout is disconnected
Expecting coverage > 1 for all columns conflicts with requiring a subinterval-free collection
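Per-column coverage is easy to compute once each fragment's position in the layout is known. The sketch below assumes fragments are given as half-open (start, end) column intervals; the function name and the intervals are hypothetical.

```python
def coverage(intervals, target_len):
    """Number of fragments covering each column 0..target_len-1."""
    # Difference array: +1 where a fragment starts, -1 just past its end.
    diff = [0] * (target_len + 1)
    for start, end in intervals:
        diff[start] += 1
        diff[end] -= 1
    cov, running = [], 0
    for delta in diff[:target_len]:
        running += delta
        cov.append(running)
    return cov


cov = coverage([(0, 5), (1, 7), (2, 9)], 9)
print(cov, min(cov), max(cov), sum(cov) / len(cov))
print("disconnected" if 0 in cov else "connected")
```

A zero anywhere in the result marks a column that no fragment reaches, which is the disconnected case on the slide.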

Page 28: Fragment Assembly

Parameters in aligning for fragment assembly

Coverage is not enough; we also need good linkage. Example: p 133
The Ham. path algorithm provides that

Page 29: Fragment Assembly

Steps in assembly

Step 1: Overlap finding
  Approximate: deletions, insertions, and replacements are allowed
  Done by a semi-global DP algorithm with an appropriate end-gap penalty
  Computed pairwise over all fragments and their reverse complements
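A suffix-prefix version of the semi-global DP can be sketched as follows. The scoring values (match +1, mismatch -1, gap -2) and the function name are assumptions made for this illustration; the unused prefix of the first fragment and the unused suffix of the second are not charged.

```python
def best_overlap(a, b, match=1, mismatch=-1, gap=-2):
    """Best-scoring alignment of a suffix of a with a prefix of b.

    Returns (score, j), where j is the length of the prefix of b used.
    """
    n = len(b)
    # Row 0: a prefix of b aligned against nothing costs gaps.
    prev = [j * gap for j in range(n + 1)]
    for i in range(1, len(a) + 1):
        curr = [0] + [0] * n  # column 0 is free: skip any prefix of a
        for j in range(1, n + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    # a must be consumed to its end; b may stop after any prefix.
    j_best = max(range(n + 1), key=lambda j: prev[j])
    return prev[j_best], j_best


print(best_overlap("ACCGT", "CGTGC"))
print(best_overlap("AAAA", "CCCC"))
```

Running this for every ordered pair drawn from the fragments and their reverse complements would yield the edge weights for the next step.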

Page 30: Fragment Assembly

Steps in assembly

Step 2: Construct the OMG over (F union F-bar) for the fragment set F, where F-bar is the set of reverse complements (after eliminating substrings?)
  Construct a Hamiltonian path in this graph
  Cycles and unbalanced coverage may mean repeats

Page 31: Fragment Assembly

Steps in assembly

Step 3: Fine-tune the multiple alignment to get a consensus target
  Manual or algorithmic
  Examples on p 137-138