DNA Fragment Assembly


CIS 667, Spring 2004, February 18

Objectives

• The problem: DNA Fragment Assembly
  • The ideal case
  • The complications
• Models
  • The Shortest Common Superstring
  • Reconstruction
  • Multicontig
• A greedy algorithm
• Heuristics

The Problem

• Assumption: we know the approximate length of the target sequence
• The problem: given a set of fragments from a DNA molecule, we want to deduce the whole sequence of the DNA
  • We determine only one of the strands of the original molecule

The ideal case

Input:
1. The set of fragments: ACCGT, CGTGC, TTAC, TACCGT
2. Total length: approximately 10 bp

Output:
  _ _ A C C G T _ _
  _ _ _ _ C G T G C
  T T A C _ _ _ _ _
  _ T A C C G T _ _
  T T A C C G T G C (consensus by majority of votes)
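The majority vote over the columns of a layout can be sketched in a few lines of Python. This is only an illustration: the function name `consensus`, the use of "_" for uncovered positions, and the tie-breaking rule (first character seen wins) are assumptions of the sketch, not part of the slides.

```python
from collections import Counter

def consensus(layout):
    """Column-wise majority vote; '_' marks a column a fragment does not cover."""
    width = max(len(row) for row in layout)
    result = []
    for col in range(width):
        votes = Counter(row[col] for row in layout
                        if col < len(row) and row[col] != "_")
        if votes:  # columns that no fragment covers are skipped
            result.append(votes.most_common(1)[0][0])
    return "".join(result)

# The layout of the ideal case, one string per fragment.
ideal = ["__ACCGT__", "____CGTGC", "TTAC_____", "_TACCGT__"]
# The same layout with one substitution (A -> G) in the last fragment.
substituted = ["__ACCGT__", "____CGTGC", "TTAC_____", "_TGCCGT__"]

print(consensus(ideal))
print(consensus(substituted))
```

With enough coverage, a single substitution is outvoted by the other fragments in its column, so both layouts yield the same consensus.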

Complications

1. The real problem instance is very large
2. Errors
   • substitutions
   • insertions
   • deletions
   • chimeras
3. Unknown orientation of the fragments
4. Repeated regions
   • cause ambiguity in sequencing
5. Lack of coverage
   • causes gaps

Errors: Substitution

Input:
1. The set of fragments: ACCGT, CGTGC, TTAC, TGCCGT (substitution in the last fragment)

Output:
  _ _ A C C G T _ _
  _ _ _ _ C G T G C
  T T A C _ _ _ _ _
  _ T G C C G T _ _
  T T A C C G T G C (consensus by majority of votes)

Errors: Insertion

Input:
1. The set of fragments: ACCGT, CAGTGC (insertion), TTAC, TACCGT

Output:
  _ _ A C C * G T _ _
  _ _ _ _ C A G T G C
  T T A C _ * _ _ _ _
  _ T A C C * G T _ _
  T T A C C * G T G C (consensus by majority of votes)

Errors: Chimeras

Input:
1. The set of fragments: ACCGT, CGTGC, TTAC, TACCGT, TTATGC
2. Total length: 10 bp

Output:
  _ _ A C C G T _ _
  _ _ _ _ C G T G C
  T T A C _ _ _ _ _
  _ T A C C G T _ _
  T T A C C G T G C (consensus)

  T T A _ _ _ T G C (the chimeric fragment TTATGC)

A chimera arises when two regular fragments from distinct parts of the target molecule join end-to-end

Remedy: recognize them before use!

Repeated Regions

• Unknown orientation with no errors
• Unknown orientation with errors
• Repeated regions cause ambiguity

  P X Q X R X S
  P X R X Q X S

• Direct repeat
• More complex are inverted repeats: repeated regions in opposite strands

  P X Q Y R X S Y
  P X S Y R X Q Y

Lack of coverage

• causes formation of gaps
• compute the mean coverage: add up the lengths of all the fragments and divide by the target length
• insufficient coverage is remedied by sampling more fragments
• How many fragments do I need?
• Assume
  • all fragments have the same length l
  • two fragments must share at least t bases for their overlap to be considered safe
  • n is the number of fragments
  • T is the target length

Expected number of apparent contigs: p = n e^{-n(l-t)/T}
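The two quantities on this slide are easy to evaluate numerically. A minimal sketch, assuming the symbols n, l, t and T as defined above; the function names and the sample numbers (a 50,000 bp target, 500 bp fragments, a 50 bp safe overlap) are hypothetical.

```python
import math

def mean_coverage(fragment_lengths, target_length):
    """Sum of the fragment lengths divided by the target length."""
    return sum(fragment_lengths) / target_length

def expected_apparent_contigs(n, l, t, T):
    """p = n * e^(-n(l-t)/T) for n fragments of length l, safe overlap t, target length T."""
    return n * math.exp(-n * (l - t) / T)

# Hypothetical project: how the expected number of contigs changes with n.
for n in (100, 300, 1000):
    cov = mean_coverage([500] * n, 50_000)
    p = expected_apparent_contigs(n, 500, 50, 50_000)
    print(f"n={n:5d}  mean coverage={cov:5.1f}  expected contigs={p:8.2f}")
```

The count first grows with n (more fragments, more isolated pieces) and then falls as overlaps start joining the pieces together.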

Shortest Common Superstring

Input: A collection F of strings
Output: A shortest possible string S such that, for every f ∈ F, S is a superstring of f

Example: F = {ATG, TGC, GCC}, S = ATGCC

Question: Is it the shortest?
Observe: u = ATG and v = GCC overlap in G, and TGC is a substring of the result
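For a handful of fragments the question "is it the shortest?" can be answered by exhaustive search. The sketch below tries every ordering of the fragments and merges neighbours with their largest overlap. It is only an illustration: the function names are assumptions, the input is assumed to be non-empty, and the running time grows factorially, so it is not a practical assembler.

```python
from itertools import permutations

def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def merge_in_order(order):
    """Superstring obtained by overlapping the fragments in the given order."""
    s = order[0]
    for nxt in order[1:]:
        s += nxt[overlap(s, nxt):]
    return s

def shortest_superstring(fragments):
    # Fragments contained in another fragment never lengthen the answer.
    frags = [f for f in set(fragments)
             if not any(f != g and f in g for g in fragments)]
    best = None
    for order in permutations(sorted(frags)):
        s = merge_in_order(order)
        if best is None or len(s) < len(best):
            best = s
    return best

print(shortest_superstring(["ATG", "TGC", "GCC"]))
```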

Shortest Common Superstring

Is it a good problem?

Advantages:
• The problem finds the PERFECT superstring
• Good for most ideal cases

Disadvantages:
• The problem does not deal with errors
• Good only in some ideal cases
  • even with no errors and known orientation, it fails in the presence of repeats
  • repeated identical copies get absorbed in the search for the SHORTEST superstring, which produces an assembly of uneven coverage
• It does not consider lack of coverage or the size of the target
• NP-hard

Reconstruction

Objective: We want to consider errors and unknown orientation

Substring Edit Distance

• d_s(a,b) = min_{s ∈ S(b)} d(a,s), where S(b) is the set of substrings of b and d is the edit distance
• one unit is charged for each insertion, deletion, or substitution
• no charge for deletions at the extremities of the second sequence

Example: u = CGATGT, v = AACTAATGTGC

  _ _ C G A * T G T _ _
  A A C T A A T G T G C

d_s(u,v) = 2

• A string f is an approximate substring of S at error level ε (with 0 ≤ ε ≤ 1) when d_s(f,S) ≤ ε|f|
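The substring edit distance can be computed with the usual edit-distance table, modified so that starting and ending anywhere in the second string is free. A minimal sketch under that standard formulation; the function names are assumptions of the sketch.

```python
def substring_edit_distance(a, b):
    """d_s(a, b): minimum edit distance between a and any substring of b."""
    # prev[j]: best cost of aligning the characters of a read so far
    # with some substring of b that ends at position j.
    prev = [0] * (len(b) + 1)             # nothing of a read: free start anywhere in b
    for i in range(1, len(a) + 1):
        cur = [i] + [0] * len(b)          # i characters of a against nothing
        for j in range(1, len(b) + 1):
            cur[j] = min(prev[j - 1] + (a[i - 1] != b[j - 1]),  # match or substitution
                         prev[j] + 1,                            # character of a unmatched
                         cur[j - 1] + 1)                         # extra character of b
        prev = cur
    return min(prev)                      # free end anywhere in b

def is_approximate_substring(f, s, eps):
    """True when d_s(f, s) <= eps * |f|."""
    return substring_edit_distance(f, s) <= eps * len(f)

print(substring_edit_distance("CGATGT", "AACTAATGTGC"))
```

For the slide's example this gives 2, so u is an approximate substring of v only at error levels of at least 2/6.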

Reconstruction

Input: A collection F of strings and an error tolerance ε with 0 ≤ ε ≤ 1

Output: A shortest possible string S such that, for every f ∈ F, we have

  min(d_s(f, S), d_s(f̄, S)) ≤ ε|f|

where f̄ is the reverse complement of f

Advantage:
• takes into account errors and unknown orientation

Disadvantages:
• It is an NP-hard problem
• It does not model repeats
• It does not consider lack of coverage or the size of the target

Multicontig

Objective: We want to consider internal linkage

No special assumptions except:
• for known orientation, a fragment and its reverse complement are not both present in the collection

• We want to have good linkage (overlap between fragments)
  • An overlap is a link if it is not (properly) contained in a bigger fragment
  • The smallest size of a link in a layout is called the weakest link
  • A layout is a t-contig if its weakest link has size at least t
  • We partition F into the minimum number of collections, each of which admits a t-contig

Multicontig

Idea: Let's partition F into the minimum number of t-contigs!

Example: F = {GTAG, TAATG, TGTAA}

• for t=3: F1 = {TAATG, TGTAA} and F2 = {GTAG}
• for t=2: the same partition, F1 = {TAATG, TGTAA} and F2 = {GTAG}, now has two solutions
  1. TGTAA followed by TAATG (link TAA, size 3)
  2. TAATG followed by TGTAA (link TG, size 2)
• for t=1: we have the desired solution (the minimum), F1 = {TAATG, TGTAA, GTAG}

For errors, we use the consensus of the multi-alignment and insist that the edit distance between each fragment and the consensus be small

Multicontig

Input: A collection F of strings, an integer t ≥ 0, and an error tolerance ε with 0 ≤ ε ≤ 1

Output: A partition of F into the minimum number of subcollections C_i, 1 ≤ i ≤ k, such that every C_i admits a t-contig with an ε-consensus

Advantages:
• takes into account errors and unknown orientation
• takes into account internal linkage of the fragments
  • the answer is formed by several contigs

Disadvantages:
• It is an NP-hard problem even in the simplest case of no errors and known orientation
  • It contains as a special case finding a Hamiltonian path in a restricted class of graphs
• It has no provision to use information on the approximate size of the target

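Overlap Multigraph

t-Overlap: a and b overlap in t characters when

  suffix(a,t) = prefix(b,t), or
  (a κ t) b = a (t κ b), or
  (|a|-t) κ a = b κ (|b|-t)

where s κ k denotes s with its last k characters removed and k κ s denotes s with its first k characters removed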
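The first form of the condition translates directly into code. A minimal sketch, assuming fragments are plain strings; the function names are hypothetical, and for brevity only overlaps with t ≥ 1 are listed. The multigraph gets one edge per ordered pair of fragments and per value of t that satisfies the condition, so the same pair can be joined by several edges.

```python
def has_t_overlap(a, b, t):
    """True when the last t characters of a equal the first t characters of b."""
    return 0 < t <= min(len(a), len(b)) and a[-t:] == b[:t]

def overlap_multigraph(fragments):
    """All edges (a, b, t) such that a and b have a t-overlap, t >= 1."""
    edges = []
    for i, a in enumerate(fragments):
        for j, b in enumerate(fragments):
            if i == j:
                continue
            for t in range(1, min(len(a), len(b)) + 1):
                if has_t_overlap(a, b, t):
                    edges.append((a, b, t))
    return edges

for edge in overlap_multigraph(["CTAAAG", "TACGG", "GGACAG", "GCCC"]):
    print(edge)
```

TACGG and GGACAG are joined twice, by an edge of weight 1 and an edge of weight 2, which is why the structure is a multigraph rather than a simple graph.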

Idea: Why don't we search for a path in the overlap multigraph that covers all the vertices?

[Figure: overlap multigraph on the fragments CTAAAG, TACGG, GGACAG and GCCC; the edge labels 2 and 1 are shown]

Theoretical results

Theorem: the total length of A (the set of fragments) is

  ||A|| = w(P) + |S(P)|

where
• ||A|| = Σ_{a ∈ A} |a|
• w(P) is the weight of the path P
• |S(P)| is the length of the superstring derived from P

To convince yourself, read the proof in the book

Other theoretical results: Looking for the shortest common superstring is the same as looking for the Hamiltonian path of maximum weight in a directed multigraph.
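The identity can be checked on small examples. The sketch below walks a path of fragments, adds up the overlaps used along the way, and builds the superstring. The function names are assumptions, and the fragments are assumed to be substring-free (no fragment contained in another).

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def superstring_from_path(path):
    """Return (S(P), w(P)) for a path given as an ordered list of fragments."""
    s, weight = path[0], 0
    for prev, nxt in zip(path, path[1:]):
        k = overlap(prev, nxt)
        weight += k
        s += nxt[k:]
    return s, weight

for path in (["ATG", "TGC", "GCC"], ["GCC", "ATG", "TGC"]):
    s, w = superstring_from_path(path)
    total = sum(len(a) for a in path)
    print(path, s, f"||A||={total}", f"w(P)={w}", f"|S(P)|={len(s)}")
```

Since ||A|| is fixed by the input, a heavier path always gives a shorter superstring, which is the reason the two problems coincide.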

The Greedy Methodology

No polynomial-time algorithm is known for NP-hard problems, but we can look for approximate solutions in reasonable time

To apply a greedy methodology:
1. the problem must show optimal substructure
   – A problem exhibits optimal substructure if an optimal solution to the problem contains within it optimal solutions to subproblems
2. the optimal solution is reached by taking the best "local" choice

Overlap Graph

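An overlap graph keeps, between each ordered pair of fragments, only the edge of maximum weight

[Figure: overlap graph on the fragments CTAAAG, TACGA, GACA and ACCC; the edge labels 2 and 1 are shown]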

The Greedy Algorithm

input: weighted di-graph OG(F) with n vertices
output: Hamiltonian path in OG(F)

// Initialize
for i ← 1 to n do
    in[i] ← 0      // how many selected edges enter i
    out[i] ← 0     // how many selected edges exit i
    MakeSet(i)

// Process
Sort the edges by weight, heaviest first
for each edge (f,g) in this order do
    // test for acceptance
    if in[g] = 0 and out[f] = 0 and FindSet(f) ≠ FindSet(g)
        select (f,g)
        in[g] ← 1
        out[f] ← 1
        Union(FindSet(f), FindSet(g))
    if there is only one component
        break
return selected edges
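A runnable sketch of the pseudocode above in Python. The names `overlap` and `greedy_path` are assumptions of this sketch, the edge weights are computed from the fragment strings rather than read from a given graph, and weight-0 edges are included so that a path covering all vertices always exists.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_path(fragments):
    """Select edges heaviest first; return them as (f, g, weight) triples."""
    n = len(fragments)
    parent = list(range(n))               # MakeSet(i) for every vertex

    def find(i):                          # FindSet, with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    has_in = [False] * n                  # in[i]: a selected edge enters i
    has_out = [False] * n                 # out[i]: a selected edge exits i
    edges = [(overlap(fragments[f], fragments[g]), f, g)
             for f in range(n) for g in range(n) if f != g]
    edges.sort(key=lambda e: -e[0])       # heaviest first, ties in generation order
    selected = []
    for w, f, g in edges:
        if not has_in[g] and not has_out[f] and find(f) != find(g):
            selected.append((fragments[f], fragments[g], w))
            has_in[g] = has_out[f] = True
            parent[find(f)] = find(g)     # Union
        if len(selected) == n - 1:        # only one component is left
            break
    return selected

for edge in greedy_path(["CTAAAG", "TACGA", "GACA", "ACCC"]):
    print(edge)
```

The union-find test is what prevents the selected edges from closing a cycle; the in/out flags keep every vertex on at most one incoming and one outgoing edge.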

A graph where "greedy" fails

• F = {GCAAAG, AGTA, TACGA}

[Figure: overlap graph on the fragments GCAAAG, TACGA and AGTA; an edge label 2 is shown]

We order the edges by weight:
  (AGAT, GCAAAG) = 3
  (GCAAAG, AGTA) = 2
  (AGTA, TACGA) = 2

The algorithm will first choose (AGAT, GCAAAG) = 3 and is then forced to select an edge with weight 0 to complete the path.

Instead the solution should be
  (GCAAAG, AGTA) = 2
  (AGTA, TACGA) = 2
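The same effect can be reproduced on strings whose overlaps are easy to recompute by hand. Because the weight-3 edge listed above involves a string (AGAT) that is not in F, the sketch below uses a different, purely illustrative set, {GCC, ATGC, TGCAT}, which is an assumption of this example and not the slide's data. Greedy is written here in its string-merging form (repeatedly merge the pair with the largest overlap) and compared with an exhaustive search over all orderings.

```python
from itertools import permutations

def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def merge(order):
    """Superstring and total overlap weight for a fixed ordering."""
    s, total = order[0], 0
    for prev, nxt in zip(order, order[1:]):
        k = overlap(prev, nxt)
        total += k
        s += nxt[k:]
    return s, total

def greedy(fragments):
    """Repeatedly commit to the heaviest remaining overlap."""
    frags, total = list(fragments), 0
    while len(frags) > 1:
        k, a, b = max(((overlap(a, b), a, b)
                       for a in frags for b in frags if a != b),
                      key=lambda e: e[0])
        frags.remove(a)
        frags.remove(b)
        frags.append(a + b[k:])
        total += k
    return frags[0], total

F = ["GCC", "ATGC", "TGCAT"]   # illustrative fragments, not from the slide
greedy_result = greedy(F)
best_result = max((merge(p) for p in permutations(F)), key=lambda r: r[1])
print("greedy:", greedy_result)
print("best:  ", best_result)
```

Greedy commits to the weight-3 overlap between ATGC and TGCAT and is then left with a weight-0 join, while the ordering TGCAT, ATGC, GCC uses two weight-2 overlaps and gives a shorter superstring.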

Observations

Local optimal decisions do not always work. Can we do any better?

Use some heuristics.

Issues:
• Scoring
• Coverage
• Linkage

Heuristics

• Scoring
  • Uniformity is good, variability is bad.
  • Compute the entropy of a column: the entropy is a measure of the chaos in a column. There are 5 possible characters: A, T, C, G, space

      E = -Σ_c p_c log p_c

  • E = 0 if p_c = 1 for a character; E = log 5 if each p_c = 1/5
  • To measure the uniformity we want a low entropy per column
• Coverage
  • minimum, maximum, or average coverage
  • if the coverage reaches 0 for a column i, we do not have a connected layout
  • if several columns have zero coverage, any permutation of the intervening regions (the contigs) is acceptable
  • Coverage gives confidence to the consensus
• Linkage
  • High coverage with no links is not good. Overlap is required.
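Both the entropy score and the per-column coverage are simple to compute from a layout. A minimal sketch: the function names, the use of "-" for a space inside a column, and the use of "_" for positions a fragment does not cover are assumptions of the sketch.

```python
import math
from collections import Counter

def column_entropy(column):
    """E = -sum_c p_c log p_c over the characters that occur in the column."""
    counts = Counter(column)
    total = len(column)
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def coverage(layout, uncovered="_"):
    """Number of fragments covering each column of the layout."""
    width = max(len(row) for row in layout)
    return [sum(1 for row in layout if col < len(row) and row[col] != uncovered)
            for col in range(width)]

print(column_entropy("AAAA"))    # a perfectly uniform column
print(column_entropy("ATCG-"))   # all five characters equally likely
layout = ["__ACCGT__", "____CGTGC", "TTAC_____", "_TACCGT__"]
cov = coverage(layout)
print(cov, "min:", min(cov), "max:", max(cov), "mean:", sum(cov) / len(cov))
```

A minimum coverage of 0 would signal a disconnected layout; here every column is covered by at least one fragment.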

More Observations

Local optimal decisions do not always work. Can we do any better?
Use some heuristics.

• Assembly in practice consists of:
  1. Finding overlaps
  2. Building the layout
  3. Computing the consensus

Advantages:
• We treat each problem separately.

Disadvantages:
• It becomes difficult to understand the relationship between the input and the final output

Heuristics

• Finding overlaps
  • use a dynamic programming approach with a score system such as
    • 1 for matches
    • -1 for mismatches
    • -2 for spaces
  • Do not charge for spaces after the first sequence or before the second one.
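The scoring scheme above corresponds to a semiglobal alignment in which a suffix of the first sequence is aligned with a prefix of the second. A minimal sketch that returns only the best score; the function name and its default parameters are assumptions, with the defaults taken from the scores on the slide.

```python
def best_overlap_score(a, b, match=1, mismatch=-1, space=-2):
    """Best score of aligning some suffix of a with some prefix of b."""
    # prev[j]: best score with the first j characters of b aligned against
    # a suffix of the characters of a read so far.
    prev = [space * j for j in range(len(b) + 1)]   # spaces before a are charged
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)                    # skipping a prefix of a is free
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur[j] = max(diag, prev[j] + space, cur[j - 1] + space)
        prev = cur
    return max(prev)                                # the unused tail of b is free

print(best_overlap_score("TACGA", "GACA"))
print(best_overlap_score("AAAA", "TTTT"))
```

A score of 0 means that the best choice is not to overlap the two fragments at all; a full implementation would also trace back through the table to recover the alignment itself.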

Heuristics

• Ordering fragments
  • there is no algorithm simple and general enough

Considerations:
• Use the set DF = F ∪ F̄, the fragments together with their reverse complements
• If f = uv and g = wx, then ḡ = x̄w̄ and f̄ = v̄ū
• if the end of f is approximately the same as the beginning of g, we can expect that whatever criterion is used to assess the similarity between f and g, the same criterion will apply to their reverse complements
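The symmetry between a fragment pair and its reverse complements can be seen on exact overlaps. A minimal sketch, assuming fragments over the alphabet A, C, G, T; the function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(f):
    """Complement every base, then reverse the string."""
    return f.translate(COMPLEMENT)[::-1]

def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def with_reverse_complements(fragments):
    """The set DF: every fragment together with its reverse complement."""
    return sorted(set(fragments) | {reverse_complement(f) for f in fragments})

f, g = "TACGA", "GACA"
print(overlap(f, g))
print(overlap(reverse_complement(g), reverse_complement(f)))
print(with_reverse_complements([f, g]))
```

An overlap from f to g shows up again as an overlap of the same size from the reverse complement of g to the reverse complement of f, so the overlaps in DF come in mirrored pairs.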

Finding overlaps

• Finding a good ordering of the overlapping fragments means finding a directed path in the overlap graph
• Both strands are constructed simultaneously
• Contained fragments are not essential in the path
• A disconnected graph indicates lack of coverage
• The presence of cycles indicates repeats
• Unusually high coverage indicates possible repeats
• The presence of reverse complement cycles indicates inverted repeats

Alignment and Consensus

• Use the minimal sum of the distances

Suppose we have
  f = CATAGTC
  g = TAACTAT
  h = AGACTATCC

Two semiglobal alignments for f and g are:

  First alignment          Second alignment
  C A T A G T C _ _ _      C A T A G T C _ _ _
  _ _ T A A _ C T A T      _ _ T A _ A C T A T

Adding h to the second alignment and taking the consensus S:

  C A T A G T C _ _ _ _ _    f
  _ _ T A _ A C T A T _ _    g
  _ _ _ A G A C T A T C C    h
  C A T A G A C T A T C C    S

• If we use the second alignment: d_s(f, S) = 1, d_s(g, S) = 1, d_s(h, S) = 0, and d_s(f, S) + d_s(g, S) + d_s(h, S) = 2
• If we use the first alignment and A is chosen for column 6: d_s(f, S) = 1, d_s(g, S) = 2, d_s(h, S) = 0, and d_s(f, S) + d_s(g, S) + d_s(h, S) = 3
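The two sums can be recomputed by counting, for each row of a layout, the columns where the row disagrees with the consensus. This is the distance induced by the alignment itself, which is what the figures on this slide report. In the sketch, the function name and the conventions ("_" outside a fragment, "-" for a space inside it) are assumptions.

```python
def induced_distance(row, consensus, outside="_"):
    """Columns inside the row's span where the row differs from the consensus."""
    covered = [i for i, ch in enumerate(row) if ch != outside]
    first, last = covered[0], covered[-1]
    return sum(row[i] != consensus[i] for i in range(first, last + 1))

S = "CATAGACTATCC"
f = "CATAGTC_____"
g_first = "__TAA-CTAT__"    # g placed as in the first alignment
g_second = "__TA-ACTAT__"   # g placed as in the second alignment
h = "___AGACTATCC"

sum_first = sum(induced_distance(r, S) for r in (f, g_first, h))
sum_second = sum(induced_distance(r, S) for r in (f, g_second, h))
print("first alignment: ", sum_first)
print("second alignment:", sum_second)
```

The second placement of g costs one unit less, which is why it is preferred when the sum of the distances is minimized.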

A Linked List of Bases

• Sometimes we know what is best only later. Is there a structure that helps us?
• Use a Linked List of Bases
  • matched bases are unified in one node
  • unmatched bases are left separate
• Technique: Traverse this graph in topological order.

[Figure: linked list of bases G T C A T A C T A T, with a separate node A]

Conclusions

• The models fail to address all the issues involved in the problem
• The effective real problem is NP-hard
• Approximation gives us some help but fails in some cases
• Heuristics help, and the problem is broken into 3 smaller problems:
  1. finding overlaps
  2. building the layout
  3. computing the consensus
• Are we sure there is nothing else to do?
  • We will look next week at the smaller problem of comparing only two sequences instead of many. Will we find something better?
