DNA Fragment Assembly

DNA Fragment Assembly CIS 667 Spring 2004 February 18


Page 1: DNA Fragment Assembly

DNA Fragment Assembly

CIS 667 Spring 2004
February 18

Page 2: DNA Fragment Assembly

Objectives

• The problem: DNA Fragment Assembly
  - The ideal case
  - The complications
• Models
  - The Shortest Common Superstring
  - Reconstruction
  - Multicontig
• A greedy algorithm
• Heuristics

Page 3: DNA Fragment Assembly

The Problem

• Assumption: we know the approximate length of the target sequence.

• The problem: given a set of fragments from a DNA molecule, we want to deduce the whole sequence of the DNA. We determine only one of the strands of the original molecule.

Page 4: DNA Fragment Assembly

The ideal case

Input:
1. The set of fragments: ACCGT, CGTGC, TTAC, TACCGT
2. Total length: approximately 10 bp

Output:
_ _ A C C G T _ _
_ _ _ _ C G T G C
T T A C _ _ _ _ _
_ T A C C G T _ _

T T A C C G T G C (consensus by majority of votes)
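The majority vote on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `majority_consensus` and the convention that `_` marks a position not covered by a fragment are assumptions made here, not part of the lecture.

```python
from collections import Counter

def majority_consensus(rows, space="_"):
    """Column-wise majority vote over layout rows of equal length.

    Positions marked with `space` are treated as "fragment absent" and
    do not vote. Ties are broken by first occurrence in the column.
    """
    consensus = []
    for column in zip(*rows):
        votes = Counter(ch for ch in column if ch != space)
        consensus.append(votes.most_common(1)[0][0] if votes else space)
    return "".join(consensus)

# Layout of the ideal case: one row per fragment.
ideal_layout = [
    "__ACCGT__",
    "____CGTGC",
    "TTAC_____",
    "_TACCGT__",
]
print(majority_consensus(ideal_layout))
```

The same function tolerates a single substitution in one row, because the erroneous base is outvoted by the other fragments covering that column.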

Page 5: DNA Fragment Assembly

Complications

1. The real problem instance is very large
2. Errors
  • substitutions
  • insertions
  • deletions
  • chimeras
3. Unknown orientation of the fragments
4. Repeated regions
  • cause ambiguity in sequencing
5. Lack of coverage
  • causes gaps

Page 6: DNA Fragment Assembly

Errors: Substitution

Input:
1. The set of fragments: ACCGT, CGTGC, TTAC, TGCCGT (substitution)
2. Total length: approximately 10 bp

Output:
_ _ A C C G T _ _
_ _ _ _ C G T G C
T T A C _ _ _ _ _
_ T G C C G T _ _

T T A C C G T G C (consensus by majority of votes)

Page 7: DNA Fragment Assembly

Errors: Insertion

Input:
1. The set of fragments: ACCGT, CAGTGC (insertion), TTAC, TACCGT
2. Total length: approximately 10 bp

Output:
_ _ A C C * G T _ _
_ _ _ _ C A G T G C
T T A C _ * _ _ _ _
_ T A C C * G T _ _

T T A C C * G T G C (consensus by majority of votes)

Page 8: DNA Fragment Assembly

Errors: Chimeras

Input:
1. The set of fragments: ACCGT, CGTGC, TTAC, TACCGT, TTATGC
2. Total length: approximately 10 bp

Output:
_ _ A C C G T _ _
_ _ _ _ C G T G C
T T A C _ _ _ _ _
_ T A C C G T _ _

T T A C C G T G C (consensus)

T T A _ _ _ T G C (chimera)

A chimera arises when two regular fragments from distinct parts of the target molecule join end-to-end.

Remedy: recognize them before use!

Page 9: DNA Fragment Assembly

Repeated Regions

• Unknown orientation with no errors
• Unknown orientation with errors
• Repeated regions cause ambiguity

P X Q X R X S

P X R X Q X S

Page 10: DNA Fragment Assembly

Direct repeat

• Direct repeats
• Inverted repeats are more complex: the repeated regions lie on opposite strands

P X Q Y R X S Y

P X S Y R X Q Y

Page 11: DNA Fragment Assembly

Lack of coverage

• causes the formation of gaps
• compute the mean coverage: add up the lengths of all the fragments and divide by the target length
• insufficient coverage is remedied by sampling more fragments
• How many fragments do I need?
• Assume
  - all fragments have the same length l
  - t is the minimum size, in bases, of a safe overlap
  - n is the number of fragments
  - T is the target length

Expected number of apparent contigs: $p = n e^{-n(l-t)/T}$
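As a rough illustration of the two quantities on this slide, the sketch below computes the mean coverage and evaluates the apparent-contig formula, reading the formula as n times e raised to -n(l-t)/T with l the common fragment length. The function names and the numbers in the example are hypothetical; only the formulas come from the slide.

```python
import math

def mean_coverage(fragment_lengths, target_length):
    """Sum of the fragment lengths divided by the target length."""
    return sum(fragment_lengths) / target_length

def expected_apparent_contigs(n, l, t, T):
    """p = n * exp(-n (l - t) / T), with the slide's symbols n, l, t, T."""
    return n * math.exp(-n * (l - t) / T)

# Hypothetical instance: 500 fragments of 500 bases, safe overlap of
# 50 bases, target of 50,000 bases.
coverage = mean_coverage([500] * 500, 50_000)
contigs = expected_apparent_contigs(n=500, l=500, t=50, T=50_000)
print(f"mean coverage: {coverage:.1f}")
print(f"expected apparent contigs: {contigs:.2f}")
```

Trying larger values of n in this sketch shows the expected number of apparent contigs eventually shrinking, which is the quantitative version of "sample more fragments".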

Page 12: DNA Fragment Assembly

Shortest Common Superstring

Input: A collection F of strings
Output: A shortest possible string S such that for every f ∈ F, S is a superstring of f.

Example: F = {ATG, TGC, GCC}, S = ATGCC

Question: Is it the shortest?
Observe: u = ATG and v = GCC overlap in G, and TGC is a substring of the resulting string ATGCC.
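For a collection this small, the question "is it the shortest?" can be answered by brute force. The sketch below is an assumption-laden illustration, not an algorithm from the lecture: it drops fragments contained in other fragments, tries every ordering, and merges consecutive fragments on their longest exact suffix-prefix overlap. It is exponential in the number of fragments and ignores errors and orientation.

```python
from itertools import permutations

def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def shortest_common_superstring(fragments):
    """Exhaustive search; only sensible for a handful of fragments."""
    unique = sorted(set(fragments))
    # A fragment contained in another one never lengthens the answer.
    frags = [f for f in unique if not any(f != g and f in g for g in unique)]
    if not frags:
        return ""
    best = None
    for order in permutations(frags):
        s, prev = order[0], order[0]
        for nxt in order[1:]:
            s += nxt[overlap(prev, nxt):]
            prev = nxt
        if best is None or len(s) < len(best):
            best = s
    return best

print(shortest_common_superstring(["ATG", "TGC", "GCC"]))
```

On the slide's example the search confirms that no superstring shorter than five characters exists.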

Page 13: DNA Fragment Assembly

Shortest Common Superstring

Is it a good problem?

Advantages:
• The problem finds the PERFECT superstring
• Good for most ideal cases

Disadvantages:
• The problem does not deal with errors; it is good only in some ideal cases
• Even with no errors and known orientation, it fails in the presence of repeats
  - repeated identical copies get absorbed in the search for the SHORTEST superstring, which produces an assembly of uneven coverage
• It does not consider lack of coverage or the size of the target
• It is NP-hard

Page 14: DNA Fragment Assembly

Reconstruction

Objective: We want to consider errors and unknown orientation

Substring Edit Distance

$d_s(a, b) = \min_{s \in S(b)} d(a, s)$, where $S(b)$ is the set of substrings of b and d is the edit distance
• one unit is charged for each insertion, deletion, or substitution
• no charge for deletions at the extremities of the second sequence

Example: u = CGATGT, v = AACTAATGTGC

_ _ C G A * T G T _ _
A A C T A A T G T G C

$d_s(u, v) = 2$

• A string f is an approximate substring of S at error level $\varepsilon$ (between 0 and 1) when $d_s(f, S) \le \varepsilon |f|$
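The substring edit distance can be computed with the usual edit-distance table, changed so that starting and ending anywhere in the second sequence costs nothing. The following is a minimal sketch under that reading of the definition; the function name is chosen here for illustration.

```python
def substring_edit_distance(a, b):
    """Minimum edit distance between a and any substring of b.

    Unit cost for insertion, deletion, and substitution; the unused
    extremities of b are free.
    """
    # Row 0 is all zeros: the alignment may start at any position of b.
    prev = [0] * (len(b) + 1)
    for i, ca in enumerate(a, start=1):
        curr = [i] + [0] * len(b)
        for j, cb in enumerate(b, start=1):
            curr[j] = min(
                prev[j - 1] + (ca != cb),  # match or substitution
                prev[j] + 1,               # character of a left unmatched
                curr[j - 1] + 1,           # character of b skipped inside the span
            )
        prev = curr
    # Taking the minimum of the last row lets the alignment end anywhere in b.
    return min(prev)

u, v = "CGATGT", "AACTAATGTGC"
print(substring_edit_distance(u, v))
```

With an error level of 0.4, for instance, u would count as an approximate substring of v here, because 2 is at most 0.4 times 6.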

Page 15: DNA Fragment Assembly

Reconstruction

Input: A collection F of strings, an error tolerance $\varepsilon$ with $0 \le \varepsilon \le 1$

Output: A shortest possible string S such that for every f ∈ F we have

$\min(d_s(f, S), d_s(\bar{f}, S)) \le \varepsilon |f|$, where $\bar{f}$ is the reverse complement of f

Advantage:
• takes into account errors and unknown orientation

Disadvantages:
• It is an NP-hard problem
• It does not model repeats
• It does not consider lack of coverage or the size of the target
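The acceptance condition of the Reconstruction model can be checked directly for a candidate string S. The sketch below only verifies the condition; it does not search for a shortest S, which is the hard part. The helper names are illustrative assumptions.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(s):
    """Complement every base, then reverse the string."""
    return s.translate(_COMPLEMENT)[::-1]

def substring_edit_distance(a, b):
    """Minimum unit-cost edit distance between a and any substring of b."""
    prev = [0] * (len(b) + 1)
    for i, ca in enumerate(a, start=1):
        curr = [i] + [0] * len(b)
        for j, cb in enumerate(b, start=1):
            curr[j] = min(prev[j - 1] + (ca != cb), prev[j] + 1, curr[j - 1] + 1)
        prev = curr
    return min(prev)

def is_reconstruction(S, fragments, eps):
    """True if every fragment, in one orientation or the other, is an
    approximate substring of S at error level eps."""
    return all(
        min(substring_edit_distance(f, S),
            substring_edit_distance(reverse_complement(f), S)) <= eps * len(f)
        for f in fragments
    )

S = "TTACCGTGC"
# GCACG is the reverse complement of CGTGC, so orientation is handled.
print(is_reconstruction(S, ["ACCGT", "GCACG", "TTAC", "TACCGT"], eps=0.0))
# TGCCGT carries one substitution; it needs a nonzero tolerance.
print(is_reconstruction(S, ["TGCCGT"], eps=0.2))
```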

Page 16: DNA Fragment Assembly

Multicontig

Objective: We want to consider internal linkage

No special assumptions except: for known orientation, a fragment and its reverse complement are not both present in the collection.

• We want to have good linkage (overlap between fragments)
  - An overlap is a link if it is not (properly) contained in a bigger fragment
  - The smallest size of a link in a layout is called the weakest link
  - A layout is a t-contig if its weakest link has size at least t
  - We partition F into the minimum number of collections that each admit a t-contig

Page 17: DNA Fragment Assembly

Multicontig

Idea: Let's partition F in the minimum number of t-contigs!

Example: F={GTAG, TAATG, TGTAA}

for t=3 F1={TAATG, TGTAA} and F2={GTAG}

for t=2 we have two solutions
1. F1={TAATG, TGTAA} and F2={GTAG}
2. F1={TAATG, TGTAA} and F2={GTAG}

for t=1 we have the desired solution (the minimum)
F1={TAATG, TGTAA, GTAG}

For errors, we use the consensus of the multi-alignment and insist that the edit distance of the fragments be small
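To see where the thresholds t = 3, 2, 1 in the example come from, it helps to list the exact suffix-prefix overlaps between the three fragments. The sketch below is a simplification made for illustration: it assumes known orientation and no errors, and it treats every overlap between consecutive fragments of a chain as a link, without checking the containment condition from the definition.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def weakest_link(order):
    """Smallest overlap between consecutive fragments of a chain
    (needs at least two fragments)."""
    return min(overlap(a, b) for a, b in zip(order, order[1:]))

F = ["GTAG", "TAATG", "TGTAA"]
for a in F:
    for b in F:
        if a != b and overlap(a, b) > 0:
            print(f"{a} -> {b}: {overlap(a, b)}")

print(weakest_link(["TGTAA", "TAATG"]))          # chain of two fragments
print(weakest_link(["TGTAA", "TAATG", "GTAG"]))  # chain of all three
```

Under these assumptions, GTAG joins the other two fragments only through an overlap of a single base, which is why one contig covering all of F appears only at t = 1.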

Page 18: DNA Fragment Assembly

MulticontigInput: A collection F of strings, and an integer t0 and an error

tolerance with 0 1Output: A partition of F in the minimum number of subcollections

Ci. 1i k | every Ci admits a t-contig with an -consensus

Advantage: takes into account errors and unknown orientation take into account internal linkage of the fragments

• the answer is formed by several contigsDisadvantages:

Is an NP-hard problem even in the simplest case of no errors and known orientation• It contains as a special case finding a Hamiltonian

path in a restricted class of graphs It has no provision to use information on the

approximate size of the target

Page 19: DNA Fragment Assembly

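Overlap Multigraph

t-Overlap: fragments a and b have a t-overlap when suffix(a,t) = prefix(b,t), that is, when the last t characters of a equal the first t characters of b. Equivalently, a without its last t characters followed by b is the same string as a followed by b without its first t characters.

Idea: Why don't we search for a path in the overlap multigraph that covers all the vertices?

[Figure: overlap multigraph with vertices CTAAAG, TACGG, GGACG and GCCC; the edge weights 2 and 1 are shown]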
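A small sketch of how the edges of such a multigraph could be enumerated is given below. It is an illustration under stated assumptions: only exact overlaps of positive length are recorded, one edge per value of t, and the edge representation as `(a, b, t)` tuples is chosen here for convenience.

```python
def overlap_multigraph(fragments):
    """Return one edge (a, b, t) for every t >= 1 with suffix(a,t) == prefix(b,t)."""
    edges = []
    for i, a in enumerate(fragments):
        for j, b in enumerate(fragments):
            if i == j:
                continue
            for t in range(1, min(len(a), len(b)) + 1):
                if a[-t:] == b[:t]:
                    edges.append((a, b, t))
    return edges

# Fragments as they appear on the slide.
fragments = ["CTAAAG", "TACGG", "GGACG", "GCCC"]
for a, b, t in overlap_multigraph(fragments):
    print(f"{a} -> {b} (t = {t})")
```

The pair TACGG, GGACG receives two parallel edges, one for t = 1 and one for t = 2, which is what makes this a multigraph rather than a simple graph.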


Page 27: DNA Fragment Assembly

Theoretical results

Theorem: the total length of A (the set of fragments) is

$\|A\| = w(P) + |S(P)|$

where
• $\|A\| = \sum_{a \in A} |a|$
• w(P) is the weight of the path P
• |S(P)| is the length of the superstring derived from P.

To convince yourself, read the proof in the book.

Other theoretical results: looking for the shortest common superstring is the same as looking for the Hamiltonian path of maximum weight in a directed multigraph.
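The identity in the theorem is easy to check numerically on a tiny instance. The sketch below uses the fragments ATG, TGC, GCC from the Shortest Common Superstring slide and, as a simplifying assumption, considers only the heaviest exact overlap between each ordered pair. It checks the identity for every ordering and then picks the ordering of maximum weight.

```python
from itertools import permutations

def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def path_weight(order):
    """w(P): sum of the overlaps along the path."""
    return sum(overlap(a, b) for a, b in zip(order, order[1:]))

def superstring(order):
    """S(P): merge consecutive fragments on their overlap."""
    s = order[0]
    for prev, nxt in zip(order, order[1:]):
        s += nxt[overlap(prev, nxt):]
    return s

A = ["ATG", "TGC", "GCC"]
total = sum(len(a) for a in A)
for order in permutations(A):
    assert total == path_weight(order) + len(superstring(order))

best = max(permutations(A), key=path_weight)
print(best, path_weight(best), superstring(best))
```

Since the total length is fixed, maximizing the path weight and minimizing the superstring length are the same thing, which is the content of the second result on the slide.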

Page 28: DNA Fragment Assembly

The Greedy Methodology

No polynomial-time algorithm is known for NP-hard problems, but we can look for approximate solutions in reasonable time.

To apply a greedy methodology:
1. the problem must show optimal substructure
  - A problem exhibits optimal substructure if an optimal solution to the problem contains within it optimal solutions to subproblems
2. the optimal solution is reached by taking the best "local" choice

Page 29: DNA Fragment Assembly

Overlap Graph

The overlap graph keeps, for each ordered pair of fragments, only the edge of maximum weight.

[Figure: overlap graph with vertices CTAAAG, TACGA, GACA and ACCC; the edge weights 2 and 1 are shown]

The Greedy Algorithm

input: weighted directed graph OG(F) with n vertices
output: Hamiltonian path in OG(F)

// Initialize
for i ← 1 to n do
    in[i] = 0    // how many selected edges enter i
    out[i] = 0   // how many selected edges exit i
    MakeSet(i)

// Process
Sort the edges by weight, heaviest first
for each edge (f, g) in this order do
    // test for acceptance
    if in[g] = 0 and out[f] = 0 and FindSet(f) ≠ FindSet(g)
        select (f, g)
        in[g] ← 1
        out[f] ← 1
        Union(FindSet(f), FindSet(g))
    if there is only one component
        break
return selected edges
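The pseudocode above translates fairly directly into Python. The version below is a sketch with some assumptions spelled out: edge weights are exact suffix-prefix overlaps, zero-weight edges are included so that a path can always be completed, ties are broken by enumeration order, and a small union-find stands in for MakeSet, FindSet, and Union.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_path(fragments):
    """Select edges greedily, heaviest first; returns (f, g, weight) triples."""
    n = len(fragments)
    parent = list(range(n))            # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    edges = [(overlap(fragments[f], fragments[g]), f, g)
             for f in range(n) for g in range(n) if f != g]
    edges.sort(key=lambda e: -e[0])    # stable sort, heaviest first

    has_in, has_out = [False] * n, [False] * n
    selected = []
    for w, f, g in edges:
        # Accept only if g has no incoming edge, f has no outgoing edge,
        # and the edge does not close a cycle.
        if not has_in[g] and not has_out[f] and find(f) != find(g):
            selected.append((fragments[f], fragments[g], w))
            has_in[g] = has_out[f] = True
            parent[find(f)] = find(g)
            if len(selected) == n - 1:  # a single component remains
                break
    return selected

for f, g, w in greedy_path(["CTAAAG", "TACGA", "GACA", "ACCC"]):
    print(f"{f} -> {g} (weight {w})")
```

Sorting dominates the running time here because all ordered pairs are materialized as edges; that is fine for a classroom example but not for a real instance.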

Page 30: DNA Fragment Assembly

A graph where "greedy" fails

• F = {GCAAAG, AGTA, TACGA}

[Figure: overlap graph on GCAAAG, TACGA and AGTA; an edge of weight 2 is shown]

We order the edges by weight:
(AGAT, GCAAAG) = 3
(GCAAAG, AGTA) = 2
(AGTA, TACGA) = 2

The algorithm will first choose (AGAT, GCAAAG) = 3 and is then forced to select an edge of weight 0 to complete the path.

Instead the solution should be
(GCAAAG, AGTA) = 2
(AGTA, TACGA) = 2

Page 31: DNA Fragment Assembly

Observations

Local optimal decisions do not always work. Can we do any better?

Use some heuristics.

Issues:
• Scoring
• Coverage
• Linkage

Page 32: DNA Fragment Assembly

Heuristics

• Scoring
  - Uniformity is good, variability is bad.
  - Compute the entropy of a column; the entropy is the measure of the chaos in a column. There are 5 possible characters: A, T, C, G, space.
    $E = -\sum_c p_c \log p_c$
  - E = 0 if $p_c = 1$ for one character; E = log 5 if each $p_c = 1/5$
  - To measure the uniformity we want a low entropy per column
• Coverage
  - minimum, maximum, or mean coverage
  - if the coverage reaches 0 for a column i, we do not have a connected layout
  - if several columns have zero coverage, any permutation of the intervening regions (the contigs) is acceptable
  - Coverage gives confidence to the consensus
• Linkage
  - High coverage with no links is not good. Overlap is required.
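The entropy score for a single column can be sketched as follows. The natural logarithm is an assumption made here (the slide does not fix the base), and the column is passed as a string of the characters observed in it, with `_` standing for a space.

```python
import math
from collections import Counter

def column_entropy(column):
    """E = -sum_c p_c log p_c over the characters present in the column."""
    total = len(column)
    # p * log(1/p) equals -p * log(p) and avoids printing a negative zero.
    return sum((k / total) * math.log(total / k)
               for k in Counter(column).values())

print(column_entropy("AAAA"))    # perfectly uniform column
print(column_entropy("ATCG_"))   # all five characters equally likely
print(column_entropy("AAAT"))    # one dissenting fragment
```

A scoring heuristic along these lines would sum or average the entropies over all columns of a layout and prefer layouts with the lower value.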

Page 33: DNA Fragment Assembly

More Observations

Local optimal decisions do not always work. Can we do any better? Use some heuristics.

• Assembly in practice consists of:
  1. Finding overlaps
  2. Building the layout
  3. Computing the consensus

Advantages:
• We treat each problem separately.

Disadvantages:
• It becomes difficult to understand the relationship between the input and the final output

Page 34: DNA Fragment Assembly

Heuristics

• Finding overlaps
  - use a dynamic programming approach with a scoring system such as 1 for matches, -1 for mismatches, -2 for spaces
  - do not charge for spaces after the first sequence or before the second one
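One way to realize this scoring scheme is the semiglobal dynamic program sketched below. It is an illustration under one particular reading of the slide: the first sequence f is assumed to lie to the left of the second sequence g, so the unaligned start of f and the unaligned end of g are free, while every other space costs -2. The function name and parameter defaults are chosen here.

```python
def overlap_score(f, g, match=1, mismatch=-1, space=-2):
    """Best score of aligning a suffix of f with a prefix of g."""
    # Row 0: characters of g placed before f begins are charged.
    prev = [j * space for j in range(len(g) + 1)]
    for i in range(1, len(f) + 1):
        # Column 0 is 0: characters of f before g begins are free.
        curr = [0] * (len(g) + 1)
        for j in range(1, len(g) + 1):
            diagonal = prev[j - 1] + (match if f[i - 1] == g[j - 1] else mismatch)
            curr[j] = max(diagonal, prev[j] + space, curr[j - 1] + space)
        prev = curr
    # Maximum over the last row: characters of g after f ends are free.
    return max(prev)

print(overlap_score("TACGG", "GGACG"))
```

A score of 0 is always attainable by the empty overlap, so in this sketch only strictly positive scores signal a candidate overlap.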

Page 35: DNA Fragment Assembly

Heuristics

Ordering Fragments
• there is no algorithm simple and general enough

Considerations:
• Use the set $DF = F \cup \bar{F}$, where $\bar{F}$ holds the reverse complements of the fragments in F
• If f = uv and g = wx, then $\bar{g} = \bar{x}\bar{w}$ and $\bar{f} = \bar{v}\bar{u}$
• if the end of f is approximately the same as the beginning of g, we can expect that whatever criterion is used to assess the similarity between f and g, the same criterion will apply to their reverse complements
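The symmetry described here can be observed with exact overlaps: if the end of f matches the beginning of g, then the end of the reverse complement of g matches the beginning of the reverse complement of f, with the same length. The sketch below is only a demonstration on two hypothetical fragments; `doubled_set` is an illustrative name for the construction of DF.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(s):
    """Complement every base, then reverse the string."""
    return s.translate(_COMPLEMENT)[::-1]

def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def doubled_set(fragments):
    """DF: every fragment together with its reverse complement."""
    return list(fragments) + [reverse_complement(f) for f in fragments]

f, g = "TACGG", "GGACG"
print(overlap(f, g))
print(overlap(reverse_complement(g), reverse_complement(f)))
print(doubled_set([f, g]))
```

This is why an overlap graph built on DF contains every edge twice, once per strand, and why both strands end up being constructed simultaneously.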

Page 36: DNA Fragment Assembly

Finding overlaps
• Finding a good ordering of overlapping fragments means finding a directed path in the overlap graph
• Both strands are constructed simultaneously
• Contained fragments are not essential in the path
• A disconnected graph indicates lack of coverage
• The presence of cycles indicates repeats
• Unusually high coverage indicates possible repeats
• The presence of reverse complement cycles indicates inverted repeats

Page 37: DNA Fragment Assembly

Alignment and Consensus

• Use the minimal sum of the distances

Suppose we have three fragments:
f = CATAGTC
g = TAACTAT
h = AGACTATCC

Two semiglobal alignments for f and g are:

First alignment:
C A T A G T C _ _ _
_ _ T A A _ C T A T

Second alignment:
C A T A G T C _ _ _
_ _ T A _ A C T A T

Layout with h, using the second alignment:
C A T A G T C _ _ _ _ _
_ _ T A _ A C T A T _ _
_ _ _ A G A C T A T C C

C A T A G A C T A T C C (consensus S)

If we use the second alignment: $d_s(f, S) = 1$, $d_s(g, S) = 1$, $d_s(h, S) = 0$, and $d_s(f, S) + d_s(g, S) + d_s(h, S) = 2$

If we use the first alignment and A is chosen for column 6: $d_s(f, S) = 1$, $d_s(g, S) = 2$, $d_s(h, S) = 0$, and $d_s(f, S) + d_s(g, S) + d_s(h, S) = 3$
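The two sums can be reproduced by counting, for each row of a layout, the columns in which the row disagrees with the consensus. The sketch below makes an assumption worth stating: it scores the alignment as given rather than recomputing an optimal one, so the count is the cost of that particular alignment. Leading and trailing `_` mark the free extremities and are not charged; an internal `_` is a space and is compared like any other character.

```python
def distance_to_consensus(row, consensus, space="_"):
    """Number of disagreeing columns within the span the row covers."""
    start = len(row) - len(row.lstrip(space))
    end = len(row.rstrip(space))
    return sum(row[i] != consensus[i] for i in range(start, end))

consensus = "CATAGACTATCC"
first_layout = ["CATAGTC_____", "__TAA_CTAT__", "___AGACTATCC"]
second_layout = ["CATAGTC_____", "__TA_ACTAT__", "___AGACTATCC"]

for name, layout in (("first", first_layout), ("second", second_layout)):
    distances = [distance_to_consensus(row, consensus) for row in layout]
    print(name, distances, sum(distances))
```

The second layout wins because moving the space in g by one column lets g agree with h in both columns instead of disagreeing twice.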

Page 38: DNA Fragment Assembly

A Linked List of Bases

• Sometimes we know what is best only later. Is there a structure that helps us?
• Use a Linked List of Bases
  - matched bases are unified in one node
  - unmatched bases are left separate

Technique: Traverse this graph in topological order.

G T C A T A C T A T

A

Page 39: DNA Fragment Assembly

Conclusions

• The models fail to address all the issues involved in the problem
• The effective real problem is NP-hard
• Approximation gives us some help but fails in some cases
• Heuristics help, and the problem is broken into 3 smaller problems:
  1. finding overlaps
  2. building the layout
  3. computing the consensus
• Are we sure there is nothing else to do? We will look next week at the smaller problem of comparing only two sequences instead of many. Will we find something better?