A New Approach to Fragment Assembly in DNA Sequencing

A New Approach to Fragment Assembly in DNA Sequencing. Fei Wu, April 24, 2006


Page 1

A New Approach to Fragment Assembly in DNA Sequencing

Fei Wu

April 24, 2006

Page 2

Preface

• Introduce the author
• The background of the paper
• The history of DNA sequencing

Page 3

Traditional DNA Sequencing

• Read 500 – 700 nucleotides at a time from the small fragments (Sanger method)
• Shear DNA into millions of small fragments

[Figure; labels: "Shake", "DNA"]

Page 4

Fragment Assembly

• Computational challenge: assemble individual short fragments (reads) into a single genomic sequence ("superstring")

• Until the late 1990s, shotgun fragment assembly of the human genome was viewed as an intractable problem

Page 5

Shortest Superstring Problem

Problem: Given a set of strings, find a shortest string that contains all of them

Input: Strings s_1, s_2, …, s_n

Output: A string s that contains all strings s_1, s_2, …, s_n as substrings, such that the length of s is minimized

Complexity: NP-complete

Note: this formulation does not take into account sequencing errors

Page 6

Reducing SSP to the Traveling Salesman Problem

Define overlap(s_i, s_j) as the length of the longest prefix of s_j that matches a suffix of s_i.

Example: two copies of the read aaaggcatcaaatctaaaggcatcaaa, shifted so that the suffix aaaggcatcaaa of the first matches the same prefix of the second (overlap = 12):

aaaggcatcaaatctaaaggcatcaaa
               aaaggcatcaaatctaaaggcatcaaa

Construct a graph with n vertices representing the n strings s_1, s_2, …, s_n.

Insert edges of length overlap(s_i, s_j) between vertices s_i and s_j.

Find the path which visits every vertex exactly once and has maximum total overlap (equivalently, the shortest such path when each edge length is taken as the negative of its overlap). This is the Traveling Salesman Problem (TSP), which is also NP-complete.
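The reduction above can be sketched in a few lines of Python. This is a minimal illustration, not the method of the paper: the helper names (`overlap`, `best_hamiltonian_path`, `merge_along`) and the three example reads are hypothetical, the search is exhaustive and therefore only usable for a handful of reads, and it assumes that no read is a substring of another.

```python
from itertools import permutations


def overlap(s_i: str, s_j: str) -> int:
    """Length of the longest prefix of s_j that matches a suffix of s_i."""
    for k in range(min(len(s_i), len(s_j)), 0, -1):
        if s_i.endswith(s_j[:k]):
            return k
    return 0


def best_hamiltonian_path(reads):
    """Try every vertex order and keep the one with maximum total overlap.

    Exhaustive search over n! orders: fine for a toy example, hopeless
    for real data, which is the point of the NP-completeness remark.
    """
    best_order, best_total = None, -1
    for order in permutations(reads):
        total = sum(overlap(a, b) for a, b in zip(order, order[1:]))
        if total > best_total:
            best_order, best_total = order, total
    return best_order, best_total


def merge_along(order):
    """Spell the superstring obtained by overlapping consecutive reads."""
    result = order[0]
    for prev, cur in zip(order, order[1:]):
        result += cur[overlap(prev, cur):]
    return result


# Hypothetical reads, chosen only for illustration.
reads = ["GGCA", "CATT", "ATGG"]
order, total = best_hamiltonian_path(reads)
superstring = merge_along(order)
```

The length of the resulting superstring equals the total length of the reads minus the total overlap along the chosen path, which is why maximizing overlap minimizes the superstring.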

Page 7

De Bruijn Graph Properties

(Here the graph is the n-dimensional de Bruijn graph over m symbols: its vertices are the m^n sequences of length n.)

• If n = 1 then the condition for any two vertices forming an edge holds vacuously, and hence all the vertices are connected, forming a total of m^2 edges.

• Each vertex has exactly m incoming and m outgoing edges.
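Both properties can be checked directly on small cases. The sketch below assumes the usual definition of the n-dimensional de Bruijn graph over m symbols (vertices are all length-n strings, with an edge from u to v when dropping the first symbol of u and appending one symbol gives v); the function names are illustrative.

```python
from itertools import product


def de_bruijn_edges(alphabet: str, n: int):
    """Vertices and edges of the n-dimensional de Bruijn graph."""
    vertices = ["".join(p) for p in product(alphabet, repeat=n)]
    # Shift u left by one symbol and append each possible symbol c.
    edges = [(u, u[1:] + c) for u in vertices for c in alphabet]
    return vertices, edges


def degree_counts(vertices, edges):
    """In-degree and out-degree of every vertex."""
    in_deg = {v: 0 for v in vertices}
    out_deg = {v: 0 for v in vertices}
    for u, v in edges:
        out_deg[u] += 1
        in_deg[v] += 1
    return in_deg, out_deg


# n = 1 over the DNA alphabet (m = 4): every pair of vertices is joined.
v1, e1 = de_bruijn_edges("ACGT", 1)

# n = 2: each of the 16 vertices has 4 incoming and 4 outgoing edges.
v2, e2 = de_bruijn_edges("ACGT", 2)
in_deg, out_deg = degree_counts(v2, e2)
```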

Page 8

Sequencing by Hybridization

Page 9

l-mer (l-tuple) composition

Spectrum(s, l): unordered multiset of all possible (n – l + 1) l-mers in a string s of length n

The order of individual elements in Spectrum(s, l) does not matter

For s = TATGGTGC all of the following are equivalent representations of Spectrum(s, 3):

{TAT, ATG, TGG, GGT, GTG, TGC}
{ATG, GGT, GTG, TAT, TGC, TGG}
{TGG, TGC, TAT, GTG, GGT, ATG}
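As a small illustration of the definition, the spectrum can be computed with a sliding window. Representing the multiset as a `collections.Counter` is a choice made here for the sketch, since a Counter ignores order but keeps multiplicities.

```python
from collections import Counter


def spectrum(s: str, l: int) -> Counter:
    """Multiset of all (n - l + 1) l-mers of s, where n = len(s)."""
    return Counter(s[i:i + l] for i in range(len(s) - l + 1))


spec = spectrum("TATGGTGC", 3)

# The three listings on the slide describe the same multiset.
same_1 = spec == Counter(["TAT", "ATG", "TGG", "GGT", "GTG", "TGC"])
same_2 = spec == Counter(["ATG", "GGT", "GTG", "TAT", "TGC", "TGG"])
same_3 = spec == Counter(["TGG", "TGC", "TAT", "GTG", "GGT", "ATG"])
```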

Page 10

SBH: Eulerian Path Approach

S = { ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT }

Vertices correspond to (l – 1)-mers: { AT, TG, GC, GG, GT, CA, CG }

Edges correspond to l-mers from S

[Figure: graph on the vertices AT, TG, GC, GG, GT, CG, CA]

The path visits every EDGE exactly once.
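A minimal sketch of this construction follows, using Hierholzer's algorithm to find one Eulerian path. It is an illustration rather than the EULER implementation, the function names are hypothetical, and it assumes the spectrum also contains TGG, which both 10-nucleotide reconstructions shown on the next slide require.

```python
from collections import defaultdict


def eulerian_path(kmers):
    """One Eulerian path in the graph whose edges are the given l-mers."""
    graph = defaultdict(list)
    balance = defaultdict(int)  # out-degree minus in-degree
    for kmer in kmers:
        u, v = kmer[:-1], kmer[1:]  # edge between (l - 1)-mers
        graph[u].append(v)
        balance[u] += 1
        balance[v] -= 1
    # Start where out-degree exceeds in-degree; otherwise start anywhere.
    start = next((v for v, b in balance.items() if b == 1), kmers[0][:-1])
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if graph[u]:
            stack.append(graph[u].pop())  # follow an unused edge
        else:
            path.append(stack.pop())      # dead end: emit vertex
    path.reverse()
    if len(path) != len(kmers) + 1:
        raise ValueError("spectrum does not admit an Eulerian path")
    return path


def spell(path):
    """Sequence spelled by a path of overlapping (l - 1)-mers."""
    return path[0] + "".join(v[-1] for v in path[1:])


# Assumed spectrum (includes TGG, see the note above).
S = ["ATG", "TGG", "TGC", "GTG", "GGC", "GCA", "GCG", "CGT"]
sequence = spell(eulerian_path(S))
```

Which of the possible reconstructions is returned depends on the order in which edges are stored, so the result should be read as one valid answer, not the unique one.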

Page 11

The graph on the vertices { AT, TG, GC, GG, GT, CA, CG } contains two different Eulerian paths, and hence two reconstructions:

ATGGCGTGCA ATGCGTGGCA

[Figure: the two Eulerian paths drawn on the graph with vertices AT, TG, GC, GG, GT, CG, CA]
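The ambiguity can be reproduced by enumerating every sequence consistent with the spectrum. The backtracking sketch below is illustrative only (the function name is hypothetical, and it assumes l ≥ 2); it uses the eight 3-mers of the two sequences above, including TGG.

```python
from collections import Counter


def all_reconstructions(kmers):
    """Every string whose multiset of l-mers equals the given spectrum."""
    l = len(kmers[0])
    remaining = Counter(kmers)
    results = set()

    def extend(text, left):
        if left == 0:            # every l-mer (edge) used exactly once
            results.add(text)
            return
        tail = text[-(l - 1):]   # current vertex: last (l - 1) symbols
        for kmer in list(remaining):
            if remaining[kmer] > 0 and kmer[:-1] == tail:
                remaining[kmer] -= 1
                extend(text + kmer[-1], left - 1)
                remaining[kmer] += 1  # backtrack

    for first in list(remaining):
        remaining[first] -= 1
        extend(first, len(kmers) - 1)
        remaining[first] += 1
    return sorted(results)


S = ["ATG", "TGG", "TGC", "GTG", "GGC", "GCA", "GCG", "CGT"]
candidates = all_reconstructions(S)
```

For this spectrum the enumeration yields exactly the two sequences listed on the slide, so the spectrum alone cannot decide between them.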

Page 12

Error Correction Or Data Corruption

The EULER algorithm sometimes introduces errors.

It accepts these errors in order to reduce the complexity of the de Bruijn graph.

Reduction of the de Bruijn graph eliminates false edges.

For example, in the N. meningitidis sequencing project, orphan elimination corrects 234,410 errors and introduces 1,452 errors.

Page 13

Observations on EULER

Page 14

Conclusions

Finishing is a bottleneck in large-scale DNA sequencing.

EULER has excellent scaling potential.

The complexity of EULER is mainly defined by the number of tangles rather than the number of repeats or the length of the genome.

Page 15

RESULTS AND DISCUSSION

The general performance of SEA on the benchmark

Prediction ambiguity improves alignment quality

Alignment quality versus local structure prediction ambiguity

Page 16

CONCLUSION

Page 17

Any Questions?
