A New Approach to Fragment Assembly in DNA Sequencing. Fei Wu, April 24, 2006.

Preface

Introduce the author
The background of the paper
The history of DNA sequencing

Traditional DNA Sequencing

• Read 500–700 nucleotides at a time from the small fragments (Sanger method)

• Shear DNA into millions of small fragments

[Figure: labels "Shake" and "DNA"]

Fragment Assembly

• Computational challenge: assemble individual short fragments (reads) into a single genomic sequence ("superstring")

• Until the late 1990s, shotgun fragment assembly of the human genome was viewed as an intractable problem

Shortest Superstring Problem

Problem: Given a set of strings, find a shortest string that contains all of them

Input: Strings s1, s2, …, sn

Output: A string s that contains all strings s1, s2, …, sn as substrings, such that the length of s is minimized

Complexity: NP-complete

Note: this formulation does not take into account sequencing errors
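The problem statement above can be illustrated with a minimal brute-force sketch. It is not taken from the slides: the function names `merge` and `shortest_superstring` are hypothetical, and the exhaustive search over all orderings is only usable for a handful of short strings, which is consistent with the problem being NP-complete.

```python
from itertools import permutations


def merge(a: str, b: str) -> str:
    """Append b to a, reusing the longest suffix of a that is a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return a + b[k:]
    return a + b


def shortest_superstring(strings: list[str]) -> str:
    """Exhaustive search over all orderings (exponential; tiny inputs only)."""
    # Drop duplicates and strings that are already contained in another string.
    unique = list(dict.fromkeys(strings))
    reads = [s for s in unique if not any(s != t and s in t for t in unique)]
    if not reads:
        return ""
    best = None
    for order in permutations(reads):
        candidate = order[0]
        for s in order[1:]:
            candidate = merge(candidate, s)
        if best is None or len(candidate) < len(best):
            best = candidate
    return best


if __name__ == "__main__":
    print(shortest_superstring(["TGG", "ATG", "GGC"]))
```

Because the sketch ignores sequencing errors, it shares the limitation noted on the slide.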

Reducing SSP to Eulerian path problem

Define overlap(si, sj) as the length of the longest prefix of sj that matches a suffix of si.

[Figure: the string aaaggcatcaaatctaaaggcatcaaa aligned against shifted copies of itself to illustrate a suffix-prefix overlap]

Construct a graph with n vertices representing the n strings s1, s2, …, sn.

Insert edges of length overlap(si, sj) between vertices si and sj.

Find the path that visits every vertex exactly once and has the largest total overlap (equivalently, the shortest path when each edge length is taken as the negative of the overlap). This is the Traveling Salesman Problem (TSP), which is also NP-complete.
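The construction above can be sketched as follows. This is an illustrative sketch only: `overlap`, `overlap_graph` and `best_path` are hypothetical names, the example reads are made up, and the path search simply tries every vertex ordering, so it is feasible only for very small inputs.

```python
from itertools import permutations


def overlap(si: str, sj: str) -> int:
    """Length of the longest prefix of sj that matches a suffix of si."""
    for k in range(min(len(si), len(sj)), 0, -1):
        if si.endswith(sj[:k]):
            return k
    return 0


def overlap_graph(strings: list[str]) -> dict[tuple[int, int], int]:
    """Directed edge (i, j) weighted by overlap(strings[i], strings[j])."""
    return {
        (i, j): overlap(a, b)
        for i, a in enumerate(strings)
        for j, b in enumerate(strings)
        if i != j
    }


def best_path(strings: list[str]) -> tuple[tuple[int, ...], int]:
    """Vertex ordering that visits every string once with maximum total overlap."""
    graph = overlap_graph(strings)
    best_order, best_total = tuple(range(len(strings))), -1
    for order in permutations(range(len(strings))):
        total = sum(graph[(u, v)] for u, v in zip(order, order[1:]))
        if total > best_total:
            best_order, best_total = order, total
    return best_order, best_total


if __name__ == "__main__":
    reads = ["ATGG", "GGCA", "CATT"]  # made-up reads for illustration
    print(overlap_graph(reads))
    print(best_path(reads))
```

Maximizing the total overlap along the path is the same as minimizing the length of the resulting superstring, which is why the search is phrased as a maximization here.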

De Bruijn graph: Properties (for the n-dimensional de Bruijn graph over m symbols)

If n = 1, the condition for any two vertices to form an edge holds vacuously, so every vertex is connected to every vertex, giving a total of m^2 edges.

Each vertex has exactly m incoming and m outgoing edges.
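Both properties can be checked on small cases with a short sketch. It assumes the standard definition, which the slide does not spell out: vertices are all length-n sequences over m symbols, and there is an edge from u to v when the last n - 1 symbols of u equal the first n - 1 symbols of v. The function name `de_bruijn_graph` is hypothetical.

```python
from collections import Counter
from itertools import product


def de_bruijn_graph(m: int, n: int):
    """Vertices: all length-n tuples over m symbols.
    Edge u -> v when u without its first symbol equals v without its last."""
    vertices = list(product(range(m), repeat=n))
    edges = [(u, v) for u in vertices for v in vertices if u[1:] == v[:-1]]
    return vertices, edges


if __name__ == "__main__":
    # n = 1: the edge condition compares two empty tuples, so it always holds.
    vertices, edges = de_bruijn_graph(4, 1)
    print(len(edges))  # m^2 edges for m = 4

    # General case: every vertex has m outgoing and m incoming edges.
    vertices, edges = de_bruijn_graph(2, 3)
    out_degree = Counter(u for u, _ in edges)
    in_degree = Counter(v for _, v in edges)
    print(all(out_degree[v] == 2 and in_degree[v] == 2 for v in vertices))
```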

Sequencing by Hybridization

l-mer (l-tuple) composition

Spectrum(s, l): the unordered multiset of all (n – l + 1) l-mers in a string s of length n

The order of individual elements in Spectrum(s, l) does not matter

For s = TATGGTGC, all of the following are equivalent representations of Spectrum(s, 3):

{TAT, ATG, TGG, GGT, GTG, TGC}
{ATG, GGT, GTG, TAT, TGC, TGG}
{TGG, TGC, TAT, GTG, GGT, ATG}
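The definition can be made concrete with a minimal sketch. The function name `spectrum` is an assumption for illustration; a `Counter` is used because it behaves like an unordered multiset, so element order is irrelevant when two spectra are compared.

```python
from collections import Counter


def spectrum(s: str, l: int) -> Counter:
    """Multiset of the (n - l + 1) l-mers of s, where n = len(s)."""
    return Counter(s[i:i + l] for i in range(len(s) - l + 1))


if __name__ == "__main__":
    spec = spectrum("TATGGTGC", 3)
    # The three listings on the slide are the same multiset in different orders.
    print(spec == Counter(["TAT", "ATG", "TGG", "GGT", "GTG", "TGC"]))
    print(spec == Counter(["ATG", "GGT", "GTG", "TAT", "TGC", "TGG"]))
    print(spec == Counter(["TGG", "TGC", "TAT", "GTG", "GGT", "ATG"]))
```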

SBH: Eulerian Path Approach

S = { ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT }

Vertices correspond to (l – 1)-mers: { AT, TG, GC, GG, GT, CA, CG }

Edges correspond to l-mers from S

[Figure: graph on the vertices AT, TG, GC, GG, GT, CA, CG]

The path visits every EDGE exactly once
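The Eulerian path approach can be sketched with Hierholzer's algorithm, a standard method that the slides do not name. The sketch assumes the spectrum includes TGG: that 3-mer occurs in both reconstructed sequences, and without it the vertices TG and GG are unbalanced and no Eulerian path exists. The names `eulerian_path` and `spell` are hypothetical, and the code assumes an Eulerian path exists.

```python
from collections import defaultdict


def eulerian_path(lmers: list[str]) -> list[str]:
    """Hierholzer's algorithm on the graph whose edges are the l-mers
    and whose vertices are their (l - 1)-mer prefixes and suffixes."""
    graph = defaultdict(list)
    balance = defaultdict(int)  # out-degree minus in-degree
    for lmer in lmers:
        u, v = lmer[:-1], lmer[1:]
        graph[u].append(v)
        balance[u] += 1
        balance[v] -= 1
    # Start at the vertex with one extra outgoing edge, if there is one.
    starts = [v for v, b in balance.items() if b == 1]
    start = starts[0] if starts else lmers[0][:-1]
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if graph[u]:
            stack.append(graph[u].pop())  # follow an unused edge
        else:
            path.append(stack.pop())  # dead end: vertex is finished
    path.reverse()
    return path


def spell(path: list[str]) -> str:
    """Sequence spelled by a walk over overlapping (l - 1)-mers."""
    return path[0] + "".join(v[-1] for v in path[1:])


if __name__ == "__main__":
    S = ["ATG", "TGG", "TGC", "GTG", "GGC", "GCA", "GCG", "CGT"]
    print(spell(eulerian_path(S)))
```

Unlike the Hamiltonian (vertex-visiting) formulation, this search runs in time linear in the number of l-mers.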

The graph on the vertices { AT, TG, GC, GG, GT, CA, CG } contains two different Eulerian paths, which spell two different sequences:

ATGGCGTGCA

ATGCGTGGCA

[Figure: the two paths drawn on the vertices AT, TG, GC, GG, GT, CA, CG]
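The ambiguity can be reproduced with a small backtracking sketch that enumerates every sequence whose 3-mer multiset equals the spectrum. It assumes the spectrum is { ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT }, i.e. the eight 3-mers of the two sequences above; the function name `all_reconstructions` is hypothetical, and the search is exponential in the worst case.

```python
def all_reconstructions(lmers: list[str]) -> set[str]:
    """Every string whose l-mer multiset equals `lmers` (assumes l >= 2)."""
    l = len(lmers[0])
    results: set[str] = set()

    def extend(seq: str, remaining: list[str]) -> None:
        if not remaining:
            results.add(seq)  # every l-mer (edge) has been used exactly once
            return
        suffix = seq[-(l - 1):]
        for i, lmer in enumerate(remaining):
            # Follow an unused edge leaving the current (l - 1)-mer.
            if lmer[:-1] == suffix:
                extend(seq + lmer[-1], remaining[:i] + remaining[i + 1:])

    for i, first in enumerate(lmers):
        extend(first, lmers[:i] + lmers[i + 1:])
    return results


if __name__ == "__main__":
    S = ["ATG", "TGG", "TGC", "GTG", "GGC", "GCA", "GCG", "CGT"]
    for seq in sorted(all_reconstructions(S)):
        print(seq)
```

Each result corresponds to one Eulerian path, so more than one result means the spectrum alone does not determine the sequence.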

Error Correction or Data Corruption

The EULER algorithm sometimes introduces errors.

It introduces these errors as a side effect of reducing the complexity of the de Bruijn graph.

Reduction of the de Bruijn graph eliminates false edges.

For example, in the N. meningitidis sequencing project, orphan elimination corrects 234410 errors and introduces 1452 errors.

Observations of the EULER

Conclusions

Finishing is a bottleneck in large-scale DNA sequencing.

EULER has excellent scaling potential.

The complexity of EULER is mainly defined by the number of tangles rather than by the number of repeats or the length of the genome.

RESULTS AND DISCUSSION

The general performance of SEA on the benchmark

Prediction ambiguity improves alignment quality

Alignment quality versus local structure prediction ambiguity

CONCLUSION

Any Questions?
