A New Approach to Fragment Assembly in DNA Sequencing. Fei Wu, April 24, 2006.



2

Preface

• Introduce the author
• The background of the paper
• The history of DNA sequencing

3

Traditional DNA Sequencing

• Read 500–700 nucleotides at a time from the small fragments (Sanger method)
• Shear DNA into millions of small fragments

[Figure: DNA is shaken to shear it into small fragments]

4

Fragment Assembly

• Computational challenge: assemble individual short fragments (reads) into a single genomic sequence (“superstring”)
• Until the late 1990s, shotgun fragment assembly of the human genome was viewed as an intractable problem

5

Shortest Superstring Problem

Problem: Given a set of strings, find a shortest string that contains all of them

Input: Strings s_1, s_2, …, s_n
Output: A string s that contains all strings s_1, s_2, …, s_n as substrings, such that the length of s is minimized

Complexity: NP-complete
Note: this formulation does not take into account sequencing errors
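To make the problem statement concrete, here is a minimal sketch that tries every ordering of the input strings and merges neighbours by their longest overlap. The function names `merge` and `shortest_superstring` are illustrative and do not come from the slides. Because the search is exhaustive, it is only usable for a handful of short strings, which is consistent with the NP-completeness noted above.

```python
from itertools import permutations


def merge(a: str, b: str) -> str:
    """Append b to a, reusing the longest suffix of a that is a prefix of b."""
    if b in a:  # b is already covered by a
        return a
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return a + b[k:]
    return a + b  # no overlap at all


def shortest_superstring(strings):
    """Exhaustive search over all orderings (illustrative, tiny inputs only)."""
    best = None
    for order in permutations(strings):
        candidate = order[0]
        for nxt in order[1:]:
            candidate = merge(candidate, nxt)
        if best is None or len(candidate) < len(best):
            best = candidate
    return best


result = shortest_superstring(["ATG", "TGG", "GGC"])
```

For the three reads above, `result` is a string of length 5 that contains each read as a substring.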

6

Reducing SSP to TSP

• Define overlap(s_i, s_j) as the length of the longest prefix of s_j that matches a suffix of s_i.

  Example strings: aaaggcatcaaatctaaaggcatcaaa and a shifted copy of the same string

• Construct a graph with n vertices representing the n strings s_1, s_2, …, s_n.
• Insert edges of length overlap(s_i, s_j) between vertices s_i and s_j.
• Find the shortest path which visits every vertex exactly once. This is the Traveling Salesman Problem (TSP), which is also NP-complete.
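The overlap definition and the graph construction above can be sketched in a few lines. This is an illustration under simple assumptions: the names `overlap` and `overlap_graph` are hypothetical, the graph is stored as a dictionary keyed by ordered index pairs, and no attempt is made to solve the resulting TSP instance.

```python
def overlap(si: str, sj: str) -> int:
    """Length of the longest prefix of sj that matches a suffix of si."""
    for k in range(min(len(si), len(sj)), 0, -1):
        if si.endswith(sj[:k]):
            return k
    return 0


def overlap_graph(strings):
    """Map each ordered pair (i, j), i != j, to overlap(strings[i], strings[j])."""
    return {
        (i, j): overlap(a, b)
        for i, a in enumerate(strings)
        for j, b in enumerate(strings)
        if i != j
    }


reads = ["ATGC", "GCAT", "CATT"]
graph = overlap_graph(reads)
```

Note that the overlap is directional: `overlap("GCAT", "CATT")` is 3, while `overlap("CATT", "GCAT")` is 0.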

7

de Bruijn Graph: Properties

(Here m is the number of symbols in the alphabet and n is the length of each vertex label.)

• If n = 1, the condition for any two vertices forming an edge holds vacuously, and hence all the vertices are connected, forming a total of m^2 edges.
• Each vertex has exactly m incoming and m outgoing edges.
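Both properties can be checked on small cases. The sketch below assumes the standard construction: vertices are all length-n strings over an alphabet of m symbols, and there is an edge from v to w whenever w is v with its first symbol dropped and one symbol appended. The names `de_bruijn_graph` and `degree_counts` are illustrative.

```python
from collections import Counter
from itertools import product


def de_bruijn_graph(alphabet: str, n: int):
    """Return (vertices, edges) of the n-dimensional de Bruijn graph."""
    vertices = ["".join(p) for p in product(alphabet, repeat=n)]
    # Shift v left by one symbol and append each possible symbol.
    edges = [(v, v[1:] + a) for v in vertices for a in alphabet]
    return vertices, edges


def degree_counts(edges):
    """Return (out_degree, in_degree) counters for a list of directed edges."""
    out_deg = Counter(u for u, _ in edges)
    in_deg = Counter(v for _, v in edges)
    return out_deg, in_deg


# n = 1 over the DNA alphabet (m = 4): every pair of vertices is joined.
v1, e1 = de_bruijn_graph("ACGT", 1)

# n = 2: every vertex should have m incoming and m outgoing edges.
v2, e2 = de_bruijn_graph("ACGT", 2)
out_deg, in_deg = degree_counts(e2)
```

With m = 4, the n = 1 graph has 16 edges (including self-loops), and in the n = 2 graph every vertex has in-degree and out-degree 4.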

8

Sequencing by Hybridization

9

l-mer (l-tuple) composition

• Spectrum(s, l): unordered multiset of all possible (n – l + 1) l-mers in a string s of length n
• The order of individual elements in Spectrum(s, l) does not matter
• For s = TATGGTGC, all of the following are equivalent representations of Spectrum(s, 3):

  {TAT, ATG, TGG, GGT, GTG, TGC}
  {ATG, GGT, GTG, TAT, TGC, TGG}
  {TGG, TGC, TAT, GTG, GGT, ATG}
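The definition translates directly into code. In this sketch the multiset is represented with `collections.Counter`, which is a choice made here for illustration. A Counter ignores order and keeps multiplicities, so the three listings above compare equal.

```python
from collections import Counter


def spectrum(s: str, l: int) -> Counter:
    """Multiset of all (len(s) - l + 1) l-mers of s; order is irrelevant."""
    return Counter(s[i:i + l] for i in range(len(s) - l + 1))


spec = spectrum("TATGGTGC", 3)
same = spec == Counter(["TGG", "TGC", "TAT", "GTG", "GGT", "ATG"])
```

Here `spec` holds n – l + 1 = 6 l-mers in total, and `same` is `True`.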

10

SBH: Eulerian Path Approach

S = { ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT }

• Vertices correspond to (l – 1)-mers: { AT, TG, GC, GG, GT, CA, CG }
• Edges correspond to l-mers from S

[Figure: directed graph on the vertices AT, TG, GC, GG, GT, CA, CG]

The path visits every EDGE once
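The construction can be sketched as follows: each l-mer becomes an edge from its (l – 1)-prefix to its (l – 1)-suffix, and a stack-based (Hierholzer-style) walk looks for a path that uses every edge once. This is an illustrative sketch, not the EULER implementation, and the names `eulerian_path` and `spell` are hypothetical. One assumption is made about the data: the l-mer TGG is included in S, because both reconstructed sequences on the following slide contain it and no Eulerian path exists without it.

```python
from collections import Counter, defaultdict


def eulerian_path(lmers):
    """Return a list of (l-1)-mer vertices that uses every l-mer edge once."""
    graph = defaultdict(list)
    balance = defaultdict(int)  # out-degree minus in-degree
    for lmer in lmers:
        u, v = lmer[:-1], lmer[1:]
        graph[u].append(v)
        balance[u] += 1
        balance[v] -= 1
    # Start where out-degree exceeds in-degree by one, else at any vertex.
    start = next((v for v, b in balance.items() if b == 1), lmers[0][:-1])
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if graph[u]:
            stack.append(graph[u].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    # Validate: consecutive vertices must reproduce exactly the input edges.
    expected = Counter((x[:-1], x[1:]) for x in lmers)
    if Counter(zip(path, path[1:])) != expected:
        raise ValueError("no Eulerian path for this spectrum")
    return path


def spell(path):
    """Spell the sequence read along a path of overlapping vertices."""
    return path[0] + "".join(v[-1] for v in path[1:])


S = ["ATG", "TGG", "TGC", "GTG", "GGC", "GCA", "GCG", "CGT"]
sequence = spell(eulerian_path(S))
```

The resulting `sequence` has length 10 and its 3-mer spectrum equals S. Which of the valid reconstructions is returned depends on the order in which edges are stored.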

11

The graph on vertices { AT, TG, GC, GG, GT, CA, CG } admits two different Eulerian paths, which spell two different sequences:

ATGGCGTGCA    ATGCGTGGCA

[Figure: the two Eulerian paths through the graph]
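The ambiguity can be verified by enumerating every Eulerian path with a simple backtracking search. In this sketch the 3-mers are taken directly from the first sequence above, so no l-mer list has to be assumed. The function name `all_reconstructions` is illustrative, and the search is exponential in the worst case, so it suits only small examples.

```python
from collections import defaultdict


def all_reconstructions(lmers):
    """Return every string spelled by an Eulerian path over the l-mers."""
    edges = defaultdict(list)  # (l-1)-mer prefix -> indices of outgoing l-mers
    for idx, lmer in enumerate(lmers):
        edges[lmer[:-1]].append(idx)
    results = set()

    def extend(vertex, used, text):
        if len(used) == len(lmers):  # every edge consumed exactly once
            results.add(text)
            return
        for idx in edges.get(vertex, []):
            if idx not in used:
                extend(lmers[idx][1:], used | {idx}, text + lmers[idx][-1])

    for start in list(edges):
        extend(start, frozenset(), start)
    return sorted(results)


seq = "ATGGCGTGCA"
lmers = [seq[i:i + 3] for i in range(len(seq) - 2)]
reconstructions = all_reconstructions(lmers)
# reconstructions == ["ATGCGTGGCA", "ATGGCGTGCA"]
```

Both sequences share the same spectrum, so the spectrum alone cannot distinguish between them.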

12

Error Correction or Data Corruption

• The EULER algorithm sometimes introduces errors.
• It introduces them while reducing the complexity of the de Bruijn graph.
• Reduction of the de Bruijn graph eliminates false edges.
• For example, in the N. meningitidis sequencing project, orphan elimination corrects 234410 errors and introduces 1452 errors.

13

Observations of EULER

14

Conclusions

• Finishing is a bottleneck in large-scale DNA sequencing.
• EULER has excellent scaling potential.
• The complexity of EULER is mainly defined by the number of tangles rather than the number of repeats or the length of the genomes.


Any Questions?
