Sequence Local Alignment using Directed Acyclic Word Graph


Do Huy Hoang


SEQUENCE ALIGNMENT


Sequence Similarity

• Alignment
  – Arrange DNA/protein sequences to show the similarity
• A gap symbol such as “-” denotes an insertion/deletion event


Other variations

• Edit distance
• Longest common substring
• Affine gap scoring
• Using a scoring matrix (BLOSUM, PAM)


Alignment score computation

• Needleman–Wunsch
  – Dynamic programming
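The dynamic program behind Needleman–Wunsch can be sketched in a few lines. This is a minimal illustration that returns only the score, not the alignment. The function name, the default scores and the linear gap penalty are assumptions made for the example; they do not come from the slides.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of strings a and b.

    The scoring defaults are illustrative assumptions, with a linear gap penalty.
    """
    # prev[j] is the best score of aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] aligned against the empty prefix of b
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + sub,   # a[i-1] aligned with b[j-1]
                            prev[j] + gap,       # a[i-1] aligned with a gap
                            curr[j - 1] + gap))  # b[j-1] aligned with a gap
        prev = curr
    return prev[-1]
```

With match = 0, mismatch = -1 and gap = -1, the negated score is the edit distance, for example `-needleman_wunsch("kitten", "sitting", 0, -1, -1)`.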


Other variations

Name           Problem                      Worst time   Average time          Memory
Four Russians  Edit distance 1,0            MN/log(N)    <not good>            MN
Ukkonen        Global edit (linear cost)    ND           N+D^2                 D^2
Waterman       Local alignment              MN           MN                    MN
Tree-tree      Local alignment              M^2 N^2      <close to M^2 N^2>
BWTSW          Meaningful local alignment   MN^2         MN^0.68


Local alignment

• Local alignment
  – Find the best alignment between two substrings, one from each sequence
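A minimal sketch of the classic local-alignment dynamic program (Smith–Waterman) is shown here to make the definition concrete. The function name, the scoring defaults and the linear gap penalty are assumptions chosen for illustration.

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=-2):
    """Return the best local alignment score between substrings of a and b.

    The scoring defaults are illustrative assumptions, with a linear gap penalty.
    """
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # The 0 lets an alignment start fresh at any cell.
            score = max(0, prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best
```

The only differences from the global recurrence are the floor at zero and taking the maximum over the whole table instead of the last cell.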


BWTSW


• BWTSW
  – Motivation
    • Scoring 75% similarity
    • Most entries of the local alignment table are zero
    • Meaningful alignment
  – Suffix tree
  – Meaningful alignment
  – Meaningful alignment with gap
  – How good is it?


Meaningful alignment (1)

• Sequence similarity sometimes implies functional similarity.

• Biologists are usually NOT interested in sequences with less than 70% similarity.

• BLAST score
  – Match = 1
  – Mismatch = -3
  – Open gap = -5
  – Extending gap = -2


Meaningful alignment (2)

• BLAST score
  – Match = 1
  – Mismatch = -3
  – Open gap = -5
  – Extending gap = -2
  – More than 75% of the aligned positions must match for a gap-free alignment to have a nonzero (positive) score
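The threshold follows from simple arithmetic. With match = 1 and mismatch = -3, a gap-free alignment of L columns with x matches scores x - 3(L - x) = 4x - 3L, which is positive only when x > 0.75 L. The helper names below are hypothetical and serve only to check that arithmetic.

```python
def ungapped_score(matches, length, match=1, mismatch=-3):
    """Score of a gap-free alignment of `length` columns containing `matches` matches."""
    return matches * match + (length - matches) * mismatch


def min_matches_for_positive(length, match=1, mismatch=-3):
    """Smallest number of matches that yields a strictly positive score."""
    return next(m for m in range(length + 1)
                if ungapped_score(m, length, match, mismatch) > 0)
```

For a 20-column alignment, 15 matches give a score of exactly zero, so at least 16 are needed.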


Meaningful alignment (3)

• BLAST score
  – Match = 1
  – Mismatch = -3
  – Open gap = -5
  – Extending gap = -2

• How many nonzero entries are there in the local alignment DP table?
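One way to get a feel for the answer is to count the positive cells directly. The sketch below is an assumption-laden illustration: it replaces the affine gap scheme with a linear penalty of -5 per gap column, and the function name and the random test sequences are not from the slides.

```python
import random


def count_positive_entries(a, b, match=1, mismatch=-3, gap=-5):
    """Count cells of the local-alignment DP table whose score is above zero.

    Assumes a linear gap penalty as a simplification of the affine scheme.
    """
    positive = 0
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            sub = match if ca == cb else mismatch
            score = max(0, prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            positive += score > 0
        prev = curr
    return positive


if __name__ == "__main__":
    rng = random.Random(0)
    n = 300
    a = "".join(rng.choice("ACGT") for _ in range(n))
    b = "".join(rng.choice("ACGT") for _ in range(n))
    # Fraction of the n * n table that is nonzero for two random DNA strings.
    print(count_positive_entries(a, b) / (n * n))
```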


How to improve?

• Idea:
  – Do not store zero-score entries
  – Use a suffix tree to prune off early


BWTSW details

• FM index for suffix tree representation
• Prune zero entries
• Store DP vector using linked list
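The three ingredients can be sketched together in a simplified form. This is not the BWTSW implementation. It assumes a linear gap penalty, represents each trie node by the list of start positions of its occurrences instead of an FM-index range, and uses a dict in place of a linked list for the sparse DP vector. The function name is hypothetical.

```python
def pruned_trie_alignment(text, pattern, match=1, mismatch=-3, gap=-5):
    """Best local alignment score found by walking the suffix trie of `text`.

    A node is the list of start positions of one substring of `text`. Its DP
    row is a dict {j: score} holding only strictly positive entries, where j
    is the end position of the aligned pattern substring.
    """
    best = 0
    # Stack items: (occurrence starts, node depth, sparse row or None for the root).
    stack = [(list(range(len(text))), 0, None)]
    while stack:
        starts, depth, row = stack.pop()
        children = {}
        for s in starts:  # group occurrences by the next text character
            if s + depth < len(text):
                children.setdefault(text[s + depth], []).append(s)
        for c, occ in children.items():
            new_row = {}
            for j in range(1, len(pattern) + 1):
                sub = match if pattern[j - 1] == c else mismatch
                cands = []
                if row is None:  # root: an alignment may start at any j
                    cands.append(sub)
                else:
                    if j - 1 in row:
                        cands.append(row[j - 1] + sub)  # c against pattern[j-1]
                    if j in row:
                        cands.append(row[j] + gap)      # c against a gap
                if j - 1 in new_row:
                    cands.append(new_row[j - 1] + gap)  # pattern[j-1] against a gap
                score = max(cands, default=0)
                if score > 0:  # prune: non-positive entries are never stored
                    new_row[j] = score
            if new_row:  # prune: a node with an empty row is not expanded
                best = max(best, max(new_row.values()))
                stack.append((occ, depth + 1, new_row))
    return best
```

Because only positive entries are propagated, every alignment the search follows has positive score on each of its prefixes, which is the "meaningful alignment" restriction.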


Analysis

• Text length = N
• Pattern length = M
• Alphabet size = σ


Average running time (1)

• Let F(L) be the number of pairs of strings (S1, S2) of length L with Score(S1,S2) > 0
  – Sizeof{(S1,S2) : Len(S1) = Len(S2) = L, Score(S1,S2) > 0}
  – F(L) counts the number of pairs with 75% identity.
• F(L) = sum(i=0..L/4, Binomial(L,i) * (σ-1)^i), where σ is the alphabet size
• F(L) ≤ k1 * k2^L
• F(log(N)) ≤ k3 * N^0.68
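The counting sum can be checked numerically on small cases. The sketch below rests on an assumed reading of the formula: S1 is held fixed, only gap-free position-by-position scoring is considered, and the term for i mismatches is Binomial(L,i) * (σ-1)^i with σ the alphabet size. The upper limit L/4 then corresponds to the condition L - 4i > 0. Both function names are hypothetical.

```python
from itertools import product
from math import comb


def count_positive_partners(length, sigma=4, match=1, mismatch=-3):
    """Number of strings S2 whose gap-free score against a fixed S1 is positive."""
    return sum(comb(length, i) * (sigma - 1) ** i
               for i in range(length + 1)
               if (length - i) * match + i * mismatch > 0)


def brute_force_count(s1, alphabet="ACGT", match=1, mismatch=-3):
    """The same quantity, obtained by enumerating every S2 over the alphabet."""
    total = 0
    for s2 in product(alphabet, repeat=len(s1)):
        score = sum(match if x == y else mismatch for x, y in zip(s1, s2))
        total += score > 0
    return total
```

For L = 5 over a four-letter alphabet, both functions give 16 out of 4^5 = 1024 candidate strings.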


Average running time (2)

• Given S1, Pr(Score(S1,S2) > 0 | S1) = F(L)/L

• For M < log(N)
  – The number of entries is
  – O(M * F(M)) < O(log(N) * F(log(N)))

• For M > log(N)
  – O(M * N * F(M) / L)

• On average
  – Time = O(M * F(log(N))) = O(M * N^0.68)


DAWG


Possible improvement of BWTSW

• Worst-case running time O(N^2 M)
  – When M = N
  – O(M N^0.68 + M^3) when the pattern is a substring of the text
• What about ST vs. ST?


• What we used in BWTSW is a suffix trie (not a suffix tree).
  – #Prove it#

• A suffix trie has O(N^2) nodes

• A DAWG is a similar structure with O(N) nodes
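The quadratic size of the suffix trie is easy to observe on small inputs, since every distinct substring is its own node. The helper below is a hypothetical brute-force counter, meant only for short strings.

```python
def suffix_trie_node_count(text):
    """Number of suffix trie nodes: one per distinct substring, plus the root."""
    substrings = {text[i:j]
                  for i in range(len(text))
                  for j in range(i + 1, len(text) + 1)}
    return len(substrings) + 1
```

For a string of n distinct characters the count is n(n+1)/2 + 1, which grows quadratically. For “abcbc” it is 13.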


DAWG (1)


DAWG (2)

• DAWG: Directed Acyclic Word Graph
• A DAWG is an acyclic automaton that recognizes all the substrings of a given string.


DAWG (3)

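• Example:
  – DAWG of “abcbc”

[Figure: DAWG of “abcbc”. Besides the source, the states are labeled with the substrings they recognize: {a}, {b}, {bc, c}, {ab}, {abc}, {abcb, bcb, cb}, {abcbc, bcbc, cbc}. Edges are labeled a, b or c.]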


DAWG (4)

• End-set view

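[Figure: DAWG of “abcbc” with each state labeled by its end-set: {0,1,2,3,4,5}, {1}, {2,4}, {3,5}, {2}, {3}, {4}, {5}, shown next to the substring-labeled view with states {a}, {b}, {bc, c}, {ab}, {abc}, {abcb, bcb, cb}, {abcbc, bcbc, cbc}.]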


Trivial DAWG construction

• Using End-set class

[Figure: end-set view of the DAWG of “abcbc”, with states {0,1,2,3,4,5}, {1}, {2,4}, {3,5}, {2}, {3}, {4}, {5}.]
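The end-set construction can be sketched directly: each state is the set of positions where its substrings end, and following a character c moves every end position i with w[i] = c to i + 1. The representation (frozensets, a dict of edges) and the function name are assumptions for illustration. This version takes quadratic time and is not an efficient construction.

```python
from collections import deque


def build_dawg_by_end_sets(w):
    """Return (states, edges) for the DAWG of w.

    states: set of frozensets of end positions (0 .. len(w)).
    edges:  dict mapping (state, character) to the target state.
    """
    source = frozenset(range(len(w) + 1))  # the empty string ends everywhere
    states, edges = {source}, {}
    queue = deque([source])
    while queue:
        state = queue.popleft()
        # Extending an occurrence that ends at i by w[i] gives one ending at i + 1.
        by_char = {}
        for i in state:
            if i < len(w):
                by_char.setdefault(w[i], set()).add(i + 1)
        for c, ends in by_char.items():
            target = frozenset(ends)
            edges[(state, c)] = target
            if target not in states:
                states.add(target)
                queue.append(target)
    return states, edges
```

For “abcbc” this yields the eight end-sets listed in the end-set view and nine edges.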


DAWG properties

• For |w|>2, the Directed Acyclic Word Graph for w has at most 2|w|-1 states, and 3|w|-4 edges
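These bounds can be checked on small strings. The sketch below uses the standard online suffix-automaton construction, which the slides do not describe, so it is a swapped-in technique used only to count states and edges. The dictionary-based state layout and the function names are assumptions.

```python
def build_suffix_automaton(w):
    """Online DAWG construction.

    Returns a list of states, each a dict with 'len' (longest string length),
    'link' (suffix link, -1 for the source) and 'next' (character -> state index).
    """
    states = [{"len": 0, "link": -1, "next": {}}]
    last = 0
    for c in w:
        cur = len(states)
        states.append({"len": states[last]["len"] + 1, "link": -1, "next": {}})
        p = last
        while p != -1 and c not in states[p]["next"]:
            states[p]["next"][c] = cur
            p = states[p]["link"]
        if p == -1:
            states[cur]["link"] = 0
        else:
            q = states[p]["next"][c]
            if states[p]["len"] + 1 == states[q]["len"]:
                states[cur]["link"] = q
            else:
                # Split q: the clone takes over the shorter strings of q.
                clone = len(states)
                states.append({"len": states[p]["len"] + 1,
                               "link": states[q]["link"],
                               "next": dict(states[q]["next"])})
                while p != -1 and states[p]["next"].get(c) == q:
                    states[p]["next"][c] = clone
                    p = states[p]["link"]
                states[q]["link"] = clone
                states[cur]["link"] = clone
        last = cur
    return states


def dawg_size(w):
    """Return (number of states, number of edges) of the DAWG of w."""
    states = build_suffix_automaton(w)
    return len(states), sum(len(s["next"]) for s in states)
```

For w = “abcbc”, `dawg_size` reports 8 states and 9 edges, within the limits 2|w|-1 = 9 and 3|w|-4 = 11.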


D(w) and ST(w^R)

• There is a map between the nodes in the DAWG D(w) and the implicit ST(w^R)
  – Example: w = abcbc, w^R = cbcba

• Store DAWG using ST, which uses only o(N) bits

[Figure: the DAWG of w = abcbc shown next to the suffix tree of w^R = cbcba, illustrating the correspondence between their nodes.]


D(w) and ST(w^R) (2)

• List all incoming edges of node q in D(w) using ST(w^R)


Local Alignment using DAWG

• Basis

• Induction
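The recurrences themselves did not survive in this transcript, so the following is only one plausible reading, not the author's formulation. It assumes a linear gap penalty, builds the DAWG with the quadratic end-set construction for brevity, and all names are hypothetical. The basis gives the source state score 0 against every pattern position. The induction step computes each state's row from its incoming edges, visiting states in topological order.

```python
def build_dawg(w):
    """End-set DAWG of w: returns (source, states in topological order, incoming edges)."""
    source = frozenset(range(len(w) + 1))
    incoming, found = {}, [source]
    seen = {source}
    for state in found:  # `found` grows while we iterate
        by_char = {}
        for i in state:
            if i < len(w):
                by_char.setdefault(w[i], set()).add(i + 1)
        for c, ends in by_char.items():
            target = frozenset(ends)
            incoming.setdefault(target, []).append((state, c))
            if target not in seen:
                seen.add(target)
                found.append(target)
    # Every edge increases the smallest end position, so this order is topological.
    return source, sorted(found, key=min), incoming


def dawg_local_alignment(text, pattern, match=1, mismatch=-3, gap=-5):
    """Best local alignment score of text and pattern, computed over the DAWG of text."""
    source, order, incoming = build_dawg(text)
    m = len(pattern)
    score = {source: [0] * (m + 1)}  # basis: the empty string scores 0 at every j
    best = 0
    for v in order[1:]:              # order[0] is the source
        row = [float("-inf")] * (m + 1)
        for j in range(m + 1):
            for u, c in incoming[v]:  # induction over incoming edges (u, c)
                row[j] = max(row[j], score[u][j] + gap)          # c against a gap
                if j > 0:
                    sub = match if c == pattern[j - 1] else mismatch
                    row[j] = max(row[j], score[u][j - 1] + sub)  # c against pattern[j-1]
            if j > 0:
                row[j] = max(row[j], row[j - 1] + gap)           # pattern[j-1] against a gap
        score[v] = row
        best = max(best, max(row))
    return best
```

Since every substring of the text is spelled by some path from the source, the maximum over all states and pattern positions equals the usual local alignment score under the same scoring.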


Extensions

• Meaningful alignment using DAWG
  – Prune the nodes whose score is less than zero

• Shortest-path pruning style
• Cache log(N) nodes: the worst-case running time is M*N*log(N); the average case is the same for M << N.