Date posted: 15-Jan-2016
Identification of Distinguishing Motifs
Zhanyong WANG (Master Degree Student)
Dept. of Computer Science, City University of Hong Kong
E-mail: zhyong@cs.cityu.edu.hk
Joint work with WangSen FENG and Lusheng WANG
Outline
• The Definitions of Problems
• Applications
• Previous work
• Our work
• Algorithm for Single Group
• Algorithm for Two Groups
• Simulation Results for Single Group
• Simulation Results for Two Groups
Motif Identification
• Two versions
1. Single Group
2. Two Groups
Single Group
• Instance: a group of n sequences.
• Objective: find a length-L motif that occurs in each of the given sequences, such that the occurrences of the motif are similar to one another
Two Groups
• Instance: two groups of sequences, B (Bad) and G (Good)
• Objective: find a motif of length L that appears in every sequence in group B and does not appear anywhere in the sequences in G
• The occurrences of the motif may contain errors
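The two-group objective can be sketched as a simple check. This is only an illustration under stated assumptions: errors are measured by Hamming distance, and the thresholds `d_bad` and `d_good` are hypothetical parameters that the slides do not define.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def min_distance(motif: str, seq: str) -> int:
    """Smallest Hamming distance between the motif and any L-mer of seq."""
    L = len(motif)
    return min(hamming(motif, seq[j:j + L]) for j in range(len(seq) - L + 1))


def is_distinguishing(motif, bad, good, d_bad, d_good):
    """Assumed criterion: the motif occurs within d_bad errors in every bad
    sequence and is at least d_good errors away from every good L-mer."""
    close_to_all_bad = all(min_distance(motif, s) <= d_bad for s in bad)
    far_from_all_good = all(min_distance(motif, s) >= d_good for s in good)
    return close_to_all_bad and far_from_all_good


bad = ["ttacgtaa", "ggacctcc"]
good = ["tttttttt"]
print(is_distinguishing("acgt", bad, good, d_bad=1, d_good=3))
```

With `d_bad=0` the same motif is rejected, because the second bad sequence only contains it with one substitution.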
Applications
1. Finding Targets for Potential Drugs
(T. Jiang, C. Trendall, S. Wang, T. Wareham, X. Zhang, 1998) (K. Lanctot, M. Li, B. Ma, S. Wang, and L. Zhang, 1999)
-- bad strings in B are from bacteria
-- good strings in G are from humans
-- find a substring s of length L that is conserved in all bad strings, but not conserved in good strings
-- use s to screen chemicals
-- the selected chemicals can then be tested as potential broad-range antibiotics
Applications
2. Creating Diagnostic Probes for Bacterial Infection
(T. Brown, G.A. Leonard, E.D. Booth, G. Kneale, 1990)
-- a group of closely related pathogenic bacteria
-- find a substring that occurs in each of the bacterial sequences (with as few substitutions as possible) and does not occur in the human sequences
Applications
3. Locating binding sites and regulatory signals
4. Creating Universal PCR Primers
5. Creating Unbiased Consensus Sequences
6. Anti-sense Drug Design
Previous work
• The closest substring problem was proved to be NP-hard, and so are the single-group and two-group problems
(K. Lanctot, M. Li, B. Ma, S. Wang, and L. Zhang, 1999)
• Polynomial time approximation schemes
-- theoretical results
-- too slow to solve practical instances
Previous Programs
• Bailey and Elkan: MEME (1994) uses a modified EM algorithm and allows the motif to be absent in some of the given sequences
• Waterman: extended sample-driven approach (1984)
• Keich and Pevzner: two programs (2002)
• Buhler and Tompa: Projection (2002) combines EM and random projection
• Price, Ramabhadran and Pevzner: PatternBranching (2003) uses branching from sample strings; faster than the previously best known program, Projection
Previous Programs (continued)
• Do not allow indels
• Only for the one group problem
• Some algorithms can handle one gap
Our work
• An extension of the EM approach
• A randomized algorithm for the single group problem which can handle indels
• We give an algorithm for the two groups problem
Representation of motifs
• Consensus pattern: choose the letter that appears the most in each of the L columns (Figure a)
• Profile: a 4×L matrix W (rows A, C, G, T); each cell W(i,j) is a number indicating the occurrence rate of letter i in column j (Figure b)
• Use the profile representation in the early stage of the EM algorithm
• Use the consensus pattern representation to improve the accuracy
(a) Occurrences and consensus patterns:
caaccca
caacccc
catcccg
catccct
cacccca
-------
consensus pattern: caaccca
another consensus pattern: catccca
(b) Profile:
A 0 1 0.4 0 0 0 0.4
C 1 0 0.2 1 1 1 0.2
G 0 0 0.0 0 0 0 0.2
T 0 0 0.4 0 0 0 0.2
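As a minimal sketch, both representations can be computed from the five occurrences in Figure (a). The function names and the dictionary layout of W are illustrative choices, not taken from the slides; ties in a column are broken in the order a, c, g, t, which is why the first of the two consensus patterns is returned.

```python
from collections import Counter

ALPHABET = "acgt"


def profile(occurrences):
    """Build the 4 x L profile: W[b][x] is the rate of letter b in column x."""
    n = len(occurrences)
    L = len(occurrences[0])
    W = {b: [0.0] * L for b in ALPHABET}
    for x in range(L):
        counts = Counter(s[x] for s in occurrences)
        for b in ALPHABET:
            W[b][x] = counts[b] / n
    return W


def consensus(W):
    """Pick the most frequent letter in each column (first letter wins ties)."""
    L = len(W[ALPHABET[0]])
    return "".join(max(ALPHABET, key=lambda b: W[b][x]) for x in range(L))


occ = ["caaccca", "caacccc", "catcccg", "catccct", "cacccca"]
W = profile(occ)
print(W["a"])
print(consensus(W))  # caaccca
```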
Computing the single group problem
The EM (Expectation Maximization) Algorithm (Wang, L., Dong, L. and Fan, H., 2004)
Input:
– n sequences S1, S2, ..., Sn
– a 4×L matrix W (the initial guess of the motif)
Output:
– a new matrix W that is a locally maximal solution
Example of W (L = 3):
A 0.25 0.0 1.0
C 0.25 1.0 0.0
G 0.25 0.0 0.0
T 0.25 0.0 0.0
Step 1: An L-mer $S_{ij}$ is the length-L substring of $S_i$ starting at position j. For each L-mer $S_{ij}$, calculate the likelihood that $S_{ij}$ is the occurrence of the motif:
$P(i,j) = \prod_{x=1}^{L} W(S_{ij}(x), x)$
To avoid zero weights, a fixed small number (0.1) is added to W(i,j).
Step 2: Normalize the likelihood:
$P'(i,j) = P(i,j) / \sum_{x=1}^{m-L+1} P(i,x)$
so that $\sum_{j=1}^{m-L+1} P'(i,j) = 1$.
Example: $S_{ij}$ = c a a
W =
a 0.25 0 1
c 0.25 1 0
g 0.25 0 0
t 0.25 0 0
P(i,j) = 0.25 × 0.1 × 1 = 0.025 (the zero entry W(a,2) contributes 0.1)
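Steps 1 and 2 can be sketched as follows. This is an illustration rather than the authors' code: following the worked example, a zero entry of W is treated as 0.1 while non-zero entries are used unchanged, and the names `likelihood` and `normalized_likelihoods` are hypothetical.

```python
PSEUDO = 0.1  # value used in place of a zero weight, as in the worked example


def likelihood(W, lmer):
    """Step 1: product over columns of the weight of the observed letter."""
    p = 1.0
    for x, b in enumerate(lmer):
        w = W[b][x]
        p *= w if w > 0 else PSEUDO
    return p


def normalized_likelihoods(W, seq, L):
    """Step 2: divide by the sum over all m-L+1 start positions of seq."""
    raw = [likelihood(W, seq[j:j + L]) for j in range(len(seq) - L + 1)]
    total = sum(raw)
    return [p / total for p in raw]


W = {
    "a": [0.25, 0.0, 1.0],
    "c": [0.25, 1.0, 0.0],
    "g": [0.25, 0.0, 0.0],
    "t": [0.25, 0.0, 0.0],
}
print(likelihood(W, "caa"))
print(normalized_likelihoods(W, "gcaat", 3))
```

The first call reproduces the slide's value of 0.025 for the L-mer c a a; the second shows that the normalized values over one sequence sum to 1.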
Step 3: Re-estimate the motif matrix W:
$W = \sum_{i=1}^{n} \sum_{j=1}^{m-L+1} W_{ij}$
where $W_{ij}$ is constructed from $S_{ij}$: in each column x, the cell in row $S_{ij}(x)$ holds the weight of $S_{ij}$, and all other cells are 0.
Example: $S_{ij}$ = c a a, with W as in Step 1, so P(i,j) = 0.25 × 0.1 × 1 = 0.025
$W_{ij}$ =
  Sij(1) Sij(2) Sij(3)
a 0 0.025 0.025
c 0.025 0 0
g 0 0 0
t 0 0 0
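A sketch of the re-estimation in Step 3, assuming the per-L-mer weights have already been computed and are passed in as `weights[i][j]` (a hypothetical layout). Each L-mer adds its weight to the cell of the letter it shows in every column, which is the same as summing all the matrices Wij.

```python
ALPHABET = "acgt"


def reestimate(seqs, weights, L):
    """Sum the matrices W_ij; weights[i][j] is the weight of the L-mer of
    seqs[i] that starts at position j (0-based)."""
    W = {b: [0.0] * L for b in ALPHABET}
    for seq, row in zip(seqs, weights):
        for j, p in enumerate(row):
            for x, b in enumerate(seq[j:j + L]):
                W[b][x] += p  # W_ij holds p in cell (S_ij(x), x), 0 elsewhere
    return W


# One sequence with a single L-mer "caa" of weight 0.025, as in the example.
W = reestimate(["caa"], [[0.025]], 3)
for b in ALPHABET:
    print(b, W[b])
```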
Step 4
Normalize W
$W'(b,x) = W(b,x) / \sum_{b' \in \{A,C,G,T\}} W(b',x)$
Replace W with W'
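The column normalization of Step 4 might look like this. The fallback to uniform weights for an all-zero column is an assumption added here to avoid division by zero; the slides do not discuss that case.

```python
def normalize_columns(W):
    """Divide every cell by its column sum so each column sums to 1."""
    letters = list(W)
    L = len(W[letters[0]])
    out = {b: [0.0] * L for b in letters}
    for x in range(L):
        total = sum(W[b][x] for b in letters)
        for b in letters:
            # Assumption: an all-zero column falls back to uniform weights.
            out[b][x] = W[b][x] / total if total > 0 else 1.0 / len(letters)
    return out


W = {"a": [1.0, 0.0], "c": [3.0, 0.0], "g": [0.0, 0.0], "t": [0.0, 0.0]}
print(normalize_columns(W))
```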
Step 5
Steps 1 to 4 are called a cycle. If W changes very little from the last cycle, then EM has converged and the algorithm ends; otherwise, go to Step 1 and start the next cycle.
Determine the amount of change:
$\max_{b,x} |W_q(b,x) - W_{q-1}(b,x)| < \epsilon$
where $W_q$ is the matrix after cycle q. Set $\epsilon$ = 0.05 so that the algorithm stops within a few cycles.
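Putting the five steps together, one possible sketch of the whole loop is shown below. It assumes that zero weights are replaced by 0.1, that the normalized likelihood P' is the weight used in Step 3, and it adds a `max_cycles` cap as a safety limit; none of these details are confirmed by the slides.

```python
ALPHABET = "acgt"
PSEUDO = 0.1   # stand-in for zero weights (assumption based on the example)
EPS = 0.05     # convergence threshold from Step 5


def em_cycle(W, seqs, L):
    """One cycle: Steps 1 to 4."""
    new = {b: [0.0] * L for b in ALPHABET}
    for seq in seqs:
        lmers = [seq[j:j + L] for j in range(len(seq) - L + 1)]
        raw = []
        for s in lmers:                      # Step 1: likelihood of each L-mer
            p = 1.0
            for x, b in enumerate(s):
                p *= W[b][x] if W[b][x] > 0 else PSEUDO
            raw.append(p)
        total = sum(raw)
        for s, p in zip(lmers, raw):         # Steps 2-3: normalize, accumulate
            for x, b in enumerate(s):
                new[b][x] += p / total
    for x in range(L):                       # Step 4: normalize each column
        col = sum(new[b][x] for b in ALPHABET)
        for b in ALPHABET:
            new[b][x] /= col
    return new


def max_change(W_new, W_old):
    """Largest absolute change of any cell between two cycles."""
    return max(abs(W_new[b][x] - W_old[b][x])
               for b in W_new for x in range(len(W_new[b])))


def run_em(W, seqs, L, eps=EPS, max_cycles=100):
    """Step 5: repeat cycles until the change drops below eps."""
    for cycle in range(1, max_cycles + 1):
        W_next = em_cycle(W, seqs, L)
        if max_change(W_next, W) < eps:
            return W_next, cycle
        W = W_next
    return W, max_cycles


seqs = ["ttacgtt", "gacggga", "ccacgcc"]
W0 = {b: [0.25] * 3 for b in ALPHABET}
W_out, cycles = run_em(W0, seqs, 3)
print(cycles, W_out)
```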
Our Algorithm for Single Group (with indels)
The general framework is the same as in the previous algorithm:
1. Get an initial guess of the motif W
2. With W as the initial value, use the new EM algorithm to update W
3. Repeat 1–2 several (Maxtrials) times and choose the best result
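The three-step frame above is a restart loop, sketched generically here. The callables `initial_guess`, `refine` and `score` are placeholders: the slides do not say how the initial guess is drawn or how the "best" result is judged, so the toy functions in the usage lines are purely illustrative.

```python
import random


def best_of_trials(initial_guess, refine, score, max_trials, rng):
    """Run max_trials independent trials and keep the highest-scoring result."""
    best, best_score = None, float("-inf")
    for _ in range(max_trials):
        W = refine(initial_guess(rng))   # steps 1 and 2
        s = score(W)
        if s > best_score:               # step 3: keep the best result
            best, best_score = W, s
    return best, best_score


# Toy stand-ins: guess a number, round it, prefer values near 0.5.
rng = random.Random(0)
best, s = best_of_trials(
    initial_guess=lambda r: r.random(),
    refine=lambda w: round(w, 1),
    score=lambda w: -abs(w - 0.5),
    max_trials=5,
    rng=rng,
)
print(best, s)
```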
Incorporating Indels
• We add the “space” as a letter, so the matrix for the EM algorithm becomes 5×L
• k: the maximum total number of indels
• For each starting position, consider all length L+h substrings, h = 0, 1, −1, …, k, −k, where h accounts for the number of indels
• For each length L+h substring, align it with the matrix
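Enumerating the candidate substrings for one starting position can be sketched as below. The function name and the choice to skip lengths that run past the end of the sequence are assumptions made for the illustration.

```python
def candidate_substrings(seq, start, L, k):
    """Return (h, substring) for every length L+h with -k <= h <= k that
    fits inside seq when it begins at position start (0-based)."""
    out = []
    for h in range(-k, k + 1):
        length = L + h
        if length > 0 and start + length <= len(seq):
            out.append((h, seq[start:start + length]))
    return out


print(candidate_substrings("acgtacgt", 2, 4, 1))
# [(-1, 'gta'), (0, 'gtac'), (1, 'gtacg')]
```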
Align a length L+h string with a 5×L matrix
• Dynamic programming, similar to pairwise string alignment
• d[i, j] is the score of aligning the first i columns in the matrix with the first j letters in the string
$d[i,j] = \max\{\, d[i-1,j-1] \times W[x,i],\; d[i-1,j] \times W[\triangle,i],\; d[i,j-1] \times e \,\}$
where x is the j-th letter of the string and △ is the space
• Computed in bottom-up order; d[L, L+h] gives the best alignment (with indels)
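The recurrence can be filled in bottom-up as sketched here. Several details are assumptions: the space letter is written `"-"`, `e` is treated as a fixed weight for a string letter that matches no column, and the base case d[0, 0] is set to 1 because the scores are multiplied.

```python
SPACE = "-"  # the extra "space" letter of the 5 x L matrix


def align_score(W, s, e):
    """Best multiplicative score of aligning all columns of W with string s."""
    L = len(W[SPACE])
    n = len(s)
    d = [[0.0] * (n + 1) for _ in range(L + 1)]
    d[0][0] = 1.0  # assumed base case: the empty alignment scores 1
    for i in range(L + 1):
        for j in range(n + 1):
            if i == 0 and j == 0:
                continue
            best = 0.0
            if i > 0 and j > 0:  # column i matched with letter j
                best = max(best, d[i - 1][j - 1] * W[s[j - 1]][i - 1])
            if i > 0:            # column i matched with a space
                best = max(best, d[i - 1][j] * W[SPACE][i - 1])
            if j > 0:            # letter j matched with no column
                best = max(best, d[i][j - 1] * e)
            d[i][j] = best
    return d[L][n]


W = {
    "a": [0.10, 0.80, 0.10],
    "c": [0.70, 0.05, 0.10],
    "g": [0.10, 0.05, 0.60],
    "t": [0.05, 0.05, 0.10],
    SPACE: [0.05, 0.05, 0.10],
}
print(align_score(W, "cag", e=0.01))   # no indel
print(align_score(W, "cg", e=0.01))    # one letter missing
print(align_score(W, "ctag", e=0.01))  # one extra letter
```

In this toy profile the exact-length string scores highest, and the strings with one missing or one extra letter still receive a non-zero score through the space row and the weight `e`.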
Continued
After calculating the motif W (profile representation: a matrix), we use the matrix W to find the occurrence of the motif in each sequence
Find the motif occurrences
• Find the occurrence of the motif in each string by scoring each L-mer:
$\sum_{i=1}^{L} W(a_i, i)$
where $a_1 a_2 a_3 \ldots a_L$ is a length-L substring (L-mer) and W is the matrix for the motif
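A sketch of the occurrence search: every L-mer of a sequence is scored with the sum above and, as an assumption about how the score is used, the highest-scoring position is reported.

```python
def best_occurrence(W, seq, L):
    """Return (start, L-mer, score) of the L-mer with the largest sum of
    W(a_i, i) over its L columns; the first maximum wins ties."""
    best_j, best_score = 0, float("-inf")
    for j in range(len(seq) - L + 1):
        score = sum(W[seq[j + i]][i] for i in range(L))
        if score > best_score:
            best_j, best_score = j, score
    return best_j, seq[best_j:best_j + L], best_score


W = {
    "a": [0.25, 0.0, 1.0],
    "c": [0.25, 1.0, 0.0],
    "g": [0.25, 0.0, 0.0],
    "t": [0.25, 0.0, 0.0],
}
print(best_occurrence(W, "ggtcaag", 3))  # (2, 'tca', 2.25)
```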
Algorithm for Two Groups (no indels)
• We follow the basic steps of the EM method
• Modify the formula used to reconstruct W
• Re-estimate the matrix W from both groups B and G
Main idea
When the motif represented by the matrix W is too close to some L-mers from group G (P(i,j) > ave), we scoop the pattern out of the matrix by subtracting the corresponding matrix Wij
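One way to read the main idea is sketched below. It is not the authors' formula: the threshold `ave` is passed in because the slides do not say how it is computed, and clamping cells at zero after the subtraction is an added assumption so that W stays non-negative.

```python
ALPHABET = "acgt"


def reestimate_two_groups(bad, good, P_bad, P_good, L, ave):
    """Add W_ij for every L-mer of group B, then subtract W_ij for each
    L-mer of group G whose likelihood exceeds ave."""
    W = {b: [0.0] * L for b in ALPHABET}
    for seq, row in zip(bad, P_bad):
        for j, p in enumerate(row):
            for x, b in enumerate(seq[j:j + L]):
                W[b][x] += p
    for seq, row in zip(good, P_good):
        for j, p in enumerate(row):
            if p > ave:  # this good L-mer is too close to the motif
                for x, b in enumerate(seq[j:j + L]):
                    W[b][x] = max(0.0, W[b][x] - p)  # assumption: clamp at 0
    return W


W = reestimate_two_groups(
    bad=["acg"], good=["acc", "ttt"],
    P_bad=[[1.0]], P_good=[[0.4], [0.1]],
    L=3, ave=0.25,
)
for b in ALPHABET:
    print(b, W[b])
```

In the toy call, the good L-mer `acc` exceeds the threshold and lowers the weights of `a` and `c` in the first two columns, while `ttt` is below the threshold and leaves W untouched.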
Experiment Results (Single Group)
• Input:
(1) Randomly generate sequences: n = 20, m = 600
(2) Insert the motif into the sequences:
-- choose a center string s (length L)
-- mutate d positions (insertion, deletion, mutation)
-- implant the mutated copy into the sequences
• Output: use our program to find the implanted pattern
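A test instance of this kind could be generated roughly as follows. The values `L=15` and `d=2`, the uniform choice among the three edit types, and overwriting background letters so that each sequence keeps length m are all assumptions for the sketch; a substitution may also redraw the same letter.

```python
import random

ALPHABET = "acgt"


def random_seq(length, rng):
    """Uniformly random string over the DNA alphabet."""
    return "".join(rng.choice(ALPHABET) for _ in range(length))


def mutate(center, d, rng):
    """Apply d random edits (substitution, insertion or deletion)."""
    s = list(center)
    for _ in range(d):
        op = rng.choice(["substitute", "insert", "delete"])
        if op == "substitute" and s:
            s[rng.randrange(len(s))] = rng.choice(ALPHABET)
        elif op == "insert":
            s.insert(rng.randrange(len(s) + 1), rng.choice(ALPHABET))
        elif op == "delete" and len(s) > 1:
            del s[rng.randrange(len(s))]
    return "".join(s)


def make_instance(n=20, m=600, L=15, d=2, seed=0):
    """Generate n random sequences of length m, each with one implanted,
    mutated copy of a common center string."""
    rng = random.Random(seed)
    center = random_seq(L, rng)
    seqs, positions = [], []
    for _ in range(n):
        background = random_seq(m, rng)
        copy = mutate(center, d, rng)
        pos = rng.randrange(m - len(copy) + 1)
        # Overwrite background letters so the sequence keeps length m.
        seqs.append(background[:pos] + copy + background[pos + len(copy):])
        positions.append(pos)
    return center, seqs, positions


center, seqs, positions = make_instance()
print(center, positions[:3])
```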
Experiment Results (Single Group)
• Table 1: 15 sequences with no indel, 5 sequences with one deletion
• Table 2: 10 sequences with no indel, 5 sequences with one deletion, 5 sequences with one insertion
In Table 2, the running time increases significantly, and the accuracy in many cases is slightly worse than in Table 1
Experiment Results (Single Group)
• Table 3: 5 sequences with one deletion, 5 sequences with two deletions, 10 sequences with no indel
• Table 4: 5 sequences with one insertion, 5 sequences with two insertions, 10 sequences with no indel
The results in Table 4 are slightly better than those in Table 3. The reason might be that the case in Table 4 requires inserting two columns into the matrix for the motif, whereas the case in Table 3 requires inserting two spaces into the motif sequences
Experiment Results (Single Group)
• Table 5, the mixed case, with probabilities:
one insertion: 1/8
one deletion: 1/8
two insertions: 1/8
two deletions: 1/8
one insertion and one deletion: 1/8
no indel: 3/8
Experiment Results (Two Groups)
• Center (m=600):
c1: the center for group B, random sequence
c2: the center for group G, randomly mutate
200 positions from c1
• Generate two groups
n=10
Randomly mutate 200 positions from the center
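The two-group test data could be generated along these lines. The sketch assumes n = 10 sequences per group, that the mutated positions are distinct, and that each mutation changes the letter to a different one; under those assumptions the Hamming distance between the two centers is exactly the number of mutated positions.

```python
import random

ALPHABET = "acgt"


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def mutate_positions(center, count, rng):
    """Change `count` distinct positions, each to a different letter."""
    s = list(center)
    for pos in rng.sample(range(len(s)), count):
        s[pos] = rng.choice([b for b in ALPHABET if b != s[pos]])
    return "".join(s)


def make_two_groups(m=600, n=10, center_mut=200, member_mut=200, seed=0):
    """Build centers c1 and c2, then n mutated copies of each."""
    rng = random.Random(seed)
    c1 = "".join(rng.choice(ALPHABET) for _ in range(m))
    c2 = mutate_positions(c1, center_mut, rng)
    B = [mutate_positions(c1, member_mut, rng) for _ in range(n)]
    G = [mutate_positions(c2, member_mut, rng) for _ in range(n)]
    return c1, c2, B, G


c1, c2, B, G = make_two_groups()
print(hamming(c1, c2), len(B), len(G))
```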
Experiment Results (Two Groups)
Table 6 shows the results when the average Hamming distance between c1 and c2 is about 128
Table 7 shows the results when the average Hamming distance between c1 and c2 is about 175
From Table 6, we can see that it is easy to find a motif that can distinguish the two groups when L is large
Comparing Table 7 with Table 6, we can see that it is easier to find a distinguishing motif when the distance between the two centers is large
Summary
• An algorithm for the single group problem that can handle indels
• An algorithm for the two groups problem