Randomized Algorithms for Motif Finding [1] Ch 12.2
-
Upload
oliver-cameron -
Category
Documents
-
view
27 -
download
2
description
Transcript of Randomized Algorithms for Motif Finding [1] Ch 12.2
![Page 1: Randomized Algorithms for Motif Finding [1] Ch 12.2](https://reader036.fdocuments.in/reader036/viewer/2022062517/568133e4550346895d9ad754/html5/thumbnails/1.jpg)
Randomized Algorithmsfor Motif Finding
[1] Ch 12.2
![Page 2: Randomized Algorithms for Motif Finding [1] Ch 12.2](https://reader036.fdocuments.in/reader036/viewer/2022062517/568133e4550346895d9ad754/html5/thumbnails/2.jpg)
cctgatagacgctatctggctatccaGgtacTtaggtcctctgtgcgaatctatgcgtttccaaccat
agtactggtgtacatttgatCcAtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc
aaacgtTAgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt
agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtCcAtataca
ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaCcgtacgGc
l = 8
t=5
s1 = 26 s2 = 21 s3= 3 s4 = 56 s5 = 60 s
DNA
n = 69
![Page 3: Randomized Algorithms for Motif Finding [1] Ch 12.2](https://reader036.fdocuments.in/reader036/viewer/2022062517/568133e4550346895d9ad754/html5/thumbnails/3.jpg)
Profile P
a G g t a c T tC c A t a c g t
a c g t T A g ta c g t C c A tC c g t a c g G
_________________ A 3 0 1 0 3 1 1 0
C 2 4 0 0 1 4 0 0 G 0 1 4 0 0 0 3 1 T 0 0 0 5 1 0 1 4
_________________
Consensus a c g t a c g t Score = 3+4+4+5+3+4+3+4=30
l
t
![Page 4: Randomized Algorithms for Motif Finding [1] Ch 12.2](https://reader036.fdocuments.in/reader036/viewer/2022062517/568133e4550346895d9ad754/html5/thumbnails/4.jpg)
When P is Given
P by count:
A 4 7 3 0 1 0
C 1 0 4 5 3 0
T 1 1 0 0 2 7
G 2 0 1 3 4 1
A 4/8 7/8 3/8 0 1/8 0
C 1/8 0 4/8 5/8 3/8 0
T 1/8 1/8 0 0 2/8 7/8
G 2/8 0 1/8 3/8 2/8 1/8
P by probability:
Score: 4+7+4+5+3+7=30
![Page 5: Randomized Algorithms for Motif Finding [1] Ch 12.2](https://reader036.fdocuments.in/reader036/viewer/2022062517/568133e4550346895d9ad754/html5/thumbnails/5.jpg)
Scoring a String with P
A 4/8 7/8 3/8 0 1/8 0
C 1/8 0 4/8 5/8 3/8 0
T 1/8 1/8 0 0 2/8 7/8
G 2/8 0 1/8 3/8 2/8 1/8
The probability of a possible consensus string:
Prob(aaacct|P) = 1/2 x 7/8 x 3/8 x 5/8 x 3/8 x 7/8 = .033646
Prob(atacag|P) = 1/2 x 1/8 x 3/8 x 5/8 x 1/8 x 1/8 = .001602
![Page 6: Randomized Algorithms for Motif Finding [1] Ch 12.2](https://reader036.fdocuments.in/reader036/viewer/2022062517/568133e4550346895d9ad754/html5/thumbnails/6.jpg)
P-Most Probable l-mer
A 4/8 7/8 3/8 0 1/8 0
C 1/8 0 4/8 5/8 3/8 0
T 1/8 1/8 0 0 2/8 7/8
G 2/8 0 1/8 3/8 2/8 1/8
P =
Given P and a sequence ctataaaccttacatc, find the P-most probable l-mer
![Page 7: Randomized Algorithms for Motif Finding [1] Ch 12.2](https://reader036.fdocuments.in/reader036/viewer/2022062517/568133e4550346895d9ad754/html5/thumbnails/7.jpg)
Third try: c t a t a a a c c t t a c a t c
Second try: c t a t a a a c c t t a c a t c
First try: c t a t a a a c c t t a c a t c
P-Most Probable l-merA 4/8 7/8 3/8 0 1/8 0
C 1/8 0 4/8 5/8 3/8 0
T 1/8 1/8 0 0 4/8 7/8
G 2/8 0 1/8 3/8 4/8 1/8
Find the Prob(a|P) of every possible 6-mer:
-Continue this process to evaluate every possible 6-mer
![Page 8: Randomized Algorithms for Motif Finding [1] Ch 12.2](https://reader036.fdocuments.in/reader036/viewer/2022062517/568133e4550346895d9ad754/html5/thumbnails/8.jpg)
P-Most Probable l-mer
String, Highlighted in Red Calculations Prob(a|P)
ctataaaccttacat 1/8 x 1/8 x 3/8 x 0 x 1/8 x 0 0
ctataaaccttacat 1/2 x 7/8 x 0 x 0 x 1/8 x 0 0
ctataaaccttacat 1/2 x 1/8 x 3/8 x 0 x 1/8 x 0 0
ctataaaccttacat 1/8 x 7/8 x 3/8 x 0 x 3/8 x 0 0
ctataaaccttacat 1/2 x 7/8 x 3/8 x 5/8 x 3/8 x 7/8 .0336
ctataaaccttacat 1/2 x 7/8 x 1/2 x 5/8 x 1/4 x 7/8 .0299
ctataaaccttacat 1/2 x 0 x 1/2 x 0 1/4 x 0 0
ctataaaccttacat 1/8 x 0 x 0 x 0 x 0 x 1/8 x 0 0
ctataaaccttacat 1/8 x 1/8 x 0 x 0 x 3/8 x 0 0
ctataaaccttacat 1/8 x 1/8 x 3/8 x 5/8 x 1/8 x 7/8 .0004
![Page 9: Randomized Algorithms for Motif Finding [1] Ch 12.2](https://reader036.fdocuments.in/reader036/viewer/2022062517/568133e4550346895d9ad754/html5/thumbnails/9.jpg)
P-Most Probable l-mers in All Sequences• Find the P-most probable l-
mer in each of the sequences. ctataaacgttacatc
atagcgattcgactg
cagcccagaaccct
cggtataccttacatc
tgcattcaatagctta
tatcctttccactcac
ctccaaatcctttaca
ggtcatcctttatcct
A 4/8 7/8 3/8 0 1/8 0
C 1/8 0 4/8 5/8 3/8 0
T 1/8 1/8 0 0 2/8 7/8
G 2/8 0 1/8 3/8 2/8 1/8
P=
![Page 10: Randomized Algorithms for Motif Finding [1] Ch 12.2](https://reader036.fdocuments.in/reader036/viewer/2022062517/568133e4550346895d9ad754/html5/thumbnails/10.jpg)
New Profile
ctataaacgttacat
atagcgattcgactg
cagcccagaaccctg
cggtgaaccttacat
tgcattcaatagctt
tgtcctttccactca
ctccaaatcctttac
ggtctacctttatcc
P-Most Probable l-mers form a new profile
1 a a a c g t
2 a t a g c g
3 a a c c c t
4 g a a c c t
5 a t a g c t
6 g a c c t T
7 a t c c t T
8 t a c c t T
A 5/8 5/8 4/8 0 0 0
C 0 0 4/8 6/8 4/8 0
T 1/8 3/8 0 0 3/8 7/8
G 2/8 0 0 2/8 1/8 1/8
![Page 11: Randomized Algorithms for Motif Finding [1] Ch 12.2](https://reader036.fdocuments.in/reader036/viewer/2022062517/568133e4550346895d9ad754/html5/thumbnails/11.jpg)
Compare New and Old Profiles1 a a a c g t
2 a t a g c g
3 a a c c c t
4 g a a c c t
5 a t a g c t
6 g a c c t T
7 a t c c t T
8 t a c c t T
A 5/8 5/8 4/8 0 0 0
C 0 0 4/8 6/8 4/8 0
T 1/8 3/8 0 0 3/8 7/8
G 4/8 0 0 2/8 1/8 1/8
A 4/8 7/8 3/8 0 1/8 0
C 1/8 0 4/8 5/8 3/8 0
T 1/8 1/8 0 0 2/8 7/8
G 2/8 0 1/8 3/8 2/8 1/8
New Score: 5+5+4+6+4+7=31 Old Score: 4+7+4+5+3+7=30
![Page 12: Randomized Algorithms for Motif Finding [1] Ch 12.2](https://reader036.fdocuments.in/reader036/viewer/2022062517/568133e4550346895d9ad754/html5/thumbnails/12.jpg)
Greedy Profile Motif Search
1. Select random starting positions s
2. Create profile P
3. Compute Score
4. Find P-most probable l-mers
5. Compute new profile based on new starting positions s
6. Compute Score
7. Repeat step 4-5 as long as Score increases
Cost: Polynomial for each round, and number of round can be controlled
![Page 13: Randomized Algorithms for Motif Finding [1] Ch 12.2](https://reader036.fdocuments.in/reader036/viewer/2022062517/568133e4550346895d9ad754/html5/thumbnails/13.jpg)
Performance• Since we choose starting positions randomly, there is little
chance that our guess will be close to an optimal motif, meaning it will take a very long time to find the optimal motif.
• It is unlikely that the random starting positions will lead us to the correct solution at all.
• In practice, this algorithm is run many times with the hope that random starting positions will be close to the optimum solution simply by chance.
![Page 14: Randomized Algorithms for Motif Finding [1] Ch 12.2](https://reader036.fdocuments.in/reader036/viewer/2022062517/568133e4550346895d9ad754/html5/thumbnails/14.jpg)
Gibbs Sampling
• Gibbs Sampling improves on one l-mer at a time
• Thus proceeds more slowly and cautiously
![Page 15: Randomized Algorithms for Motif Finding [1] Ch 12.2](https://reader036.fdocuments.in/reader036/viewer/2022062517/568133e4550346895d9ad754/html5/thumbnails/15.jpg)
Gibbs Sampling: an Example
Input:
t = 5 sequences, motif length l = 8
1. GTAAACAATATTTATAGC2. AAAATTTACCTCGCAAGG
3. CCGTACTGTCAAGCGTGG 4. TGAGTAAACGACGTCCCA
5. TACTTAACACCCTGTCAA
![Page 16: Randomized Algorithms for Motif Finding [1] Ch 12.2](https://reader036.fdocuments.in/reader036/viewer/2022062517/568133e4550346895d9ad754/html5/thumbnails/16.jpg)
Gibbs Sampling: an Example
1) Randomly choose starting positions, s=(s1,s2,s3,s4,s5) in the 5 sequences:
s1=7 GTAAACAATATTTATAGC
s2=11 AAAATTTACCTTAGAAGG
s3=9 CCGTACTGTCAAGCGTGG
s4=4 TGAGTAAACGACGTCCCA
s5=1 TACTTAACACCCTGTCAA
![Page 17: Randomized Algorithms for Motif Finding [1] Ch 12.2](https://reader036.fdocuments.in/reader036/viewer/2022062517/568133e4550346895d9ad754/html5/thumbnails/17.jpg)
Gibbs Sampling: an Example
2) Choose one of the sequences at random:Sequence 2: AAAATTTACCTTAGAAGG
s1=7 GTAAACAATATTTATAGC
s2=11 AAAATTTACCTTAGAAGG
s3=9 CCGTACTGTCAAGCGTGG
s4=4 TGAGTAAACGACGTCCCA
s5=1 TACTTAACACCCTGTCAA
![Page 18: Randomized Algorithms for Motif Finding [1] Ch 12.2](https://reader036.fdocuments.in/reader036/viewer/2022062517/568133e4550346895d9ad754/html5/thumbnails/18.jpg)
Gibbs Sampling: an Example
2) Choose one of the sequences at random:
Sequence 2: AAAATTTACCTTAGAAGG
s1=7 GTAAACAATATTTATAGC
s3=9 CCGTACTGTCAAGCGTGG
s4=4 TGAGTAAACGACGTCCCA
s5=1 TACTTAACACCCTGTCAA
![Page 19: Randomized Algorithms for Motif Finding [1] Ch 12.2](https://reader036.fdocuments.in/reader036/viewer/2022062517/568133e4550346895d9ad754/html5/thumbnails/19.jpg)
Gibbs Sampling: an Example3) Create profile P from l-mers in remaining 4 sequences:
1 A A T A T T T A
3 T C A A G C G T
4 G T A A A C G A
5 T A C T T A A C
A 1/4 2/4 2/4 3/4 1/4 1/4 1/4 2/4
C 0 1/4 1/4 0 0 2/4 0 1/4
T 2/4 1/4 1/4 1/4 2/4 1/4 1/4 1/4
G 1/4 0 0 0 1/4 0 3/4 0Consensus
StringT A A A T C G A
![Page 20: Randomized Algorithms for Motif Finding [1] Ch 12.2](https://reader036.fdocuments.in/reader036/viewer/2022062517/568133e4550346895d9ad754/html5/thumbnails/20.jpg)
Gibbs Sampling: an Example
4) Calculate the prob(a|P) for every possible 8-mer in the removed sequence
AAAATTTACCTTAGAAGG .000732AAAATTTACCTTAGAAGG .000122AAAATTTACCTTAGAAGG 0AAAATTTACCTTAGAAGG 0AAAATTTACCTTAGAAGG 0AAAATTTACCTTAGAAGG 0AAAATTTACCTTAGAAGG 0AAAATTTACCTTAGAAGG .000183AAAATTTACCTTAGAAGG 0AAAATTTACCTTAGAAGG 0AAAATTTACCTTAGAAGG 0
![Page 21: Randomized Algorithms for Motif Finding [1] Ch 12.2](https://reader036.fdocuments.in/reader036/viewer/2022062517/568133e4550346895d9ad754/html5/thumbnails/21.jpg)
Gibbs Sampling: an Example5) Create a distribution of probabilities of l-mers prob(a|P), and randomly select a new starting position based on this distribution.
Starting Position 1: prob( AAAATTTA | P ) = .000732 / .000122 = 6
Starting Position 2: prob( AAATTTAC | P ) = .000122 / .000122 = 1
Starting Position 8: prob( ACCTTAGA | P ) = .000183 / .000122 = 1.5
a) To create this distribution, divide each probability prob(a|P) by the lowest probability:
Ratio = 6 : 1 : 1.5
![Page 22: Randomized Algorithms for Motif Finding [1] Ch 12.2](https://reader036.fdocuments.in/reader036/viewer/2022062517/568133e4550346895d9ad754/html5/thumbnails/22.jpg)
Turning Ratios into Probabilities
Probability (Selecting Starting Position 1): 6/(6+1+1.5)= 0.706
Probability (Selecting Starting Position 2): 1/(6+1+1.5)= 0.118
Probability (Selecting Starting Position 8): 1.5/(6+1+1.5)=0.176
b) Define probabilities of starting positions according to computed ratios
![Page 23: Randomized Algorithms for Motif Finding [1] Ch 12.2](https://reader036.fdocuments.in/reader036/viewer/2022062517/568133e4550346895d9ad754/html5/thumbnails/23.jpg)
Gibbs Sampling: an Example
c) Select the start position according to computed ratios:
P(selecting starting position 1): .706
P(selecting starting position 2): .118
P(selecting starting position 8): .176
![Page 24: Randomized Algorithms for Motif Finding [1] Ch 12.2](https://reader036.fdocuments.in/reader036/viewer/2022062517/568133e4550346895d9ad754/html5/thumbnails/24.jpg)
Gibbs Sampling: an Example
Assume we select the substring with the highest probability – then we are left with the following new substrings and starting positions.
s1=7 GTAAACAATATTTATAGC
s2=1 AAAATTTACCTCGCAAGG
s3=9 CCGTACTGTCAAGCGTGG
s4=5 TGAGTAATCGACGTCCCA
s5=1 TACTTCACACCCTGTCAA
Earlier
s1=7 GTAAACAATATTTATAGC
s2=11 AAAATTTACCTTAGAAGG
s3=9 CCGTACTGTCAAGCGTGG
s4=4 TGAGTAAACGACGTCCCA
s5=1 TACTTAACACCCTGTCAA
![Page 25: Randomized Algorithms for Motif Finding [1] Ch 12.2](https://reader036.fdocuments.in/reader036/viewer/2022062517/568133e4550346895d9ad754/html5/thumbnails/25.jpg)
Gibbs Sampling: an Example
6) iterate the procedure again with the above starting positions until we can not improve the score any more.
![Page 26: Randomized Algorithms for Motif Finding [1] Ch 12.2](https://reader036.fdocuments.in/reader036/viewer/2022062517/568133e4550346895d9ad754/html5/thumbnails/26.jpg)
How Gibbs Sampling Works
1. Randomly choose starting positions s = (s1,...,st) 2. Randomly choose one of the t sequences3. Create a profile P from the other t -1 sequences4. For each position in the removed sequence, calculate the
probability that the l-mer starting at that position was generated by P.
5. Choose a new starting position for the removed sequence at random based on the probabilities calculated in step 4.
6. Repeat steps 2-5 until there is no improvement