Design of Optimal Multiple Spaced Seeds for Homology Search
Jinbo XuSchool of Computer Science, University
of WaterlooJoint work with D. Brown, M. Li and B.
Ma
Overview
Seed-based homology search Optimal multiple spaced seeds LP based randomized algorithm Experimental results Future work
Homology Search
Exhaustive search algorithm e.g. Smith-Waterman algorithm 100% sensitivity infeasible if the database is large
Suffix tree Seed-based algorithm, e.g. BLAST,
PatternHunter
Given: database of DNA sequences, query sequence QTask: extract all homologous sequences of Q from the database.
Region and Seed
S1: AGCTTGCCGTAAACCGS2: ACGTAGCACTGAGCTGRegion model: 1001011001010101
seed: 10010010111: a required match0: “don’t care”seed length M: length of the stringseed weight W: the number of 1 in the seed
Seed-based Hit
ACGCGTGGGAAACC
CAATGTGGGCAATT11011011
00001111101100
seed
region
Given a seed, a query sequence hits another sequence ifand only if the seed hits a region model of both sequences.
Query:
A seed S hits a region R at position i if and only ifR[i+j]=1 for every position j where s[j]=1
Single Seed Based Algorithm
Query: GGAAGCTTGCCGTATGCCATAGS1: CCAGGCTAGCCATAGGCCTTCT
Seed:101110111011011101
Length=18, weight=13
Query: GGAAGCTTGCCGTATGCCATAGS2: CCAGGCATGCAGTAGGCCTTCT
S1 hit, but S2 missed.
Multiple Seeds Based Algorithm
Query: GGAAGCTTGCCGTATGCCATAGS1: CCAGGCTAGCCATAGGCCTTCT
seed1:101110111011011101
Length=18, weight=13
Query: GGAAGCTTGCCGTATGCCATAGS2: CCAGGCATGCAGTAGGCCTTCT
seed2:101101110111011101
Both S1 and S2 are hit
Optimal Multiple Seeds (OMS) Problem
Given: random region R under certain distribution, two integers M and W, and an integer k.Find: set of k seeds with weight W and length no more than M to maximize the hit probability of R.
Mandala (J. Buhler et al.) Hill Climbing, good for small k, no result
reported for k>4 Greedy + Monte Carlo sampling
Greedy Algorithm (M. Li and B. Ma et al.) Given i seeds (i=1,2,…,k-1), search for the
next seed by maximizing the incremental sensitivity
Vector Seeds (B. Brejova et al.)
Related Work
Seed Specific OMS problem: Given a random region R, a set of m seeds , and an integer k, find a set of k seeds out of , to maximize the hit probability of R.
Seed-Region Specific OMS problem: Given a set of m seeds , an integer k and a set of
regions , find a set of k seeds, to maximize the hits of .
Variants of OMS
msss ,...,, 21
msss ,...,, 21
msss ,...,, 21
NRRR ,...,, 21
NRRR ,...,, 21
Main Results:1. Approximation ratio by a greedy algorithm
(D.S. Hochbaum)2. Same approximation ratio by linear programming based
randomized algorithm 3. is tight unless P=NP (U. Feige)
Given a ground set H and its subsets and an integer k, Find k sets out of to cover H as much as possible.
Maximum Coverage (MC) problem
632.01 1 e
mHHH ,...,, 21
11 e
mHHH ,...,, 21
OMS vs. MC ProblemOMS
Region Sampling
Seed Specific OMS
Seed-Region Specific OMS=MC Problem
Seed Enumeration
iS seedby hit regions ofset the:iH
Region Model
3-bits
000 001 010 011 100 101 110 111
p .1426
.0573
.1236
.0660
.0710
.0364
.2335
.2696
1. PH: length 64 and each bit independently set to 1 with probability 0.7 (B. Ma et al.)
2. M3: length 64 and each bit independently set to 1 with probability 0.8 if i%3=1 or 2, 0.5 otherwise (B. Brejova et al.)
3. M8: length 63 and each codon satisfy a certain distribution (B. Brejova et al.)
4. HMM: average length 90, two adjacent codons are not independent (B. Brejova et al.)
Observations
1. PH model: the hit probability of any seed with weight 11 and length 18 is at least 0.30
2. M3 model: the hit probability of any seed with weight 11 and length 18 is at least 0.27
3. HMM model: the hit probability of any seed with weight 11 and length 18 is at least 0.70
Variant of MC Problem
possible. asmuch as cover to
,...,, ofout sets find elements, ||least at contains
whichofeach ,,...,, subsets its and set ground aGiven
21
21
H
HHHkH
HHHH
m
m
Can we have a better approximation ratio?
If the sensitivity of each seed is at least and the optimal linear solution is , then the LP based randomized algorithm guarantees to generate a solution with approximation ratio at least
for the seed-region specific OMS problem.
Better Approximation Ratio
)1)(1()1( 1
)1()1( *
*
*
*
ee
lklkk
lklk
*l
Practical Approximation Ratio
)11( 02146.0
)(
)(
)(**
l
Ap
AP
Ap
the optimal seed set for the random region R the best seed set found by the LP based algorithm
*AA
with probability 0.99
If 5000 regions are sampled, then we have
Test Data
All-against-all comparison between mouse EST sequences and human EST sequences by Smith-Waterman algorithm
3346700 pairs found with local alignment score no less than 16
score
16-20
20-30
30-40
40-50
50-60
60-70
70-80
80-90
>90
ratio 93 6.3 0.26 0.06 0.06 0.02 0.01 0.02 0.28
label
1 2 3 4 5 6 7 8 9
Top Related