Bioinformatics PhD. Course
Summary (approximate)
• 1. Biological introduction
• 2. Comparison of short sequences (<10,000 bp)
• 3. Comparison of large sequences (up to 250,000,000 bp)
• 4. Sequence assembly
• 5. Efficient data search structures and algorithms
• 6. Proteins...
2. Comparison of short sequences (<10,000 bp)
Summary (more or less)
• 2.1 Dot matrix
• 2.2 Pairwise alignment
• 2.3 Hash algorithms
• 2.4 Multiple alignment
2.1 Dot matrix
Given two sequences, how can we analyse their degree of identity?
By searching for the parts that match:
[Figure: dot matrix with S1 along the x axis and S2 along the y axis. Cell (x, y) holds 1 if both characters coincide, 0 otherwise.]
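The matrix described above can be sketched in a few lines of Python. This is a minimal illustration, not the course's own code: the names `dot_matrix` and `render` are hypothetical, and indices are 0-based with S1 along the x axis (columns) and S2 along the y axis (rows).

```python
def dot_matrix(s1, s2):
    """Return m where m[y][x] is 1 if s1[x] == s2[y], else 0."""
    return [[1 if a == b else 0 for a in s1] for b in s2]


def render(matrix):
    """Draw the matrix as text: '*' for a match, '.' for a mismatch."""
    return "\n".join("".join("*" if cell else "." for cell in row) for row in matrix)


m = dot_matrix("acca", "acc")
print(render(m))
```

Runs of matches along a diagonal show up as diagonal lines of `*`, which is what makes the plot readable by eye.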
2.1 Dot matrix
What is the cost of the algorithm?
When are the matchings relevant?
accaccacaccacaacgagcata … acctgagcgatat
acc..t
L = window length
• m(i,j) = 1 iff S1(i..i+L-1) = S2(j..j+L-1): exact matching.
• m(i,j) = 1 iff at least k of the L characters coincide: approximate matching.
• m(i,j) = k iff exactly k of the L characters coincide: approximate matching with a score.
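The three window definitions above could be implemented along the following lines. This is a sketch under stated assumptions: the function names are hypothetical, indices are 0-based, a window covers exactly L characters, and the thresholded variant is read as "at least k of the L characters coincide".

```python
def window_matches(s1, s2, i, j, L):
    """Count positions at which s1[i:i+L] and s2[j:j+L] coincide."""
    return sum(1 for a, b in zip(s1[i:i + L], s2[j:j + L]) if a == b)


def windowed_dot_matrix(s1, s2, L, k=None, score=False):
    """Windowed dot matrix over all window starts (i, j), 0-based.

    score=True -> m[i][j] = number of coinciding characters
    k=None     -> m[i][j] = 1 iff all L characters coincide (exact matching)
    k given    -> m[i][j] = 1 iff at least k characters coincide (assumption)
    """
    threshold = L if k is None else k
    rows = len(s1) - L + 1
    cols = len(s2) - L + 1
    matrix = []
    for i in range(rows):
        row = []
        for j in range(cols):
            c = window_matches(s1, s2, i, j, L)
            row.append(c if score else int(c >= threshold))
        matrix.append(row)
    return matrix


print(windowed_dot_matrix("accacc", "acctga", L=3))             # exact matching
print(windowed_dot_matrix("accacc", "acctga", L=3, k=2))        # at least 2 of 3
print(windowed_dot_matrix("accacc", "acctga", L=3, score=True)) # match counts
```

Each cell compares L character pairs, so this direct version performs about length(S1) · length(S2) · L comparisons.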
2.1. Dot matrix: algorithm cost
accaccacaccacaacgagcata … acctgagcgatat
acc..t
• length(S1) · length(S2) · L, in other words $O(n^2 L)$ for sequences of length n.
• Can the cost be reduced to length(S1) · length(S2)? That is, can it be $O(n^2)$, independent of L?
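One way to see that a cost independent of L is achievable: along a diagonal, the window starting at (i+1, j+1) differs from the window at (i, j) by a single pair of characters leaving and a single pair entering, so the match count can be updated in constant time. The sketch below assumes 0-based indices and windows of exactly L characters; `window_scores_fast` is a hypothetical name.

```python
def window_scores_fast(s1, s2, L):
    """Score every window start (i, j) in O(len(s1) * len(s2)) time."""
    n, m = len(s1), len(s2)
    rows, cols = n - L + 1, m - L + 1
    if rows <= 0 or cols <= 0:
        return []
    scores = [[0] * cols for _ in range(rows)]
    # Every diagonal starts in the first row or in the first column.
    starts = [(0, j) for j in range(cols)] + [(i, 0) for i in range(1, rows)]
    for i, j in starts:
        # Score of the first window on this diagonal, computed directly.
        count = sum(s1[i + t] == s2[j + t] for t in range(L))
        scores[i][j] = count
        while i + 1 < rows and j + 1 < cols:
            count -= s1[i] == s2[j]          # character pair leaving the window
            count += s1[i + L] == s2[j + L]  # character pair entering the window
            i, j = i + 1, j + 1
            scores[i][j] = count
    return scores


print(window_scores_fast("accacc", "acctga", 3))
```

Only the first window of each diagonal costs L comparisons. There are fewer than length(S1) + length(S2) diagonals and L is at most the shorter length, so that part never exceeds the order of length(S1) · length(S2).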
2.1. Dot matrix: signals
[Figure: three dot plots. A: transposons; B: S1 = S2; C: random]
When are signals statistically significant?
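2.1. Dot matrix: statistical significance
We need to define a random model against which to compare the signals.
Given a window of length L starting at position x in S1 and position y in S2, we define the random variable X as the number of characters that coincide within the window. Then
$P(X = k) = \binom{L}{k} p^k (1-p)^{L-k}$
where p is the probability that two characters coincide at a single position.
What is the expected value of X?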
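Under the binomial model stated above, the expected value is the standard binomial mean, L · p. The sketch below checks this numerically and also computes the chance of seeing at least k matches in a random window, which is one way to judge significance. The function names are hypothetical, and p = 0.25 is an assumption (four equiprobable, independent nucleotides), not a value given in the course.

```python
from math import comb


def window_match_pmf(k, L, p):
    """P(X = k): probability that exactly k of L positions coincide."""
    return comb(L, k) * p**k * (1 - p) ** (L - k)


def window_match_tail(k, L, p):
    """P(X >= k): chance that a random window scores at least k."""
    return sum(window_match_pmf(x, L, p) for x in range(k, L + 1))


L, p = 10, 0.25  # assumption: 4 equiprobable, independent nucleotides
expected = sum(k * window_match_pmf(k, L, p) for k in range(L + 1))
print(expected)                    # numerically close to L * p
print(window_match_tail(8, L, p))  # probability of 8 or more matches by chance
```

A window score whose tail probability is very small under this model is unlikely to be a chance signal, although with many windows per plot some low-probability scores are still expected.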