Sequential Data Mining
Jianlin Cheng
Department of Computer Science
University of Missouri, Columbia
2011
Sequential Data
• Sequence data: a series of letters (symbols from an alphabet) or numbers
• Spatial sequence data: biological sequence, text document, a computer program, web page, a missile trajectory
• Temporal sequence data: speech, movie clips, stock prices, a series of web clicks
Key Issue
• Measure the similarity between two sequences, i.e. sequence comparison
• Sequence database search / retrieval
• Sequence classification
• Sequence clustering
• Sequence pattern recognition
• Sequence similarity definition and calculation is a challenge compared to vector-based data
Sequence Alignment
• Use biological sequences as an example
• Optimal pairwise sequence alignment algorithm
• Fast, heuristic sequence alignment (BLAST)
• Statistical significance of sequence alignment score
Sequence Data
• DNA Sequence (gene)
gaagagagcttcaggtttggggaagagaca
acaactcccgctcagaagcaggggccgata
• RNA Sequence
caaaucacuacacacaggguagaagguggaacgcacaggagcaugucaacggggugc
• Protein Sequence
RELQVWGRDNNSLSEAGANRQGDVSFNLPQITLWQRPLVTIKIGGQLKEALLDTGADDTVLEDIDLPGKWKP
Global Pairwise Sequence Alignment
ITAKPAKTTSPKEQAIGLSVTFLSFLLPAGVLYHL
ITAKPQWLKTSESVTFLSFLLPQTQGLYHL
Global Pairwise Sequence Alignment
ITAKPAKT-TSPKEQAIGLSVTFLSFLLPAG-VLYHL
ITAKPQWLKTSE-------SVTFLSFLLPQTQGLYHL
Alignment (similarity) score
ITAKPAKTTSPKEQAIGLSVTFLSFLLPAGVLYHL
ITAKPQWLKTSESVTFLSFLLPQTQGLYHL
Three Main Issues
1. Definition of alignment score
2. Algorithms for finding the optimal alignment
3. Evaluation of significance of alignment score
A Simple Scoring Scheme
• Score of a character pair: S(match) = 1, S(not_match) = -1, S(gap-char) = -1
• Score of an alignment = $\sum_{i=1}^{n} S_i$, where $S_i$ is the score of the i-th aligned pair
ITAKPAKTPTSPKEQAIGLSVTFLSFLLPAGWVLYHL
ITAKPQWLKSTE-------SVTFLSFLLPQTQGLYHL
5 - 7 - 7 + 10 - 4 + 4 = 1
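The scoring scheme above can be sketched in a few lines of Python. This is a minimal illustration, assuming the two rows are already aligned and use `-` for gaps; the function name `score_alignment` is hypothetical.

```python
def score_alignment(row_p, row_q, match=1, mismatch=-1, gap=-1):
    """Sum the per-column scores of two gapped rows of equal length."""
    if len(row_p) != len(row_q):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row_p, row_q):
        if a == "-" or b == "-":
            total += gap        # character aligned to a gap
        elif a == b:
            total += match      # identical characters
        else:
            total += mismatch   # different characters
    return total


row_p = "ITAKPAKTPTSPKEQAIGLSVTFLSFLLPAGWVLYHL"
row_q = "ITAKPQWLKSTE-------SVTFLSFLLPQTQGLYHL"
print(score_alignment(row_p, row_q))  # 5 - 7 - 7 + 10 - 4 + 4 = 1
```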
Optimization Problem
• How many possible alignments exist for two sequences with length m and n?
• How can we find the best alignment to maximize alignment score?
Possible Alignments
Sequences: ACTG and ATGGATG
Shortest
ACTG---
ATGGATG
Intermediate
AC-TG---
ATG-GATG
Longest
ACTG-------
----ATGGATG
Total Number of Possible Alignments
[Figure: two example sequences merged into one alignment; an alignment of sequences with lengths m and n has at most m + n positions]
Total Number of Alignments
Select m positions out of m+n possible positions:
$\binom{m+n}{m} = \frac{(m+n)!}{m!\,n!}$
Exponential!
If m = 300, n = 300, total ≈ $10^{179}$
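The size of this count is easy to check with exact integer arithmetic. The sketch below assumes the binomial formula from this slide; `count_alignments` is a hypothetical helper name, and the order of magnitude is read off from the number of decimal digits.

```python
import math


def count_alignments(m, n):
    """Number of ways to select m positions out of m + n."""
    return math.comb(m + n, m)


total = count_alignments(300, 300)
# Order of magnitude = number of decimal digits minus one.
print(f"about 10^{len(str(total)) - 1}")
```

Even for two short sequences the count is far too large to enumerate, which is the motivation for dynamic programming.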
Needleman and Wunsch Algorithm
• Given sequences P and Q, we use a matrix M to record the optimal alignment scores of all prefixes of P and Q. M[i,j] is the best alignment score for the prefixes P[1..i] and Q[1..j].
• M[i,j] = max [ M[i-1,j-1] + S(P[i],Q[j]), M[i,j-1] + S(-, Q[j]), M[i-1,j] + S(P[i], -) ]
Dynamic Programming
Dynamic Programming Algorithm
Three-Step Algorithm:
• Initialization
• Matrix filling (scoring)
• Trace back (alignment)
1. Initialization of Matrix M
Rows (i): _ A G A T C A G A A A T G
Columns (j): _ A T A G A A T
First row M[0,j]: 0 -1 -2 -3 -4 -5 -6 -7
First column M[i,0]: 0 -1 -2 -3 -4 -5 -6 -7 -8 -9 -10 -11 -12
2. Fill Matrix
M[i,j] = max [ M[i-1,j-1] + S(P[i],Q[j]), M[i,j-1] + S(-, Q[j]), M[i-1,j] + S(P[i], -) ]
Rows (i): _ A G A T C A G A A A T G; columns (j): _ A T A G A A T
First row M[0,j]: 0 -1 -2 -3 -4 -5 -6 -7
First column M[i,0]: 0 -1 -2 -3 -4 -5 -6 -7 -8 -9 -10 -11 -12
Column j = 1 (A), rows i = 1..12: 1 0 -1 -2 -3 -4 -5 -6 -7 -8 -9 -10
2. Fill Matrix
M[i,j] = max [ M[i-1,j-1] + S(P[i],Q[j]), M[i,j-1] + S(-, Q[j]), M[i-1,j] + S(P[i], -) ]
Rows (i): _ A G A T C A G A A A T G; columns (j): _ A T A G A A T
Column j = 1 (A), rows i = 1..12: 1 0 -1 -2 -3 -4 -5 -6 -7 -8 -9 -10
Column j = 2 (T), rows i = 1..12: 0 0 -1 0 -1 -2 -3 -4 -5 -6 -7 -8
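2. Fill Matrix
M[i,j] = max [ M[i-1,j-1] + S(P[i],Q[j]), M[i,j-1] + S(-, Q[j]), M[i-1,j] + S(P[i], -) ]

|   | _ | A | T | A | G | A | A | T |
|---|---|---|---|---|---|---|---|---|
| _ | 0 | -1 | -2 | -3 | -4 | -5 | -6 | -7 |
| A | -1 | 1 | 0 | -1 | -2 | -3 | -4 | -5 |
| G | -2 | 0 | 0 | -1 | 0 | -1 | -2 | -3 |
| A | -3 | -1 | -1 | 1 | 0 | 1 | 0 | -1 |
| T | -4 | -2 | 0 | 0 | 0 | 0 | 0 | 1 |
| C | -5 | -3 | -1 | -1 | -1 | -1 | -1 | 0 |
| A | -6 | -4 | -2 | 0 | -1 | 0 | 0 | -1 |
| G | -7 | -5 | -3 | -1 | 1 | 0 | -1 | -1 |
| A | -8 | -6 | -4 | -2 | 0 | 2 | 1 | 0 |
| A | -9 | -7 | -5 | -3 | -1 | 1 | 3 | 2 |
| A | -10 | -8 | -6 | -4 | -2 | 0 | 2 | 2 |
| T | -11 | -9 | -7 | -5 | -3 | -1 | 1 | 3 |
| G | -12 | -10 | -8 | -6 | -4 | -2 | 0 | 2 |
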
3. Trace Back
Starting from M[12,7] = 2 in the filled matrix and tracing the selected path back to M[0,0] gives the optimal alignment:
AGATCAGAAATG
--AT-AG-AAT-
Pairwise Alignment Algorithm Using Dynamic Programming
• Initialization: Given two sequences with length m and n, create an (m+1)×(n+1) matrix M. Initialize the first row and first column according to the scoring matrix.
• For j in 1..n (column)
    for i in 1..m (row)
      M[i,j] = max( M[i-1,j-1] + S(i,j), M[i,j-1] + S(-,j), M[i-1,j] + S(i,-) )
      Record the selected path toward (i,j)
• Report alignment score M[m,n] and trace back to M[0,0] to generate the optimal alignment.
What is the time complexity of this algorithm?
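The three steps can be sketched in Python as follows. This is a minimal illustration, assuming the simple scoring scheme from earlier (match 1, mismatch -1, gap -1); the function name is hypothetical, and instead of storing the selected path it re-derives each trace-back step from the scores.

```python
def needleman_wunsch(p, q, match=1, mismatch=-1, gap=-1):
    """Global alignment of p and q; returns (score, aligned_p, aligned_q)."""
    m, n = len(p), len(q)
    # M[i][j] is the best score for aligning the prefixes p[:i] and q[:j].
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        M[i][0] = i * gap
    for j in range(1, n + 1):
        M[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if p[i - 1] == q[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1] + s, M[i][j - 1] + gap, M[i - 1][j] + gap)
    # Trace back from M[m][n] to M[0][0].
    row_p, row_q = [], []
    i, j = m, n
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and p[i - 1] == q[j - 1] else mismatch
        if i > 0 and j > 0 and M[i][j] == M[i - 1][j - 1] + s:
            row_p.append(p[i - 1])
            row_q.append(q[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and M[i][j] == M[i - 1][j] + gap:
            row_p.append(p[i - 1])
            row_q.append("-")
            i -= 1
        else:
            row_p.append("-")
            row_q.append(q[j - 1])
            j -= 1
    return M[m][n], "".join(reversed(row_p)), "".join(reversed(row_q))


score, aligned_p, aligned_q = needleman_wunsch("AGATCAGAAATG", "ATAGAAT")
print(score)
print(aligned_p)
print(aligned_q)
```

Each of the (m+1)×(n+1) cells is filled in constant time, so filling the matrix takes O(mn) time. When several alignments share the optimal score, the one reported depends on the order in which the three moves are tested.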
Local Sequence Alignment Using DP
• Biological sequences often only have local similarity.
• During evolution, only functionally and structurally important regions are highly conserved.
• Global alignment sacrifices the local similarity to maximize the global alignment score.
• We need an alignment method that identifies the locally similar regions regardless of other dissimilar regions.
Transcription binding site
Local Alignment Algorithm
Goal: find an alignment of the substrings of P and Q with maximum alignment score.
Naïve Algorithm:
(m+1)*m/2 substrings of P, (n+1)*n/2 substrings of Q
Using DP for each substring pair: $m^2 n^2 \times O(mn) = O(m^3 n^3)$
Smith-Waterman Algorithm
Same dynamic programming algorithm as global alignment except for three differences.
1. All negative scores are converted to 0 (why?)
2. Alignment can start from anywhere in the matrix
3. Alignment can end anywhere in the matrix
Local Alignment Algorithm
• Initialization: Given two sequences with length m and n, create an (m+1)×(n+1) matrix M. Initialize the first row and first column to 0s.
• For j in 1..n (column)
    for i in 1..m (row)
      M[i,j] = max( 0, M[i-1,j-1] + S(i,j), M[i,j-1] + S(-,j), M[i-1,j] + S(i,-) )
      Record the selected path.
• Find elements in matrix M with maximum values. Trace back till 0 and report the alignment corresponding to the path.
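A compact Python sketch of the local alignment procedure follows. It assumes the same simple scoring (match 1, mismatch -1, gap -1), uses a hypothetical function name, and reports only the first maximum found in row order; a fuller version would trace back from every maximal cell.

```python
def smith_waterman(p, q, match=1, mismatch=-1, gap=-1):
    """Local alignment of p and q; returns (best_score, aligned_p, aligned_q)."""
    m, n = len(p), len(q)
    M = [[0] * (n + 1) for _ in range(m + 1)]  # first row and column stay 0
    best, best_i, best_j = 0, 0, 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if p[i - 1] == q[j - 1] else mismatch
            # Negative scores are replaced by 0 so an alignment can restart anywhere.
            M[i][j] = max(0, M[i - 1][j - 1] + s, M[i][j - 1] + gap, M[i - 1][j] + gap)
            if M[i][j] > best:
                best, best_i, best_j = M[i][j], i, j
    # Trace back from the best cell until a 0 is reached.
    row_p, row_q = [], []
    i, j = best_i, best_j
    while i > 0 and j > 0 and M[i][j] > 0:
        s = match if p[i - 1] == q[j - 1] else mismatch
        if M[i][j] == M[i - 1][j - 1] + s:
            row_p.append(p[i - 1])
            row_q.append(q[j - 1])
            i, j = i - 1, j - 1
        elif M[i][j] == M[i - 1][j] + gap:
            row_p.append(p[i - 1])
            row_q.append("-")
            i -= 1
        else:
            row_p.append("-")
            row_q.append(q[j - 1])
            j -= 1
    return best, "".join(reversed(row_p)), "".join(reversed(row_q))


print(smith_waterman("AGATCAGAAATG", "ATAGAAT"))
```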
1. Initialization
Rows (i): _ A G A T C A G A A A T G
Columns (j): _ A T A G A A T
First row M[0,j] and first column M[i,0]: all 0
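2. Fill the Matrix

|   | _ | A | T | A | G | A | A | T |
|---|---|---|---|---|---|---|---|---|
| _ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| A | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 |
| G | 0 | 0 | 0 | 0 | 2 | 1 | 0 | 0 |
| A | 0 | 1 | 0 | 1 | 1 | 3 | 2 | 1 |
| T | 0 | 0 | 2 | 1 | 0 | 2 | 2 | 3 |
| C | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 2 |
| A | 0 | 1 | 0 | 2 | 1 | 1 | 2 | 1 |
| G | 0 | 0 | 0 | 1 | 3 | 2 | 1 | 1 |
| A | 0 | 1 | 0 | 1 | 2 | 4 | 3 | 2 |
| A | 0 | 1 | 0 | 1 | 1 | 3 | 5 | 4 |
| A | 0 | 1 | 0 | 1 | 0 | 2 | 4 | 4 |
| T | 0 | 0 | 2 | 1 | 0 | 1 | 3 | 5 |
| G | 0 | 0 | 1 | 1 | 2 | 1 | 2 | 4 |
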
3. Trace Back
In the filled matrix, the maximum value 5 occurs at M[9,6] and M[11,7]. Trace back from each of these cells until a 0 is reached.
Two Best Local Alignments
Local Alignment 1:
ATCAGAA
AT-AGAA
Local Alignment 2:
ATCAGAAAT
AT-AGA-AT
Scoring Matrix
• How to accurately measure the similarity between amino acids (or nucleotides) is one key issue of sequence alignment.
• For nucleotides, a simple identical / not identical scheme is mostly ok.
• Due to various properties of amino acids, it is hard and also critical to measure the similarity between amino acids.
Evolutionary Substitution Approach
• During evolution, the substitution of similar amino acids is more likely to be selected within a protein family of similar sequences than random substitutions (M. Dayhoff)
• The frequency / probability that one residue substitutes another one is an indicator of their similarity.
PAM Scoring Matrices (M. Dayhoff)
• Select a number of protein families.
• Align sequences in each family and count the frequency of amino acid substitution in each column: Pij.
• Similarity score is the logarithm of the ratio of the observed substitution probability over the random substitution probability. S(i,j) = log(Pij / (Pi * Pj)). Pi is the observed probability of residue i and Pj is the observed probability of residue j.
• PAM: Point Accepted Mutation
• Let data tell strategy! - NBA
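A Simplified Example
Aligned sequences (the tables below describe the fourth column):
ACGTCGAGT
ACCACGTGT
CACACTACT
ACCGCATGA
ACCCTATCT
TCCGTAACA
ACCATAAGT
AGCATAAGT
ACTATAAGT
ACGATAAGT

| Chars | Prob. |
|---|---|
| A | 6 / 10 |
| C | 1 / 10 |
| G | 2 / 10 |
| T | 1 / 10 |

Substitution Frequency Table

|   | A | C | G | T |
|---|---|---|---|---|
| A | 30 | 6 | 12 | 6 |
| C | 6 | 0 | 2 | 1 |
| G | 12 | 2 | 2 | 2 |
| T | 6 | 1 | 2 | 0 |

Total number of substitutions: 90
Substitution probabilities (count / 90):

|   | A | C | G | T |
|---|---|---|---|---|
| A | .33 | .07 | .13 | .07 |
| C | .07 | 0 | .02 | .01 |
| G | .13 | .02 | .02 | .02 |
| T | .07 | .01 | .02 | 0 |

P(A<->C) = 0.07 + 0.07 = 0.14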
A Simplified Example (continued)
P(A<->C) = 0.07 + 0.07 = 0.14
S(A,C) = log(0.14 / (0.6 * 0.1)) ≈ 0.37 (base-10 logarithm)
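The counting behind this example can be reproduced with a short Python sketch. It assumes the counts come from a single alignment column containing 6 A, 1 C, 2 G and 1 T (the fourth column when the ten sequences are written as rows of nine characters, which is an inference about the slide layout) and that the logarithm is base 10. The function names are hypothetical.

```python
import math
from collections import Counter


def pair_counts(column):
    """Count ordered pairs of residues taken from two different rows."""
    counts = Counter()
    for i, x in enumerate(column):
        for k, y in enumerate(column):
            if i != k:
                counts[(x, y)] += 1
    return counts


def log_odds(column, a, b):
    """S(a,b) = log10(P(a<->b) / (P(a) * P(b))) for two different residues a, b."""
    n = len(column)
    freq = Counter(column)
    counts = pair_counts(column)
    p_ab = (counts[(a, b)] + counts[(b, a)]) / (n * (n - 1))
    return math.log10(p_ab / ((freq[a] / n) * (freq[b] / n)))


column = "TAAGCGAAAA"  # fourth character of each of the ten sequences
print(sum(pair_counts(column).values()))  # total number of ordered pairs
print(round(log_odds(column, "A", "C"), 2))
```

Without intermediate rounding, P(A<->C) is 12/90 ≈ 0.133 rather than 0.14, so the score printed here is slightly lower than the hand calculation with rounded probabilities.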
PAM250 Matrix (log odds multiplied by 10)
BLOSUM Matrices (Henikoff and Henikoff)
• PAM calculation is based on global alignments
• PAM matrices don’t work well for aligning sequences with little similarity.
• BLOSUM: BLOcks SUbstitution Matrix
• BLOSUM is based on highly conserved local regions / blocks without gaps.
Block 1 Block2
BLOSUM62 Matrix
Significance of Sequence Alignment Score
• Why do we need a significance test?
• Mathematical view: unusual versus “by chance”
• Biological view: evolutionarily related or not?
Randomization Approach
• Randomization is a fundamental idea due to the statistician Fisher.
• Randomly permute the characters within sequences P and Q to generate new sequences (P’ and Q’). Align the random sequences and record the alignment scores.
• Assuming these scores obey a normal distribution, compute the mean (u) and standard deviation (σ) of the alignment scores.
Normal distribution of alignment scores of two sequences
[Figure: normal distribution curve over the alignment score, with the central 95% region marked]
• If S = u + 2σ, the probability of observing an alignment score equal to or more extreme than this by chance is 2.5%, i.e., P(S >= u + 2σ) = 2.5%. Thus we are 97.5% confident that the alignment score is not by chance, i.e. significant.
• For any score x, we can compute P(S >= x), which is called the p-value.
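The randomization approach can be sketched as below. This is an illustration under stated assumptions: global alignment with the simple scoring scheme (match 1, mismatch -1, gap -1), a normal fit to the shuffled scores as on the slide, and hypothetical function names, trial count and seed.

```python
import random
from statistics import NormalDist, mean, stdev


def global_score(p, q, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch score only, keeping one matrix row at a time."""
    prev = [j * gap for j in range(len(q) + 1)]
    for i in range(1, len(p) + 1):
        cur = [i * gap]
        for j in range(1, len(q) + 1):
            s = match if p[i - 1] == q[j - 1] else mismatch
            cur.append(max(prev[j - 1] + s, cur[j - 1] + gap, prev[j] + gap))
        prev = cur
    return prev[-1]


def randomization_p_value(p, q, trials=200, seed=0):
    """Return (observed score, u, sigma, estimated P(S >= observed))."""
    rng = random.Random(seed)
    observed = global_score(p, q)
    scores = []
    for _ in range(trials):
        # Permute the characters within each sequence and realign.
        p_shuffled = "".join(rng.sample(p, len(p)))
        q_shuffled = "".join(rng.sample(q, len(q)))
        scores.append(global_score(p_shuffled, q_shuffled))
    u, sigma = mean(scores), stdev(scores)
    if sigma == 0:  # degenerate case: every shuffled score is identical
        return observed, u, sigma, (0.0 if observed > u else 1.0)
    return observed, u, sigma, 1.0 - NormalDist(u, sigma).cdf(observed)


observed, u, sigma, p_value = randomization_p_value("AGATCAGAAATG", "ATAGAAT")
print(observed, round(u, 2), round(sigma, 2), round(p_value, 3))
```

For sequences this short the shuffled scores are often close to the observed one, so the p-value is typically not small; longer, genuinely related sequences separate much more clearly.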
Model-Based Approach (Karlin and Altschul)
http://www.people.virginia.edu/~wrp/cshl02/Altschul/Altschul-3.html
• Extreme Value Distribution
K and λ are statistical parameters depending on the substitution matrix. For BLOSUM62, λ = 0.252, K = 0.35
P-Value
• P(S≥x) is called the p-value. It is the probability that two random sequences have an alignment score >= x.
• Smaller means more significant.
• E-value = p-value * # sequences
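As a rough sketch of how these quantities are computed in the model-based approach, the code below uses the commonly cited Karlin-Altschul extreme value form P(S >= x) = 1 - exp(-K·m·n·e^(-λx)) for two sequences of lengths m and n. That formula is not shown in this transcript and is supplied here as an assumption; the default λ and K are the values quoted for BLOSUM62 above, and the function names are hypothetical.

```python
import math


def karlin_altschul_p_value(x, m, n, lam=0.252, K=0.35):
    """P(S >= x) = 1 - exp(-K * m * n * exp(-lam * x)) (assumed standard form)."""
    expected = K * m * n * math.exp(-lam * x)
    return -math.expm1(-expected)  # numerically stable 1 - exp(-expected)


def e_value(p_value, num_sequences):
    """E-value as defined on the slide: p-value times the number of sequences."""
    return p_value * num_sequences


p = karlin_altschul_p_value(x=60, m=300, n=300)
print(p, e_value(p, num_sequences=1_000_000))
```

The sequence lengths, the score of 60 and the database size of one million sequences are made-up inputs for illustration only.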
Problems of Using Dynamic Programming to Search Large Sequence Databases
• Searching for homologs in DNA and protein databases is often the first step of a bioinformatics study.
• DP is too slow for searching large sequence databases such as GenBank and UniProt. Each DP search can take hours.
• Most DP search time is wasted on unrelated sequences or dissimilar regions.
• Developing fast, practical sequence comparison methods for database search is important.
Fast Sequence Search Methods
• All successful, rapid sequence comparison methods are based on a simple fact: similar sequences share some common words.
• The first such method is FASTP (Pearson & Lipman, 1985)
• The most widely used methods are BLAST (Altschul et al., 1990) and PSI-BLAST (Altschul et al., 1997).
Basic Local Alignment Search Tool (S. Altschul, W. Gish, W. Miller, E. Myers and D. Lipman)
1. Compile a list of words for a query
2. Scan sequences in database for word hits
3. Extend hits
[Photos: David Lipman, Stephen Altschul]
Step 1: Compile Word List
• Words: w-mer with length w.
• Protein 4-mer and DNA 12-mer
Query: DSRSKGEPRDSGTLQSQEAKAVKKTSLFE
Words: DSRS, SRSK, RSKG, SKGE, KGEP, …
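Compiling the word list amounts to sliding a window of length w along the query. A minimal Python sketch, with a hypothetical function name:

```python
def compile_words(query, w):
    """All overlapping w-mers of the query, in order of starting position."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


query = "DSRSKGEPRDSGTLQSQEAKAVKKTSLFE"
words = compile_words(query, 4)
print(words[:5])  # ['DSRS', 'SRSK', 'RSKG', 'SKGE', 'KGEP']
print(len(words))
```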
Step 2: Scan Database
Classic problem: find occurrence of a list of words in a sequence.
•Integer indexing approach (hashing)
•Deterministic finite automaton or finite state machine (faster)
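The hashing approach can be sketched with a Python dictionary standing in for the integer index: every query word maps to its positions in the query, and each window of the database sequence is looked up in that table. The function names are hypothetical, and the sequences are the query and database sequence used in this deck's extension example.

```python
from collections import defaultdict


def build_word_index(query, w):
    """Map each w-mer of the query to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(query) - w + 1):
        index[query[i:i + w]].append(i)
    return index


def scan_for_hits(index, subject, w):
    """Return (query_pos, subject_pos) for every word shared with the subject."""
    hits = []
    for j in range(len(subject) - w + 1):
        for i in index.get(subject[j:j + w], ()):
            hits.append((i, j))
    return hits


query = "DSRSKGEPRDSGTLQSQEAKAVKKTSLFE"
subject = "PESRSKGEPRDSGKKQMDSOKPD"
hits = scan_for_hits(build_word_index(query, 4), subject, 4)
print(hits)
```

All hits in this example lie on the same diagonal (subject position minus query position equals 1), which is what makes them good seeds for extension.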
Finite State Machine
• A finite state machine for string matching (input alphabet: a, b, c)
Word: abab
[Figure: state diagram with states 0, 1, 2, 3, 4 and transitions labeled a and b]
Database sequence: bcabccaaababacababacabb
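A deterministic finite automaton for a single word can be built with the standard Knuth-Morris-Pratt style construction, sketched below. The function names are hypothetical, and the construction shown is one common way to build such a machine, not necessarily the one drawn on the slide. State k means that the last k characters read match the first k characters of the word.

```python
def build_dfa(word, alphabet):
    """dfa[state][char] gives the next state after reading char."""
    dfa = [{c: 0 for c in alphabet} for _ in range(len(word) + 1)]
    dfa[0][word[0]] = 1
    restart = 0  # state reached by feeding word[1:state] to the machine
    for state in range(1, len(word) + 1):
        for c in alphabet:
            dfa[state][c] = dfa[restart][c]      # mismatch: fall back
        if state < len(word):
            dfa[state][word[state]] = state + 1  # match: advance
            restart = dfa[restart][word[state]]
    return dfa


def find_word(dfa, word, text):
    """Start positions of every occurrence of word in text, in one pass."""
    state, starts = 0, []
    for pos, c in enumerate(text):
        state = dfa[state].get(c, 0)
        if state == len(word):
            starts.append(pos - len(word) + 1)
    return starts


dfa = build_dfa("abab", "abc")
print(find_word(dfa, "abab", "bcabccaaababacababacabb"))  # [8, 14]
```

Each database character is read exactly once, which is why this approach is faster than re-hashing every window.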
Step 3: Extension
• Extend words on both ends
• Terminate the process when we reach a segment pair whose score falls a certain distance below the best score found for shorter extensions.
• This departs from the ideal of finding a guaranteed Maximum Segment Pair (MSP), but the added inaccuracy is negligible.
• Report significant MSPs according to the extreme value distribution
An Example of Extension
Query: DSRSKGEPRDSGTLQSQEAKAVKKTSLFE
Words: DSRS, SRSK, RSKG, SKGE, KGEP, …
Database Sequence: PESRSKGEPRDSGKKQMDSOKPD
Maximum Segment Pair: ESRSKGEPRDSG
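The extension step can be sketched as an ungapped extension in both directions that stops once the running score drops a fixed distance below the best score seen. The sketch assumes identity scoring (match 1, mismatch -1) and a hypothetical drop-off of 2; the function and parameter names are illustrative only.

```python
def extend_hit(query, subject, qi, sj, w, drop=2, match=1, mismatch=-1):
    """Extend a word hit without gaps; returns (score, q_start, s_start, length)."""

    def pair(a, b):
        return match if a == b else mismatch

    def extend(qpos, spos, step):
        best = run = best_len = length = 0
        while 0 <= qpos < len(query) and 0 <= spos < len(subject):
            run += pair(query[qpos], subject[spos])
            length += 1
            if run > best:
                best, best_len = run, length
            elif best - run >= drop:
                break  # fell too far below the best score so far
            qpos += step
            spos += step
        return best, best_len

    seed = sum(pair(query[qi + k], subject[sj + k]) for k in range(w))
    right, right_len = extend(qi + w, sj + w, +1)
    left, left_len = extend(qi - 1, sj - 1, -1)
    return seed + left + right, qi - left_len, sj - left_len, w + left_len + right_len


query = "DSRSKGEPRDSGTLQSQEAKAVKKTSLFE"
subject = "PESRSKGEPRDSGKKQMDSOKPD"
score, q_start, s_start, length = extend_hit(query, subject, 1, 2, 4)
print(score, subject[s_start:s_start + length])  # 11 SRSKGEPRDSG
```

With identity scoring the leading D/E pair counts as a mismatch, so the segment found here is one residue shorter than the Maximum Segment Pair shown above. Under a substitution matrix that scores the D/E pair positively, as BLOSUM62 does, the extension would include that pair as well.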
P-Value and E-Value
• P-value = Prob(score >= S)
• E-value = database size * p-value
• Common threshold: 0.01
[Figure: distribution of alignment scores with score S marked]
Why is BLAST so Successful?
• Addresses a fundamental problem
• Extremely fast
• Simple, yet powerful idea
• Well founded in computer theory (word and string matching, hashing, random process of Maximum Segment Pair)
• Implementation tricks -> super speed software
• Sacrifices a little accuracy for speed in practice (good heuristics): realism
NCBI Online Blast
DNA Blast
Protein Blast
Output Format
Matched sequences ranked by score and E-value
Significant local alignments
Application of Pairwise Sequence Comparison
• Database search
• Clustering
• Classification
• Motif recognition