Sequential Data Mining Jianlin Cheng Department of Computer Science University of Missouri, Columbia...

Sequential Data Mining

Jianlin Cheng

Department of Computer Science

University of Missouri, Columbia

2011

Sequential Data

• Sequence data: a series of letters (symbols from an alphabet) or numbers

• Spatial sequence data: biological sequence, text document, a computer program, web page, a missile trajectory

• Temporal sequence data: speech, movie clips, stock prices, a series of web clicks

Key Issue

• Measure the similarity between two sequences, i.e. sequence comparison

• Sequence database search / retrieval

• Sequence classification

• Sequence clustering

• Sequence pattern recognition

• Sequence similarity definition and calculation is a challenge compared to vector-based data

Sequence Alignment

• Use biological sequences as an example

• Optimal pairwise sequence alignment algorithm

• Fast, heuristic sequence alignment (BLAST)

• Statistical significance of sequence alignment score

Sequence Data

• DNA Sequence (gene)

gaagagagcttcaggtttggggaagagaca
acaactcccgctcagaagcaggggccgata

• RNA Sequence

caaaucacuacacacaggguagaagguggaacgcacaggagcaugucaacggggugc

• Protein Sequence

RELQVWGRDNNSLSEAGANRQGDVSFNLPQITLWQRPLVTIKIGGQLKEALLDTGADDTVLEDIDLPGKWKP

Global Pairwise Sequence Alignment

ITAKPAKTTSPKEQAIGLSVTFLSFLLPAGVLYHL

ITAKPQWLKTSESVTFLSFLLPQTQGLYHL

Global Pairwise Sequence Alignment

ITAKPAKT-TSPKEQAIGLSVTFLSFLLPAG-VLYHL

ITAKPQWLKTSE-------SVTFLSFLLPQTQGLYHL

Alignment (similarity) score

ITAKPAKTTSPKEQAIGLSVTFLSFLLPAGVLYHL

ITAKPQWLKTSESVTFLSFLLPQTQGLYHL

Three Main Issues

1. Definition of alignment score

2. Algorithms for finding the optimal alignment

3. Evaluation of the significance of the alignment score

A Simple Scoring Scheme

• Score of a character pair: S(match) = 1, S(not_match) = -1, S(gap-char) = -1

• Score of an alignment = $\sum_{i=1}^{n} S_i$

ITAKPAKTPTSPKEQAIGLSVTFLSFLLPAGWVLYHL

ITAKPQWLKSTE-------SVTFLSFLLPQTQGLYHL

5 – 7 – 7 +10 -4 + 4= 1
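The column-by-column sum above can be checked with a few lines of Python. This is a minimal sketch; the function name `alignment_score` and its keyword parameters are illustrative and not part of the slides.

```python
def alignment_score(row_a, row_b, match=1, mismatch=-1, gap=-1):
    """Sum the per-column scores of an already aligned pair of rows."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            total += gap        # a character paired with a gap
        elif a == b:
            total += match      # identical characters
        else:
            total += mismatch   # different characters
    return total


top = "ITAKPAKTPTSPKEQAIGLSVTFLSFLLPAGWVLYHL"
bottom = "ITAKPQWLKSTE-------SVTFLSFLLPQTQGLYHL"
print(alignment_score(top, bottom))
```

The six runs of columns contribute 5, -7, -7, +10, -4 and +4, which is the sum shown on the slide.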

Optimization Problem

• How many possible alignments exist for two sequences with length m and n?

• How can we find the best alignment to maximize alignment score?

Possible Alignments

Sequences: ACTG and ATGGATG

Shortest

ACTG---
ATGGATG

Intermediate

AC-TG---
ATG-GATG

Longest

ACTG-------
----ATGGATG

Total Number of Possible Alignments

A G A A T T C A A G GA A A A A T T C G C

AGATCAGAAAT-G--AT-AG-AATCC

m + n

Total Number of Alignments

Select m positions out of m+n possible positions:

$$\binom{m+n}{m} = \frac{(m+n)!}{m!\,n!}$$

Exponential!

If m = 300, n = 300, total ≈ $10^{179}$
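To get a feeling for how quickly this count grows, the binomial coefficient can be evaluated exactly with Python's arbitrary-precision integers. This is a small sketch; `count_alignments` is an illustrative name, and it counts alignments exactly as defined on the slide (choosing m of m+n positions).

```python
import math


def count_alignments(m, n):
    """Number of ways to choose m positions out of m + n."""
    return math.comb(m + n, m)


total = count_alignments(300, 300)
# Report the order of magnitude via the number of decimal digits.
print(f"C(600, 300) has {len(str(total))} decimal digits")
```

For m = n = 300 the result has 180 decimal digits, i.e. it is on the order of 10^179, so enumerating all alignments is not feasible.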

Needleman and Wunsch Algorithm

• Given sequences P and Q, we use a matrix M to record the optimal alignment scores of all prefixes of P and Q. M[i,j] is the best alignment score for the prefixes P[1..i] and Q[1..j].

• M[i,j] = max [ M[i-1,j-1] + S(P[i],Q[j]), M[i,j-1] + S(-, Q[j]), M[i-1,j] + S(P[i], -) ]


Dynamic Programming

Dynamic Programming Algorithm

Three-Step Algorithm:

• Initialization

• Matrix filling (scoring)

• Trace back (alignment)

1. Initialization of Matrix M

Rows (index i): AGATCAGAAATG. Columns (index j): ATAGAAT. "_" marks the empty prefix.

       _   A   T   A   G   A   A   T
  _    0  -1  -2  -3  -4  -5  -6  -7
  A   -1
  G   -2
  A   -3
  T   -4
  C   -5
  A   -6
  G   -7
  A   -8
  A   -9
  A  -10
  T  -11
  G  -12

2. Fill Matrix

M[i,j] = max [ M[i-1,j-1] + S(P[i],Q[j]), M[i,j-1] + S(-, Q[j]), M[i-1,j] + S(P[i], -) ]

Matrix after the first two columns have been filled:

       _   A   T   A   G   A   A   T
  _    0  -1  -2  -3  -4  -5  -6  -7
  A   -1   1   0
  G   -2   0   0
  A   -3  -1  -1
  T   -4  -2   0
  C   -5  -3  -1
  A   -6  -4  -2
  G   -7  -5  -3
  A   -8  -6  -4
  A   -9  -7  -5
  A  -10  -8  -6
  T  -11  -9  -7
  G  -12 -10  -8


Completed matrix:

       _   A   T   A   G   A   A   T
  _    0  -1  -2  -3  -4  -5  -6  -7
  A   -1   1   0  -1  -2  -3  -4  -5
  G   -2   0   0  -1   0  -1  -2  -3
  A   -3  -1  -1   1   0   1   0  -1
  T   -4  -2   0   0   0   0   0   1
  C   -5  -3  -1  -1  -1  -1  -1   0
  A   -6  -4  -2   0  -1   0   0  -1
  G   -7  -5  -3  -1   1   0  -1  -1
  A   -8  -6  -4  -2   0   2   1   0
  A   -9  -7  -5  -3  -1   1   3   2
  A  -10  -8  -6  -4  -2   0   2   2
  T  -11  -9  -7  -5  -3  -1   1   3
  G  -12 -10  -8  -6  -4  -2   0   2


3. Trace Back

Starting from the bottom-right cell M[12,7] (alignment score 2), follow the recorded path back to M[0,0]:

AGATCAGAAATG
--AT-AG-AAT-

Pairwise Alignment Algorithm Using Dynamic Programming

• Initialization: Given two sequences with length m and n, create an (m+1)×(n+1) matrix M. Initialize the first row and first column according to the scoring scheme.

• For j in 1..n (column)
    for i in 1..m (row)
      M[i,j] = max( M[i-1,j-1] + S(i,j), M[i,j-1] + S(-,j), M[i-1,j] + S(i,-) )
      Record the selected path toward (i,j)

• Report alignment score M[m,n] and trace back to M[0,0] to generate the optimal alignment.

What is the time complexity of this algorithm?
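The three steps can be sketched in Python as follows. This is an illustrative implementation under the simple scoring scheme (match 1, mismatch -1, gap -1); the function name and the tie-breaking order in the trace back are assumptions, so the alignment it reports may differ from the one on the slide while having the same score. Each of the (m+1)×(n+1) cells is computed from three neighbours, so filling the matrix takes time proportional to m×n.

```python
def needleman_wunsch(p, q, match=1, mismatch=-1, gap=-1):
    """Global alignment by dynamic programming; returns (matrix, row_p, row_q)."""
    m, n = len(p), len(q)

    def s(i, j):
        # Score of pairing p[i-1] with q[j-1] (1-based matrix indices).
        return match if p[i - 1] == q[j - 1] else mismatch

    # Step 1: initialization of the first row and first column.
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        M[i][0] = i * gap
    for j in range(1, n + 1):
        M[0][j] = j * gap

    # Step 2: matrix filling.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            M[i][j] = max(M[i - 1][j - 1] + s(i, j),  # pair p[i] with q[j]
                          M[i][j - 1] + gap,          # gap in p
                          M[i - 1][j] + gap)          # gap in q

    # Step 3: trace back from M[m][n] to M[0][0].
    row_p, row_q = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and M[i][j] == M[i - 1][j - 1] + s(i, j):
            row_p.append(p[i - 1]); row_q.append(q[j - 1]); i -= 1; j -= 1
        elif i > 0 and M[i][j] == M[i - 1][j] + gap:
            row_p.append(p[i - 1]); row_q.append("-"); i -= 1
        else:
            row_p.append("-"); row_q.append(q[j - 1]); j -= 1
    return M, "".join(reversed(row_p)), "".join(reversed(row_q))


M, row_p, row_q = needleman_wunsch("AGATCAGAAATG", "ATAGAAT")
print(M[-1][-1])
print(row_p)
print(row_q)
```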

Local Sequence Alignment Using DP

• Biological sequences often have only local similarity.

• During evolution, only functionally and structurally important regions are highly conserved.

• Global alignment sacrifices local similarity to maximize the global alignment score.

• We need an alignment method that identifies the locally similar regions regardless of the other, dissimilar regions.

Transcription binding site

Local Alignment Algorithm

Goal: find an alignment of the substrings of P and Q with maximum alignment score.

Naïve Algorithm:

(m+1)*m/2 substrings of P, (n+1)*n/2 substrings of Q

Using DP for each substring pair: $m^2 \cdot n^2 \cdot O(mn) = O(m^3 n^3)$

Smith-Waterman Algorithm

Same dynamic programming algorithm as global alignment except for three differences.

1. All negative scores are converted to 0 (why?)

2. Alignment can start from anywhere in the matrix

3. Alignment can end anywhere in the matrix

Local Alignment Algorithm

• Initialization: Given two sequences with length m and n, create an (m+1)×(n+1) matrix M. Initialize the first row and first column to 0s.

• For j in 1..n (column)
    for i in 1..m (row)
      M[i,j] = max( 0, M[i-1,j-1] + S(i,j), M[i,j-1] + S(-,j), M[i-1,j] + S(i,-) )
      Record the selected path.

• Find the elements of matrix M with the maximum value. Trace back until a 0 is reached and report the alignment corresponding to the path.
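The local variant differs from the global one only in the zero floor, the zero initialization and where the trace back starts and stops. The sketch below assumes the same simple scoring scheme (match 1, mismatch -1, gap -1); the function names and the tie-breaking order in `trace_back` are illustrative, so when several equally good paths exist it may report a different one than the slide.

```python
def smith_waterman(p, q, match=1, mismatch=-1, gap=-1):
    """Fill the local-alignment matrix; returns (matrix, best score, end cells)."""
    m, n = len(p), len(q)
    M = [[0] * (n + 1) for _ in range(m + 1)]  # first row and column stay 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if p[i - 1] == q[j - 1] else mismatch
            M[i][j] = max(0,                    # negative scores become 0
                          M[i - 1][j - 1] + s,
                          M[i][j - 1] + gap,
                          M[i - 1][j] + gap)
    best = max(max(row) for row in M)
    ends = [(i, j) for i in range(m + 1) for j in range(n + 1) if M[i][j] == best]
    return M, best, ends


def trace_back(M, p, q, i, j, match=1, mismatch=-1, gap=-1):
    """Walk back from cell (i, j) until a 0 is reached."""
    row_p, row_q = [], []
    while i > 0 and j > 0 and M[i][j] > 0:
        s = match if p[i - 1] == q[j - 1] else mismatch
        if M[i][j] == M[i - 1][j - 1] + s:
            row_p.append(p[i - 1]); row_q.append(q[j - 1]); i -= 1; j -= 1
        elif M[i][j] == M[i - 1][j] + gap:
            row_p.append(p[i - 1]); row_q.append("-"); i -= 1
        else:
            row_p.append("-"); row_q.append(q[j - 1]); j -= 1
    return "".join(reversed(row_p)), "".join(reversed(row_q))


P, Q = "AGATCAGAAATG", "ATAGAAT"
M, best, ends = smith_waterman(P, Q)
print(best, ends)
for i, j in ends:
    print(trace_back(M, P, Q, i, j))
```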

1. Initialization

Rows: AGATCAGAAATG. Columns: ATAGAAT. "_" marks the empty prefix.

       _   A   T   A   G   A   A   T
  _    0   0   0   0   0   0   0   0
  A    0
  G    0
  A    0
  T    0
  C    0
  A    0
  G    0
  A    0
  A    0
  A    0
  T    0
  G    0

2. Fill the Matrix

       _   A   T   A   G   A   A   T
  _    0   0   0   0   0   0   0   0
  A    0   1   0   1   0   1   1   0
  G    0   0   0   0   2   1   0   0
  A    0   1   0   1   1   3   2   1
  T    0   0   2   1   0   2   2   3
  C    0   0   1   1   0   1   1   2
  A    0   1   0   2   1   1   2   1
  G    0   0   0   1   3   2   1   1
  A    0   1   0   1   2   4   3   2
  A    0   1   0   1   1   3   5   4
  A    0   1   0   1   0   2   4   4
  T    0   0   2   1   0   1   3   5
  G    0   0   1   1   2   1   2   4

3. Trace Back

The maximum value in the matrix is 5, reached at M[9,6] and M[11,7]. Trace back from each of these cells until a 0 is reached.

Two Best Local Alignments

Local Alignment 1:

ATCAGAA
AT-AGAA

Local Alignment 2:

ATCAGAAAT
AT-AGA-AT

Scoring Matrix

• How to accurately measure the similarity between amino acids (or nucleotides) is one key issue of sequence alignment.

• For nucleotides, a simple identical / not identical scheme is mostly ok.

• Due to various properties of amino acids, it is hard and also critical to measure the similarity between amino acids.

Evolutionary Substitution Approach

• During evolution, the substitution of similar amino acids is more likely to be selected within a protein family of similar sequences than random substitutions (M. Dayhoff)

• The frequency / probability that one residue substitutes another one is an indicator of their similarity.

PAM Scoring Matrices (M. Dayhoff)

• Select a number of protein families.

• Align sequences in each family and count the frequency of amino acid substitution in each column: Pij.

• Similarity score is the logarithm of the ratio of the observed substitution probability to the random substitution probability: S(i,j) = log(Pij / (Pi * Pj)), where Pi is the observed probability of residue i and Pj is the observed probability of residue j.

• PAM: Point Accepted Mutation

• Let data tell strategy! - NBA


A Simplified Example

Chars   Prob.
A       6 / 10
C       1 / 10
G       2 / 10
T       1 / 10

ACGTCGAGTACCACGTGTCACACTACTACCGCATGAACCCTATCTTCCGTAACAACCATAAGTAGCATAAGTACTATAAGTACGATAAGT

Substitution Frequency Table

       A    C    G    T
  A   30    6   12    6
  C    6    0    2    1
  G   12    2    2    2
  T    6    1    2    0

Total number of substitutions: 90

Substitution probabilities (frequency / total):

       A     C     G     T
  A   .33   .07   .14   .07
  C   .07   0     .02   .01
  G   .14   .02   .01   .02
  T   .07   .01   .02   0

P(A<->C) = 0.07 + 0.07 = 0.14

S(A,C) = log10(0.14 / (0.6 * 0.1)) ≈ 0.37
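The log-odds calculation for the A/C pair can be reproduced directly. This is a minimal sketch that plugs in the rounded probabilities from the table; the function name `log_odds` is illustrative, and base-10 logarithms are an assumption that matches the value on the slide.

```python
import math


def log_odds(p_pair, p_i, p_j):
    """log10 of observed pair probability over the probability expected by chance."""
    return math.log10(p_pair / (p_i * p_j))


p_ac = 0.07 + 0.07            # P(A<->C): both orientations of the pair
score = log_odds(p_ac, 0.6, 0.1)
print(round(score, 3))
```

The ratio is about 2.33, so A and C are paired more than twice as often as chance predicts and the score is positive (about 0.368). A pair observed less often than chance would get a negative score.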

PAM250 Matrix (log odds multiplied by 10)

BLOSUM Matrices (Henikoff and Henikoff)

• PAM calculation is based on global alignments

• PAM matrices don't work well for aligning sequences with little similarity.

• BLOSUM: BLOcks SUbstitution Matrix

• BLOSUM is based on highly conserved local regions / blocks without gaps.

Block 1    Block 2

BLOSUM62 Matrix

Significance of Sequence Alignment Score

• Why do we need a significance test?

• Mathematical view: unusual versus "by chance"

• Biological view: evolutionarily related or not?

Randomization Approach

• Randomization is a fundamental idea due to the statistician R. A. Fisher.

• Randomly permute the characters within sequences P and Q to generate new sequences (P' and Q'). Align the random sequences and record the alignment scores.

• Assuming these scores follow a normal distribution, compute the mean (μ) and standard deviation (σ) of the alignment scores.

[Figure: normal distribution of alignment scores of two sequences; x-axis: alignment score; 95% of the scores lie within μ ± 2σ]

• If S = μ + 2σ, the probability of observing an alignment score equal to or more extreme than this by chance is 2.5%, i.e., P(S ≥ μ + 2σ) = 2.5%. Thus we are 97.5% confident that the alignment score is not due to chance, i.e. it is significant.

• For any score x, we can compute P(S ≥ x), which is called the p-value.
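The randomization procedure can be sketched as below. Everything here is illustrative: the function names, the number of trials, the fixed random seed and the simple scoring scheme (match 1, mismatch -1, gap -1) are assumptions. The sketch reports both the normal-approximation p-value described on the slide and the empirical fraction of shuffled scores that reach the observed score.

```python
import math
import random
import statistics


def global_score(p, q, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score, keeping only one matrix row at a time."""
    prev = [j * gap for j in range(len(q) + 1)]
    for i, a in enumerate(p, 1):
        cur = [i * gap]
        for j, b in enumerate(q, 1):
            s = match if a == b else mismatch
            cur.append(max(prev[j - 1] + s, cur[j - 1] + gap, prev[j] + gap))
        prev = cur
    return prev[-1]


def shuffled(seq, rng):
    chars = list(seq)
    rng.shuffle(chars)          # random permutation of the characters
    return "".join(chars)


def permutation_test(p, q, trials=200, seed=0):
    rng = random.Random(seed)
    observed = global_score(p, q)
    scores = [global_score(shuffled(p, rng), shuffled(q, rng)) for _ in range(trials)]
    mu = statistics.mean(scores)
    sigma = statistics.pstdev(scores)
    # Upper-tail probability under the normal assumption (None if sigma is 0).
    normal_p = None
    if sigma > 0:
        z = (observed - mu) / sigma
        normal_p = 0.5 * math.erfc(z / math.sqrt(2))
    empirical_p = sum(sc >= observed for sc in scores) / trials
    return {"observed": observed, "mean": mu, "std": sigma,
            "normal_p": normal_p, "empirical_p": empirical_p}


result = permutation_test("AGATCAGAAATG", "ATAGAAT")
print(result)
```

With sequences this short the shuffled scores are often close to the observed one, so a large p-value is to be expected; the procedure becomes informative for longer sequences.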

Model-Based Approach (Karlin and Altschul)

http://www.people.virginia.edu/~wrp/cshl02/Altschul/Altschul-3.html

• Extreme Value Distribution

K and λ are statistical parameters depending on the substitution matrix. For BLOSUM62, λ = 0.252, K = 0.35
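The formula image from this slide did not survive extraction. The sketch below therefore assumes the commonly stated Karlin-Altschul form, in which the expected number of chance hits with score at least x is E = K·m·n·e^(−λx) and P(S ≥ x) = 1 − e^(−E). Here m and n are taken to be the lengths of the two sequences (an assumption), and the default λ and K are the BLOSUM62 values quoted above.

```python
import math


def karlin_altschul(score, m, n, lam=0.252, K=0.35):
    """Return (expected chance hits, p-value) under the extreme value model."""
    expected = K * m * n * math.exp(-lam * score)
    p_value = -math.expm1(-expected)   # 1 - exp(-expected), stable for tiny values
    return expected, p_value


for x in (30, 50, 100):
    e, p = karlin_altschul(x, 300, 300)
    print(x, e, p)
```

The p-value shrinks exponentially as the score grows, which is why a modest increase in score can change a hit from insignificant to highly significant.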

P-Value

• P(S≥x) is called the p-value. It is the probability that two random sequences have an alignment score >= x.

• A smaller p-value means a more significant score.

• E-value = p-value * number of sequences

Problems of Using Dynamic Programming to Search Large Sequence Databases

• Searching for homologs in DNA and protein databases is often the first step of a bioinformatics study.

• DP is too slow for searching large sequence databases such as GenBank and UniProt. Each DP search can take hours.

• Most DP search time is wasted on unrelated sequences or dissimilar regions.

• Developing fast, practical sequence comparison methods for database search is important.

Fast Sequence Search Methods

• All successful, rapid sequence comparison methods are based on a simple fact: similar sequences share some common words.

• The first such method was FASTP (Lipman & Pearson, 1985)

• The most widely used methods are BLAST (Altschul et al., 1990) and PSI-BLAST (Altschul et al., 1997).

1. Compile a list of words for a query

2. Scan sequences in database for word hits

3. Extending hits

Basic Local Alignment Search Tool (S. Altschul, W. Gish, W. Miller, E. Myers and D. Lipman)

David Lipman

Stephen Altschul

Step 1: Compile Word List

• Words: w-mers (substrings of length w).

• Protein 4-mer and DNA 12-mer

Query: DSRSKGEPRDSGTLQSQEAKAVKKTSLFE

Words: DSRS, SRSK, RSKG, SKGE, KGEP, …
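Compiling the word list is a sliding window over the query. The sketch below is illustrative (the names `compile_words` and `word_index` are not from the slides) and only lists the exact words of the query, together with the positions where each one occurs.

```python
from collections import defaultdict


def compile_words(query, w=4):
    """All overlapping substrings of length w, in query order."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


def word_index(query, w=4):
    """Map each word to the list of 0-based positions where it starts."""
    index = defaultdict(list)
    for pos, word in enumerate(compile_words(query, w)):
        index[word].append(pos)
    return dict(index)


query = "DSRSKGEPRDSGTLQSQEAKAVKKTSLFE"
words = compile_words(query)
print(words[:5])
print(len(words))
```

A query of length L yields L − w + 1 words, so the 29-residue query above produces 26 protein 4-mers.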

Step 2: Scan Database

Classic problem: find occurrence of a list of words in a sequence.

•Integer indexing approach (hashing)

•Deterministic finite automaton or finite state machine (faster)

Finite State Machine

• A finite state machine for string matching (input alphabet: a, b, c)

Word: abab

[Figure: state diagram with states 0, 1, 2, 3, 4 and transitions labeled a and b]

Database sequence: bcabccaaababacababacabb
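A string-matching automaton for a single word can be built and run as follows. This is a sketch for one word only (BLAST scans for a whole word list at once); the function names are illustrative, and the transition table is built by brute force for clarity rather than speed.

```python
def build_dfa(word, alphabet):
    """State k means: the last k characters read equal the first k of word."""
    m = len(word)
    dfa = []
    for state in range(m + 1):
        row = {}
        for ch in alphabet:
            seen = word[:state] + ch
            # Next state = length of the longest prefix of word that is a suffix of seen.
            k = min(m, len(seen))
            while k > 0 and seen[-k:] != word[:k]:
                k -= 1
            row[ch] = k
        dfa.append(row)
    return dfa


def find_word(word, text, alphabet):
    """Return the 0-based start positions of every occurrence of word in text."""
    dfa = build_dfa(word, alphabet)
    hits, state = [], 0
    for pos, ch in enumerate(text):
        state = dfa[state][ch]       # one table lookup per database character
        if state == len(word):       # accepting state: the word just ended here
            hits.append(pos - len(word) + 1)
    return hits


database = "bcabccaaababacababacabb"
print(find_word("abab", database, "abc"))
```

Each database character costs a single table lookup, regardless of how many partial matches are in progress, which is why the automaton is well suited to scanning long sequences.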

Step 3: Extension

• Extend words on both ends

• Terminate the process when we reach a segment pair whose score falls a certain distance below the best score found for shorter extensions.

• Depart from the ideal of finding a guaranteed Maximum Segment Pair, but the added inaccuracy is negligible.

• Report significant MSP according to extreme value distribution

An Example of Extension

Query: DSRSKGEPRDSGTLQSQEAKAVKKTSLFE

Words: DSRS, SRSK, RSKG, SKGE, KGEP, …

Database Sequence: PESRSKGEPRDSGKKQMDSOKPD

Maximum Segment Pair: ESRSKGEPRDSG
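The extension step can be sketched as an ungapped extension with a drop-off threshold. This is a simplified illustration: it scores identities as +1 and everything else as -1 instead of using a substitution matrix, and the function name, the threshold `x_drop` and the seed positions are assumptions.

```python
def extend_hit(query, target, q_start, t_start, w, x_drop=2, match=1, mismatch=-1):
    """Extend a word hit in both directions without gaps.

    Extension in a direction stops once the running score falls more than
    x_drop below the best score seen so far. Returns (score, query segment,
    target segment).
    """
    def pair(a, b):
        return match if a == b else mismatch

    best = sum(pair(query[q_start + k], target[t_start + k]) for k in range(w))

    def extend(i, j, step, best):
        cur, best_len, length = best, 0, 0
        while 0 <= i < len(query) and 0 <= j < len(target):
            cur += pair(query[i], target[j])
            length += 1
            if cur > best:
                best, best_len = cur, length
            elif best - cur > x_drop:
                break
            i += step
            j += step
        return best, best_len

    best, right = extend(q_start + w, t_start + w, +1, best)
    best, left = extend(q_start - 1, t_start - 1, -1, best)
    return (best,
            query[q_start - left:q_start + w + right],
            target[t_start - left:t_start + w + right])


query = "DSRSKGEPRDSGTLQSQEAKAVKKTSLFE"
database = "PESRSKGEPRDSGKKQMDSOKPD"
# Seed: the word SRSK starts at query position 1 and database position 2 (0-based).
print(extend_hit(query, database, 1, 2, 4))
```

Under identity scoring the segment stops at SRSKGEPRDSG. The slide's segment also includes the leading D/E pair, which scores positively under a substitution matrix such as BLOSUM62 but counts as a mismatch in this simplified scheme.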

P-Value and E-Value

• P-value = Prob(score >= S)

• E-value = database size * p-value

• Common threshold: 0.01

Why is BLAST so Successful?

• Address a fundamental problem

• Extremely fast

• Simple, yet powerful idea

• Well founded in computer theory (words-string matching, hashing, random process of Maximum Segment Pair)

• Implementation tricks -> super speed software

• Sacrifice a little accuracy for speed in practice (good heuristics): realism

NCBI Online Blast

DNA Blast

Protein Blast

Output Format

Matched sequences ranked by score and E-value

Significant local alignments

Application of Pairwise Sequence Comparison

• Database search

• Clustering

• Classification

• Motif recognition
