
Sequential Data Mining

Jianlin Cheng
Department of Computer Science
University of Missouri, Columbia

2011

Sequential Data

• Sequence data: a series of letters or numbers
• Spatial sequence data: biological sequences, text documents, computer programs, web pages, missile trajectories
• Temporal sequence data: speech, movie clips, stock prices, a series of web clicks

Key Issue

• Measure the similarity between two sequences, i.e., sequence comparison
• Sequence database search / retrieval
• Sequence classification
• Sequence clustering
• Sequence pattern recognition
• Sequence similarity definition and calculation is a challenge compared to vector-based data

Sequence Alignment

• Use biological sequences as an example
• Optimal pairwise sequence alignment algorithm
• Fast, heuristic sequence alignment (BLAST)
• Statistical significance of sequence alignment score

Sequence Data

• DNA Sequence (gene)
  gaagagagcttcaggtttggggaagagaca
  acaactcccgctcagaagcaggggccgata
• RNA Sequence
  caaaucacuacacacaggguagaagguggaacgcacaggagcaugucaacggggugc
• Protein Sequence
  RELQVWGRDNNSLSEAGANRQGDVSFNLPQITLWQRPLVTIKIGGQLKEALLDTGADDTVLEDIDLPGKWKP

Global Pairwise Sequence Alignment

ITAKPAKTTSPKEQAIGLSVTFLSFLLPAGVLYHL

ITAKPQWLKTSESVTFLSFLLPQTQGLYHL


Global Pairwise Sequence Alignment

ITAKPAKT-TSPKEQAIGLSVTFLSFLLPAG-VLYHL

ITAKPQWLKTSE-------SVTFLSFLLPQTQGLYHL

Alignment (similarity) score

ITAKPAKTTSPKEQAIGLSVTFLSFLLPAGVLYHL

ITAKPQWLKTSESVTFLSFLLPQTQGLYHL

Three Main Issues

1. Definition of alignment score
2. Algorithms for finding the optimal alignment
3. Evaluation of significance of alignment score

A Simple Scoring Scheme

• Score of a character pair: S(match) = 1, S(not_match) = -1, S(gap-char) = -1
• Score of an alignment = \sum_{i=1}^{n} S_i, where S_i is the score of the i-th aligned column and n is the number of columns

ITAKPAKTPTSPKEQAIGLSVTFLSFLLPAGWVLYHL
ITAKPQWLKSTE-------SVTFLSFLLPQTQGLYHL

5 - 7 - 7 + 10 - 4 + 4 = 1
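The column-by-column sum can be checked with a few lines of Python. This is a minimal sketch: the function name `alignment_score` and its keyword arguments are illustrative, and the defaults are the scores of the simple scheme on this slide (match = 1, mismatch = -1, gap = -1).

```python
def alignment_score(row_p, row_q, match=1, mismatch=-1, gap=-1):
    """Sum the per-column scores of two aligned rows of equal length."""
    if len(row_p) != len(row_q):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row_p, row_q):
        if a == "-" or b == "-":
            total += gap        # a character aligned with a gap
        elif a == b:
            total += match      # identical characters
        else:
            total += mismatch   # different characters
    return total


p = "ITAKPAKTPTSPKEQAIGLSVTFLSFLLPAGWVLYHL"
q = "ITAKPQWLKSTE-------SVTFLSFLLPQTQGLYHL"
print(alignment_score(p, q))  # 1, i.e. 5 - 7 - 7 + 10 - 4 + 4
```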


Optimization Problem

• How many possible alignments exist for two sequences with length m and n?

• How can we find the best alignment to maximize alignment score?

Possible Alignments

ACTG
ATGGATG

Shortest
ACTG---
ATGGATG

Intermediate
AC-TG---
ATG-GATG

Longest
ACTG-------
----ATGGATG

Total Number of Possible Alignments

A G A A T T C A A G G A A A A A T T C G C   (length m + n)

AGATCAGAAAT-G
--AT-AG-AATCC

Total Number of Alignments

Select m positions out of m+n possible positions:

\binom{m+n}{m} = \frac{(m+n)!}{m!\,n!}

Exponential!

If m = 300, n = 300, total ≈ 10^179
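The binomial count can be evaluated directly to see how fast it grows. A small sketch using Python's standard `math.comb`; the helper name `count_alignments` is illustrative and simply encodes the "select m positions out of m+n" formula from this slide.

```python
import math


def count_alignments(m, n):
    """Ways to choose m positions out of m + n: (m+n)! / (m! n!)."""
    return math.comb(m + n, m)


print(count_alignments(4, 7))                 # 330
print(len(str(count_alignments(300, 300))))   # number of decimal digits for m = n = 300
```

Because the count grows exponentially with sequence length, enumerating all alignments is not an option even for short sequences.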


Needleman and Wunsch Algorithm

• Given sequences P and Q, we use a matrix M to record the optimal alignment scores of all prefixes of P and Q. M[i,j] is the best alignment score for the prefixes P[1..i] and Q[1..j].

• M[i,j] = max [ M[i-1,j-1] + S(P[i],Q[j]), M[i,j-1] + S(-, Q[j]), M[i-1,j] + S(P[i], -) ]

Dynamic Programming

Dynamic Programming Algorithm

Three-Step Algorithm:

• Initialization
• Matrix filling (scoring)
• Trace back (alignment)

1. Initialization of Matrix M

Rows are indexed by i, columns by j.

            _    A    T    A    G    A    A    T
       _    0   -1   -2   -3   -4   -5   -6   -7
       A   -1
       G   -2
       A   -3
       T   -4
       C   -5
       A   -6
       G   -7
       A   -8
       A   -9
       A  -10
       T  -11
       G  -12

2. Fill Matrix

            _    A    T    A    G    A    A    T
       _    0   -1   -2   -3   -4   -5   -6   -7
       A   -1    1
       G   -2    0
       A   -3   -1
       T   -4   -2
       C   -5   -3
       A   -6   -4
       G   -7   -5
       A   -8   -6
       A   -9   -7
       A  -10   -8
       T  -11   -9
       G  -12  -10

M[i,j] = max [ M[i-1,j-1] + S(P[i],Q[j]), M[i,j-1] + S(-, Q[j]), M[i-1,j] + S(P[i], -) ]

2. Fill Matrix

            _    A    T    A    G    A    A    T
       _    0   -1   -2   -3   -4   -5   -6   -7
       A   -1    1    0
       G   -2    0    0
       A   -3   -1   -1
       T   -4   -2    0
       C   -5   -3   -1
       A   -6   -4   -2
       G   -7   -5   -3
       A   -8   -6   -4
       A   -9   -7   -5
       A  -10   -8   -6
       T  -11   -9   -7
       G  -12  -10   -8

M[i,j] = max [ M[i-1,j-1] + S(P[i],Q[j]), M[i,j-1] + S(-, Q[j]), M[i-1,j] + S(P[i], -) ]

2. Fill Matrix

            _    A    T    A    G    A    A    T
       _    0   -1   -2   -3   -4   -5   -6   -7
       A   -1    1    0   -1   -2   -3   -4   -5
       G   -2    0    0   -1    0   -1   -2   -3
       A   -3   -1   -1    1    0    1    0   -1
       T   -4   -2    0    0    0    0    0    1
       C   -5   -3   -1   -1   -1   -1   -1    0
       A   -6   -4   -2    0   -1    0    0   -1
       G   -7   -5   -3   -1    1    0   -1   -1
       A   -8   -6   -4   -2    0    2    1    0
       A   -9   -7   -5   -3   -1    1    3    2
       A  -10   -8   -6   -4   -2    0    2    2
       T  -11   -9   -7   -5   -3   -1    1    3
       G  -12  -10   -8   -6   -4   -2    0    2

M[i,j] = max [ M[i-1,j-1] + S(P[i],Q[j]), M[i,j-1] + S(-, Q[j]), M[i-1,j] + S(P[i], -) ]

3. Trace Back

            _    A    T    A    G    A    A    T
       _    0   -1   -2   -3   -4   -5   -6   -7
       A   -1    1    0   -1   -2   -3   -4   -5
       G   -2    0    0   -1    0   -1   -2   -3
       A   -3   -1   -1    1    0    1    0   -1
       T   -4   -2    0    0    0    0    0    1
       C   -5   -3   -1   -1   -1   -1   -1    0
       A   -6   -4   -2    0   -1    0    0   -1
       G   -7   -5   -3   -1    1    0   -1   -1
       A   -8   -6   -4   -2    0    2    1    0
       A   -9   -7   -5   -3   -1    1    3    2
       A  -10   -8   -6   -4   -2    0    2    2
       T  -11   -9   -7   -5   -3   -1    1    3
       G  -12  -10   -8   -6   -4   -2    0    2

AGATCAGAAATG
--AT-AG-AAT-

Pairwise Alignment Algorithm Using Dynamic Programming

• Initialization: Given two sequences of lengths m and n, create an (m+1)×(n+1) matrix M. Initialize the first row and first column according to the scoring scheme.
• For j in 1..n (column)
    for i in 1..m (row)
      M[i,j] = max( M[i-1,j-1]+S(i,j), M[i,j-1]+S(-,j), M[i-1,j]+S(i,-) )
      Record the selected path toward (i,j)
• Report alignment score M[m,n] and trace back to M[0,0] to generate the optimal alignment.

What is the time complexity of this algorithm?
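The three steps can be sketched in Python as follows. This is an illustrative implementation under the simple scheme used in the example matrices (match = 1, mismatch = -1, gap = -1); the function name `needleman_wunsch` and its parameters are not from the slides. Instead of storing the selected path for every cell, the sketch re-derives each step during trace back, and it fills the matrix row by row rather than column by column, which yields the same values. Filling the (m+1)×(n+1) matrix takes time proportional to m·n.

```python
def needleman_wunsch(p, q, match=1, mismatch=-1, gap=-1):
    """Global alignment of p and q; returns (score, aligned_p, aligned_q)."""
    m, n = len(p), len(q)
    # M[i][j] = best score for aligning the prefixes p[:i] and q[:j]
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):          # first column: p[:i] against gaps
        M[i][0] = i * gap
    for j in range(1, n + 1):          # first row: q[:j] against gaps
        M[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if p[i - 1] == q[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1] + s,   # pair p[i] with q[j]
                          M[i][j - 1] + gap,     # gap in p
                          M[i - 1][j] + gap)     # gap in q
    # Trace back from M[m][n] to M[0][0].
    out_p, out_q = [], []
    i, j = m, n
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and p[i - 1] == q[j - 1] else mismatch
        if i > 0 and j > 0 and M[i][j] == M[i - 1][j - 1] + s:
            out_p.append(p[i - 1]); out_q.append(q[j - 1]); i -= 1; j -= 1
        elif i > 0 and M[i][j] == M[i - 1][j] + gap:
            out_p.append(p[i - 1]); out_q.append("-"); i -= 1
        else:
            out_p.append("-"); out_q.append(q[j - 1]); j -= 1
    return M[m][n], "".join(reversed(out_p)), "".join(reversed(out_q))


score, row_p, row_q = needleman_wunsch("AGATCAGAAATG", "ATAGAAT")
print(score)   # 2
print(row_p)
print(row_q)
```

Several alignments can share the optimal score, so the rows printed depend on how ties are broken during trace back.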

Local Sequence Alignment Using DP

• Biological sequences often have only local similarity.
• During evolution, only functionally and structurally important regions are highly conserved.
• Global alignment sacrifices the local similarity to maximize the global alignment score.
• We need an alignment method that identifies the locally similar regions regardless of other dissimilar regions.


Transcription binding site

Local Alignment Algorithm

Goal: find an alignment of the substrings of P and Q with maximum alignment score.

Naïve Algorithm:
(m+1)*m/2 substrings of P, (n+1)*n/2 substrings of Q
Using DP for each substring pair: m^2 * n^2 * O(mn) = O(m^3 n^3)

Smith-Waterman Algorithm

Same dynamic programming algorithm as global alignment except for three differences.

1. All negative scores are converted to 0 (why?)
2. Alignment can start from anywhere in the matrix
3. Alignment can end anywhere in the matrix

Local Alignment Algorithm

• Initialization: Given two sequences of lengths m and n, create an (m+1)×(n+1) matrix M. Initialize the first row and first column to 0s.
• For j in 1..n (column)
    for i in 1..m (row)
      M[i,j] = max( 0, M[i-1,j-1]+S(i,j), M[i,j-1]+S(-,j), M[i-1,j]+S(i,-) )
      Record the selected path.
• Find the elements in matrix M with maximum values. Trace back until reaching 0 and report the alignment corresponding to the path.
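A sketch of the local variant in Python, again assuming the simple scheme match = 1, mismatch = -1, gap = -1. The function name `smith_waterman` and its parameters are illustrative. It reports only the first cell holding the maximum value; the slides' example has two such cells, so a fuller version would trace back from each of them.

```python
def smith_waterman(p, q, match=1, mismatch=-1, gap=-1):
    """Local alignment of p and q; returns (best_score, aligned_p, aligned_q)."""
    m, n = len(p), len(q)
    M = [[0] * (n + 1) for _ in range(m + 1)]   # first row and column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if p[i - 1] == q[j - 1] else mismatch
            M[i][j] = max(0,                      # negative scores are reset to 0
                          M[i - 1][j - 1] + s,
                          M[i][j - 1] + gap,
                          M[i - 1][j] + gap)
            if M[i][j] > best:                    # keep the first maximum found
                best, best_pos = M[i][j], (i, j)
    # Trace back from the maximum until a cell with value 0 is reached.
    out_p, out_q = [], []
    i, j = best_pos
    while i > 0 and j > 0 and M[i][j] > 0:
        s = match if p[i - 1] == q[j - 1] else mismatch
        if M[i][j] == M[i - 1][j - 1] + s:
            out_p.append(p[i - 1]); out_q.append(q[j - 1]); i -= 1; j -= 1
        elif M[i][j] == M[i - 1][j] + gap:
            out_p.append(p[i - 1]); out_q.append("-"); i -= 1
        else:
            out_p.append("-"); out_q.append(q[j - 1]); j -= 1
    return best, "".join(reversed(out_p)), "".join(reversed(out_q))


score, row_p, row_q = smith_waterman("AGATCAGAAATG", "ATAGAAT")
print(score)   # 5
print(row_p)   # ATCAGAA
print(row_q)   # AT-AGAA
```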

1. Initialization

            _    A    T    A    G    A    A    T
       _    0    0    0    0    0    0    0    0
       A    0
       G    0
       A    0
       T    0
       C    0
       A    0
       G    0
       A    0
       A    0
       A    0
       T    0
       G    0

2. Fill the Matrix

            _    A    T    A    G    A    A    T
       _    0    0    0    0    0    0    0    0
       A    0    1    0    1    0    1    1    0
       G    0    0    0    0    2    1    0    0
       A    0    1    0    1    1    3    2    1
       T    0    0    2    1    0    2    2    3
       C    0    0    1    1    0    1    1    2
       A    0    1    0    2    1    1    2    1
       G    0    0    0    1    3    2    1    1
       A    0    1    0    1    2    4    3    2
       A    0    1    0    1    1    3    5    4
       A    0    1    0    1    0    2    4    4
       T    0    0    2    1    0    1    3    5
       G    0    0    1    1    2    1    2    4

3. Trace Back

            _    A    T    A    G    A    A    T
       _    0    0    0    0    0    0    0    0
       A    0    1    0    1    0    1    1    0
       G    0    0    0    0    2    1    0    0
       A    0    1    0    1    1    3    2    1
       T    0    0    2    1    0    2    2    3
       C    0    0    1    1    0    1    1    2
       A    0    1    0    2    1    1    2    1
       G    0    0    0    1    3    2    1    1
       A    0    1    0    1    2    4    3    2
       A    0    1    0    1    1    3    5    4
       A    0    1    0    1    0    2    4    4
       T    0    0    2    1    0    1    3    5
       G    0    0    1    1    2    1    2    4

Two Best Local Alignments

Local Alignment 1:
ATCAGAA
AT-AGAA

Local Alignment 2:
ATCAGAAAT
AT-AGA-AT


Scoring Matrix

• How to accurately measure the similarity between amino acids (or nucleotides) is one key issue of sequence alignment.

• For nucleotides, a simple identical / not identical scheme is mostly ok.

• Due to various properties of amino acids, it is hard and also critical to measure the similarity between amino acids.


Evolutionary Substitution Approach

• During evolution, the substitution of similar amino acids is more likely to be selected within a protein family of similar sequences than random substitutions (M. Dayhoff)

• The frequency / probability that one residue substitutes another one is an indicator of their similarity.

PAM Scoring Matrices (M. Dayhoff)

• Select a number of protein families.
• Align sequences in each family and count the frequency of amino acid substitution in each column: Pij.
• Similarity score is the logarithm of the ratio of the observed substitution probability over the random substitution probability. S(i,j) = log(Pij / (Pi * Pj)). Pi is the observed probability of residue i and Pj is the observed probability of residue j.
• PAM: Point Accepted Mutation
• Let data tell strategy! - NBA

A Simplified Example

ACGTCGAGT
ACCACGTGT
CACACTACT
ACCGCATGA
ACCCTATCT
TCCGTAACA
ACCATAAGT
AGCATAAGT
ACTATAAGT
ACGATAAGT

Chars  Prob.
A      6 / 10
C      1 / 10
G      2 / 10
T      1 / 10

Substitution Frequency Table (counts)

       A    C    G    T
  A   30    6   12    6
  C    6    0    2    1
  G   12    2    2    2
  T    6    1    2    0

Total number of substitutions: 90

Substitution frequencies (counts / 90)

       A    C    G    T
  A  .33  .07  .13  .07
  C  .07    0  .02  .01
  G  .13  .02  .02  .02
  T  .07  .01  .02    0

P(A<->C) = 0.07 + 0.07 = 0.14
S(A,C) = log10(0.14 / (0.6 * 0.1)) ≈ 0.37
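The log-odds score can be reproduced numerically. A minimal sketch, assuming the logarithm is base 10 (the slide writes only "log"); the function name `log_odds` is illustrative.

```python
import math


def log_odds(p_ij, p_i, p_j):
    """S(i,j) = log10( Pij / (Pi * Pj) ): observed vs. random substitution probability."""
    return math.log10(p_ij / (p_i * p_j))


# P(A<->C) = 0.14, P(A) = 6/10, P(C) = 1/10
print(log_odds(0.14, 0.6, 0.1))   # about 0.368
```

A positive score means the pair is substituted more often than expected by chance; a negative score means less often.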


PAM250 Matrix (log odds multiplied by 10)

BLOSUM Matrices (Henikoff and Henikoff)

• PAM calculation is based on global alignments
• PAM matrices don’t work well for aligning sequences with little similarity.
• BLOSUM: BLOcks SUbstitution Matrix
• BLOSUM is based on highly conserved local regions / blocks without gaps.


Block 1 Block2


BLOSUM62 Matrix

Significance of Sequence Alignment Score

• Why do we need a significance test?
• Mathematical view: unusual versus “by chance”
• Biological view: evolutionarily related or not?

Randomization Approach

• Randomization is a fundamental idea due to the statistician Fisher.

• Randomly permute the characters within sequences P and Q to generate new sequences (P’ and Q’). Align the random sequences and record the alignment scores.

• Assuming these scores follow a normal distribution, compute the mean (u) and standard deviation (σ) of the alignment scores.

Normal distribution of alignment scores of two sequences

[Figure: normal distribution curve; x-axis: alignment score; central region labeled 95%]

• If S = u + 2σ, the probability of observing an alignment score equal to or more extreme than this by chance is about 2.5%, i.e., P(S >= u + 2σ) ≈ 2.5%. Thus we are 97.5% confident that the alignment score is not due to chance, i.e., it is significant.
• For any score x, we can compute P(S >= x), which is called the p-value.
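The randomization procedure can be sketched in Python as below. The assumptions are loud ones: alignment scores come from a global alignment with the simple scheme match = 1, mismatch = -1, gap = -1, the shuffled scores are treated as normally distributed, and the names `global_score`, `shuffled`, `randomization_p_value`, `trials` and `seed` are all illustrative.

```python
import random
from statistics import NormalDist, mean, stdev


def global_score(p, q, match=1, mismatch=-1, gap=-1):
    """Global alignment score only, keeping one matrix row at a time."""
    prev = [j * gap for j in range(len(q) + 1)]
    for i in range(1, len(p) + 1):
        cur = [i * gap]
        for j in range(1, len(q) + 1):
            s = match if p[i - 1] == q[j - 1] else mismatch
            cur.append(max(prev[j - 1] + s, cur[j - 1] + gap, prev[j] + gap))
        prev = cur
    return prev[-1]


def shuffled(seq, rng):
    """Random permutation of the characters of seq."""
    chars = list(seq)
    rng.shuffle(chars)
    return "".join(chars)


def randomization_p_value(p, q, trials=200, seed=0):
    """Estimate P(S >= observed) from scores of shuffled sequence pairs."""
    rng = random.Random(seed)
    observed = global_score(p, q)
    null_scores = [global_score(shuffled(p, rng), shuffled(q, rng))
                   for _ in range(trials)]
    u, sigma = mean(null_scores), stdev(null_scores)
    p_value = 1.0 - NormalDist(u, sigma).cdf(observed)   # upper tail of the fitted normal
    return observed, u, sigma, p_value


p = "ITAKPAKTTSPKEQAIGLSVTFLSFLLPAGVLYHL"
q = "ITAKPQWLKTSESVTFLSFLLPQTQGLYHL"
observed, u, sigma, p_value = randomization_p_value(p, q)
print(observed, round(u, 2), round(sigma, 2), p_value)
```

Shuffling keeps each sequence's composition and length, so the null scores reflect what unrelated sequences of the same makeup would produce.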


Model-Based Approach (Karlin and Altschul)

http://www.people.virginia.edu/~wrp/cshl02/Altschul/Altschul-3.html

• Extreme Value Distribution

K and λ are statistical parameters that depend on the substitution matrix. For BLOSUM62, λ = 0.252, K = 0.35
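A sketch of the extreme value calculation, assuming the commonly quoted Karlin-Altschul form: the expected number of chance segment pairs scoring at least x is E = K·m·n·e^(-λx), and P(S >= x) = 1 - e^(-E). Here m and n stand for the lengths of the two sequences being compared (an assumption about notation), the default `lam` and `K` are the values quoted above, and the function name is illustrative.

```python
import math


def karlin_altschul_p_value(x, m, n, lam=0.252, K=0.35):
    """P(S >= x) = 1 - exp(-K * m * n * exp(-lam * x))."""
    expected = K * m * n * math.exp(-lam * x)   # expected number of chance hits
    return -math.expm1(-expected)               # 1 - exp(-expected), numerically stable


for score in (30, 60, 90):
    print(score, karlin_altschul_p_value(score, m=100, n=1000))
```

The p-value drops roughly exponentially as the score grows, which is why a modest increase in score can turn a borderline hit into a clearly significant one.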

P-Value

• P(S≥x) is called the p-value. It is the probability that two random sequences have an alignment score >= x.
• Smaller means more significant.
• E-value = p-value * number of sequences

Problems of Using Dynamic Programming to Search Large Sequence Databases

• Searching for homologs in DNA and protein databases is often the first step of a bioinformatics study.

• DP is too slow for searching large sequence databases such as GenBank and UniProt. Each DP search can take hours.

• Most DP search time is wasted on unrelated sequences or dissimilar regions.

• Developing fast, practical sequence comparison methods for database search is important.

Fast Sequence Search Methods

• All successful, rapid sequence comparison methods are based on a simple fact: similar sequences share some common words.
• The first such method is FASTP (Lipman & Pearson, 1985)
• The most widely used methods are BLAST (Altschul et al., 1990) and PSI-BLAST (Altschul et al., 1997).

Basic Local Alignment Search Tool (S. Altschul, W. Gish, W. Miller, E. Myers and D. Lipman)

1. Compile a list of words for a query

2. Scan sequences in database for word hits

3. Extend hits

[Photos: David Lipman, Stephen Altschul]

Step 1: Compile Word List

• Words: w-mers of length w.
• Protein 4-mer and DNA 12-mer

Query: DSRSKGEPRDSGTLQSQEAKAVKKTSLFE

Words: DSRS, SRSK, RSKG, KGEP….
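Compiling the word list amounts to sliding a window of length w along the query. A minimal sketch; the function name `word_list` is illustrative, and it lists only the exact words of the query.

```python
def word_list(query, w):
    """All overlapping words (w-mers) of the query, in order of position."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


query = "DSRSKGEPRDSGTLQSQEAKAVKKTSLFE"
words = word_list(query, 4)
print(words[:3])    # ['DSRS', 'SRSK', 'RSKG']
print(len(words))   # len(query) - 4 + 1 words in total
```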

Step 2: Scan Database

Classic problem: find occurrences of a list of words in a sequence.

• Integer indexing approach (hashing)

• Deterministic finite automaton or finite state machine (faster)
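The hashing approach can be sketched with a Python dictionary that maps each query word to its positions, followed by one pass over the database sequence. The function name `find_word_hits` is illustrative, and the example reuses the query and database sequence from the extension example in this deck.

```python
from collections import defaultdict


def find_word_hits(query, database_seq, w):
    """Return (query_pos, db_pos) pairs where a query word occurs in the database sequence."""
    index = defaultdict(list)                 # word -> positions in the query
    for i in range(len(query) - w + 1):
        index[query[i:i + w]].append(i)
    hits = []
    for d in range(len(database_seq) - w + 1):
        # .get avoids inserting new keys for words that are not in the query
        for q_pos in index.get(database_seq[d:d + w], ()):
            hits.append((q_pos, d))
    return hits


query = "DSRSKGEPRDSGTLQSQEAKAVKKTSLFE"
database_seq = "PESRSKGEPRDSGKKQMDSOKPD"
hits = find_word_hits(query, database_seq, 4)
print(hits[0])     # (1, 2): the word SRSK
print(len(hits))   # 8
```

Each lookup takes constant expected time, so the scan is linear in the length of the database sequence plus the number of hits.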

Finite State Machine

• A finite state machine for string matching (input alphabet: a, b, c)

Word: abab

[Figure: state diagram with states 0, 1, 2, 3, 4 and transitions labeled a and b]

Database sequence: bcabccaaababacababacabb
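A finite state machine for a single word can be built and run as sketched below. State k means that the last k characters read match the first k characters of the word, and state 4 is the accepting state for "abab". The function names are illustrative, and the sketch handles one word only, whereas a database scan would need a machine that recognizes the whole word list.

```python
def build_dfa(word, alphabet):
    """dfa[state][ch] = next state after reading ch in the given state."""
    m = len(word)
    dfa = []
    for state in range(m + 1):
        row = {}
        for ch in alphabet:
            seen = word[:state] + ch
            k = min(m, len(seen))
            # Longest prefix of the word that is a suffix of what has been read.
            while k > 0 and not seen.endswith(word[:k]):
                k -= 1
            row[ch] = k
        dfa.append(row)
    return dfa


def find_matches(text, word, alphabet):
    """Start positions of every occurrence of word in text (one pass, no backtracking)."""
    dfa, m = build_dfa(word, alphabet), len(word)
    state, starts = 0, []
    for i, ch in enumerate(text):
        state = dfa[state].get(ch, 0)   # characters outside the alphabet reset the machine
        if state == m:                  # accepting state: the word ends at position i
            starts.append(i - m + 1)
    return starts


print(find_matches("bcabccaaababacababacabb", "abab", "abc"))   # [8, 14]
```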

Step 3: Extension

• Extend words on both ends
• Terminate the process when we reach a segment pair whose score falls a certain distance below the best score found for shorter extensions.
• This departs from the ideal of finding a guaranteed Maximum Segment Pair (MSP), but the added inaccuracy is negligible.
• Report significant MSPs according to the extreme value distribution

An Example of Extension

Query: DSRSKGEPRDSGTLQSQEAKAVKKTSLFE

Words: DSRS, SRSK, RSKG, KGEP….

Database Sequence: PESRSKGEPRDSGKKQMDSOKPD

Maximum Segment Pair: ESRSKGEPRDSG
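The extension step can be sketched as an ungapped extension with a drop-off threshold. Two assumptions are made here: scoring is the simple +1 / -1 identity scheme rather than a substitution matrix, and the parameter `x_drop` plays the role of the "certain distance" below the best score. The function name `extend_hit` is illustrative. Under identity scoring the leading E/D pair counts as a mismatch, so this sketch returns SRSKGEPRDSG; a substitution matrix that scores E against D positively would extend one position further to the left.

```python
def extend_hit(query, db, q, d, w, x_drop=2, match=1, mismatch=-1):
    """Extend a word hit at query[q:q+w] / db[d:d+w] in both directions without gaps."""
    def pair(a, b):
        return match if a == b else mismatch

    best = sum(pair(query[q + k], db[d + k]) for k in range(w))
    # Extend to the right, remembering the length that gave the best score.
    cur, best_right, k = best, 0, 0
    while q + w + k < len(query) and d + w + k < len(db):
        cur += pair(query[q + w + k], db[d + w + k])
        k += 1
        if cur > best:
            best, best_right = cur, k
        elif best - cur > x_drop:      # fell too far below the best score: stop
            break
    # Extend to the left from the best right-extended segment.
    cur, best_left, k = best, 0, 0
    while q - 1 - k >= 0 and d - 1 - k >= 0:
        cur += pair(query[q - 1 - k], db[d - 1 - k])
        k += 1
        if cur > best:
            best, best_left = cur, k
        elif best - cur > x_drop:
            break
    q_seg = query[q - best_left:q + w + best_right]
    d_seg = db[d - best_left:d + w + best_right]
    return best, q_seg, d_seg


query = "DSRSKGEPRDSGTLQSQEAKAVKKTSLFE"
db = "PESRSKGEPRDSGKKQMDSOKPD"
print(extend_hit(query, db, q=1, d=2, w=4))   # (11, 'SRSKGEPRDSG', 'SRSKGEPRDSG')
```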

P-Value and E-Value

• P-value = Prob(score >= S)
• E-value = database size * p-value
• Common threshold: 0.01

Why is BLAST so Successful?

• Address a fundamental problem
• Extremely fast
• Simple, yet powerful idea
• Well founded in computer theory (words-string matching, hashing, random process of Maximum Segment Pair)
• Implementation tricks -> super speed software
• Sacrifice a little accuracy for speed practically (good heuristics). Realism


NCBI Online Blast


DNA Blast


Protein Blast

Output Format

Matched sequences ranked by score and E-value

Significant local alignments

Application of Pairwise Sequence Comparison

• Database search
• Clustering
• Classification
• Motif recognition