SPLASH: Structural Pattern Localization Analysis by Sequential Histograms A. Califano, IBM TJ Watson...

33
SPLASH: S tructural P attern L ocalization A nalysis by S equential H istograms A. Califano, IBM TJ Watson Presented by Tao Tao April 14 th , 2004

Transcript of SPLASH: Structural Pattern Localization Analysis by Sequential Histograms A. Califano, IBM TJ Watson...

Page 1: SPLASH: Structural Pattern Localization Analysis by Sequential Histograms A. Califano, IBM TJ Watson Presented by Tao Tao April 14 th, 2004.

SPLASH: Structural Pattern Localization Analysis by Sequential HistogramsA. Califano, IBM TJ Watson

Presented by Tao Tao

April 14th, 2004

Page 2: SPLASH: Structural Pattern Localization Analysis by Sequential Histograms A. Califano, IBM TJ Watson Presented by Tao Tao April 14 th, 2004.

Motif: A functional regain of a DNA or protein sequence

How to discover the functional regains automatically?

Amino Acids sequences

Page 3: SPLASH: Structural Pattern Localization Analysis by Sequential Histograms A. Califano, IBM TJ Watson Presented by Tao Tao April 14 th, 2004.

Automatic Motif discovery

Problem- Use A, B, C, … stands for different amino acids- A protein sequence: ABABAABCDBAA…- Motifs are certain patterns in sequences

for example: ABCA Previous Methods: small scale discovery- Several sequences similar functions alignment Can we use data mining to generate motifs

candidates first?

Page 4: SPLASH: Structural Pattern Localization Analysis by Sequential Histograms A. Califano, IBM TJ Watson Presented by Tao Tao April 14 th, 2004.

Automatically discover motifs: What properties should a motif have? It has a specific function conservative

frequent appearing in sequences Evolution likely not continually identical

For example: ABCBABABA

AB--ABAB- string matching, suffix tree …

AB-BAB-B- how?

Page 5: SPLASH: Structural Pattern Localization Analysis by Sequential Histograms A. Califano, IBM TJ Watson Presented by Tao Tao April 14 th, 2004.

Formal problem definition

Input: A string of characters: S=s1s2,…,sL

Output: A frequent pattern: (∑ U ●)* ●: a wild card to match a single character, ∑: a full character * : repeat arbitrary times

Note: NO arbitrary-length gap.

ABCD, AED are different

Page 6: SPLASH: Structural Pattern Localization Analysis by Sequential Histograms A. Califano, IBM TJ Watson Presented by Tao Tao April 14 th, 2004.

Regular Expression: to describe a certain type of patterns | or : A|B means A or B ● wild card to match any characters

A●B means: AAB, ABB, ACB, … * to repeat any times (including 0 times)

(AB)* means null, AB, ABAB, ABABAB, … + to repeat any times (not including 0 times) …

Page 7: SPLASH: Structural Pattern Localization Analysis by Sequential Histograms A. Califano, IBM TJ Watson Presented by Tao Tao April 14 th, 2004.

Any requirements for output patterns? Can wild card be anywhere? Do we need

some constraints on wild cards?

What means “frequent”?

How long should a qualified pattern be?

Page 8: SPLASH: Structural Pattern Localization Analysis by Sequential Histograms A. Califano, IBM TJ Watson Presented by Tao Tao April 14 th, 2004.

Can wild card be anywhere?

A pattern can have ●: for example A●BA●●B But, A●●●●●●●●●●●●●●●●BBA ??

Probably, it cannot be too “sparse”… Naïve solution: no more than n ● But, for example n=5

A●●●●●B : 5 ●

A●BB●●A●●B●●A : 7●

Page 9: SPLASH: Structural Pattern Localization Analysis by Sequential Histograms A. Califano, IBM TJ Watson Presented by Tao Tao April 14 th, 2004.

Given a pattern P, any length l0 region in P must have k0 full characters

Example: l0 = 5, k0 = 3

s1s2s3s4s5s6s7s8s9……

Density: how “sparse” do we allow?

Two ● at most

Page 10: SPLASH: Structural Pattern Localization Analysis by Sequential Histograms A. Califano, IBM TJ Watson Presented by Tao Tao April 14 th, 2004.

Given a pattern P, any length l0 region in P must have k0 full characters

Example: l0 = 5, k0 = 3

s1s2s3s4s5s6s7s8s9……

Density: how “sparse” do we allow?

Two ● at most

Page 11: SPLASH: Structural Pattern Localization Analysis by Sequential Histograms A. Califano, IBM TJ Watson Presented by Tao Tao April 14 th, 2004.

Given a pattern P, any length l0 region in P must have k0 full characters

Example: l0 = 5, k0 = 3

s1s2s3s4s5s6s7s8s9……

Density: how “sparse” do we allow?

Two ● at most

Page 12: SPLASH: Structural Pattern Localization Analysis by Sequential Histograms A. Califano, IBM TJ Watson Presented by Tao Tao April 14 th, 2004.

Given a pattern P, any length l0 region in P must have k0 full characters

Example: l0 = 5, k0 = 3

s1s2s3s4s5s6s7s8s9……

A●●ABB●A √ BA●●A●BB X

Density: how “sparse” do we allow?

Page 13: SPLASH: Structural Pattern Localization Analysis by Sequential Histograms A. Califano, IBM TJ Watson Presented by Tao Tao April 14 th, 2004.

Frequency and length

At least, the patterns have K0 full characters repeating J0 times

Example J0 =3

ABCBABABA √

ABCBABABA X Example K0 =3

ABCBABABA X

ABCBABABA √

Page 14: SPLASH: Structural Pattern Localization Analysis by Sequential Histograms A. Califano, IBM TJ Watson Presented by Tao Tao April 14 th, 2004.

Summary of parameters for a pattern Sequence S, and its length L Pattern P, K full character, appears J times Length constraints: K ≥ K0

Frequency constraint: J ≥ J0

Density constraint: l0 , k0

Page 15: SPLASH: Structural Pattern Localization Analysis by Sequential Histograms A. Califano, IBM TJ Watson Presented by Tao Tao April 14 th, 2004.

Apriori property

A constraint has a-priori property means:

If a set violates this constraint, any its superset will violate this constraint as well.

For example max(S) < 5

Frequency constraint has a-priori property!

For example, BA●A●BB appears less than J0 times, any its super patterns CANNOT appears more than J0 times!

Page 16: SPLASH: Structural Pattern Localization Analysis by Sequential Histograms A. Califano, IBM TJ Watson Presented by Tao Tao April 14 th, 2004.

A whole picture of the algorithm To form longer pattern only from short qualified patterns.- First, to generate candidates/seed (length l0): every seed

should repeat at least J0 times

- To generate longer patterns from short patterns, iteratively

1. Two patterns are together

2. Longer patterns repeat at least J0

……

Page 17: SPLASH: Structural Pattern Localization Analysis by Sequential Histograms A. Califano, IBM TJ Watson Presented by Tao Tao April 14 th, 2004.

Generate the seeds: enumerating … To generate seeds (shortest patterns) first

ABAABBCBACBDB… J0=4 A: 4, B: 6, C:2, D: 1 Are length 1 seeds too short? How long could those

seeds be? - Too long: enumerating costs too much time- Too short: maybe not efficient, also not consider the

density constraints Maybe, we should start from the patterns with length

l0 .

Page 18: SPLASH: Structural Pattern Localization Analysis by Sequential Histograms A. Califano, IBM TJ Watson Presented by Tao Tao April 14 th, 2004.

How to generate seeds with length l0 ? Give l0 and k0, and character sets ABC… Enumerating all possible patterns with length l0 Scan the sequence the count the frequency

For example, l0=3, k0=2, ABCAAA, AAB, AAC, ABA, …

AA●, AB●, AC●, … A●A, A●B, A●C, … …

Page 19: SPLASH: Structural Pattern Localization Analysis by Sequential Histograms A. Califano, IBM TJ Watson Presented by Tao Tao April 14 th, 2004.

Can we do it more efficiently? Give l0 and k0,

1: full character 0:wild card

Enumerating all possible patterns by 1 and 0? Example l0=5, k0=3, to find comb

11111, 11110, 11101, 11100, 11011,

11010, 11001, 10111, 10110, 10101

10011, 01111, 01110, 01101, 01011,

00111

Page 20: SPLASH: Structural Pattern Localization Analysis by Sequential Histograms A. Califano, IBM TJ Watson Presented by Tao Tao April 14 th, 2004.

How to use comb? For example 10101A B A A B A B B A B A B A B B A A B

A●A●B

Page 21: SPLASH: Structural Pattern Localization Analysis by Sequential Histograms A. Califano, IBM TJ Watson Presented by Tao Tao April 14 th, 2004.

How to use comb? For example 10101A B A A B A B B A B A B A B B A A B

B●A●A

Page 22: SPLASH: Structural Pattern Localization Analysis by Sequential Histograms A. Califano, IBM TJ Watson Presented by Tao Tao April 14 th, 2004.

How to use comb? For example 10101A B A A B A B B A B A B A B B A A B

A●B●B

Page 23: SPLASH: Structural Pattern Localization Analysis by Sequential Histograms A. Califano, IBM TJ Watson Presented by Tao Tao April 14 th, 2004.

How to use comb? For example 10101A B A A B A B B A B A B A B B A A B

A●A●B 3, B●A●A 2, A●B●B 2B●B●A 2, B●B●B 2, A●A●A 1A●B●A 1, B●A●B 1

J0 = 3? only A●A●B left

By the same way, use others combs to generate other seeds, different combs won’t generate the same patterns

Page 24: SPLASH: Structural Pattern Localization Analysis by Sequential Histograms A. Califano, IBM TJ Watson Presented by Tao Tao April 14 th, 2004.

How to get long patterns?

Long pattern two patterns could be merged need short patterns and their locations

- Pattern: A●B●●C {A:0,B:2,C:5}- Locus: the locations where a pattern occurs:

Patten AB in string ABBCABAB

Its locus {0, 4, 6}

Page 25: SPLASH: Structural Pattern Localization Analysis by Sequential Histograms A. Califano, IBM TJ Watson Presented by Tao Tao April 14 th, 2004.

Append operation: to connect two small patterns to a longer patternPatten S1: A●●B●C and S2: B●D●

S1S2: A●●B●CB●D●

conditional on: Their locus have intersection

S1 locus: {1, 20, 32, 57 …}

S2 locus: {7, 13, 38, 63 … } {1,7,32,57,…}

S1S2 locus: {1,32,57,…}

-6

Page 26: SPLASH: Structural Pattern Localization Analysis by Sequential Histograms A. Califano, IBM TJ Watson Presented by Tao Tao April 14 th, 2004.

Add: to make the patterns more “dense” Patten A●●B●C●●●D and ●●●B●CE

A●●B●CE●●D

on the conditions:

Their locus (with shifting) have intersection

Page 27: SPLASH: Structural Pattern Localization Analysis by Sequential Histograms A. Califano, IBM TJ Watson Presented by Tao Tao April 14 th, 2004.

Significance: whether it can be generated by randomly sampling ? hypothesis: A pattern is not randomly generated

Given: character set: {A,B,C,D,E} sequence length: L A pattern: A●BA●AA●B Its frequency jProbability to generate this pattern j times pure randomly?

Page 28: SPLASH: Structural Pattern Localization Analysis by Sequential Histograms A. Califano, IBM TJ Watson Presented by Tao Tao April 14 th, 2004.

Statistical significance

Pure random sampling, the frequency should satisfy normal distribution

Z score, (A-E[A])/σA --- normalized into N(0,1)

Page 29: SPLASH: Structural Pattern Localization Analysis by Sequential Histograms A. Califano, IBM TJ Watson Presented by Tao Tao April 14 th, 2004.

Experiments

Two questions to answer.- How efficient is this algorithm?- How effective is this algorithm?

Baseline algorithm- PRATT(EBI), MEME(UCSD)

Page 30: SPLASH: Structural Pattern Localization Analysis by Sequential Histograms A. Califano, IBM TJ Watson Presented by Tao Tao April 14 th, 2004.

Efficiency

SPLASH

PRATT

Page 31: SPLASH: Structural Pattern Localization Analysis by Sequential Histograms A. Califano, IBM TJ Watson Presented by Tao Tao April 14 th, 2004.

Effectiveness

Search against SWISS-PROT Rel. 36,

578 GPCR proteins returned, only 4 false positive

MEME cannot find it, PRATT program crashed

Page 32: SPLASH: Structural Pattern Localization Analysis by Sequential Histograms A. Califano, IBM TJ Watson Presented by Tao Tao April 14 th, 2004.

Conclusions

Deterministic algorithm: It can discover all patterns satisfying the requirements

Efficient and scalable: It beats PRATT and MEME. More scalable …

Effective: It can discover useful patterns.

Page 33: SPLASH: Structural Pattern Localization Analysis by Sequential Histograms A. Califano, IBM TJ Watson Presented by Tao Tao April 14 th, 2004.

Problems

All problems that A-priori algorithm could have: too many results, cannot really avoid worse-case exponential …

Doesn’t really consider the 3D structure of proteins

The software crashes sometimes