Issues in the Discovery and Use of Motif Patterns

Alberto Apostolico, University of Padova and Purdue University

A. Apostolico - AofA04

General Form of Pattern Discovery

• Find and exploit a priori unknown patterns, or associations thereof, in a database

• With some prior domain-specific knowledge

• Without any domain-specific prior knowledge

• Tenet: a pattern or association (rule) that occurs more frequently than one would expect is potentially informative and thus interesting (frequent = interesting)

Motifs

A motif is a recurring pattern with some solid and some ``don't care'' characters or ``gaps''.

Typical PROBLEM

Input: textstring

Output: repeated motifs

[Figure: the textstring TAAGAGGTAGATAGT, with solid and ``don't care'' characters marked]

Motif discovery is beset by the circumstance that typically there are exponentially many candidate motifs in a sequence

Motifs

A motif is a recurring pattern with some solid and some ``don't care'' characters or ``gaps'', together with its list of occurrences.

Self-correlation Motifs

[Figure: aligned textstrings over the alphabet {A, B, C, D, G}, with solid and ``don't care'' characters marked]

Controlling Motif Growth: Redundant Motifs (Parida)

A motif is

• maximal in composition if specifying more solid characters implies an alteration to its occurrence list

• maximal in length if making the motif longer implies an alteration to the cardinality or displacement of its occurrence list

A maximal motif such that the motif and its list can be inferred from studying other motifs is redundant.

Maximal, Redundant, Irredundant Motifs (examples)

Let s = abcdabcd

m_1 = ab with L_1 = { 1, 5 }
m_2 = bc with L_2 = { 2, 6 }
m_3 = cd with L_3 = { 3, 7 }
m_4 = abc with L_4 = { 1, 5 }
m_5 = bcd with L_5 = { 2, 6 }
m_6 = abcd with L_6 = { 1, 5 }

Notice that L_1 = L_4 = L_6 and L_2 = L_5.

Denoting by L + i the list of all j + i such that j is in L, we have L_5 = L_6 + 1 and L_3 = L_6 + 2.

Motif m_6 is maximal, as |m_6| > |m_1|, |m_4| and |m_5| > |m_2|. Motifs m_1, m_2, m_3, m_4 and m_5 are non-maximal.
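As a sanity check on the occurrence lists of this example, here is a minimal Python sketch that recomputes them for s = abcdabcd. The helper names `occurrences` and `shift` are illustrative only and do not come from the talk; positions are 1-based, as on the slide.

```python
def occurrences(pattern: str, text: str) -> list[int]:
    """1-based start positions of every (possibly overlapping) occurrence."""
    return [i + 1
            for i in range(len(text) - len(pattern) + 1)
            if text.startswith(pattern, i)]


def shift(positions: list[int], i: int) -> list[int]:
    """The list L + i: every position of L moved i places to the right."""
    return [j + i for j in positions]


s = "abcdabcd"
motifs = ["ab", "bc", "cd", "abc", "bcd", "abcd"]
lists = {m: occurrences(m, s) for m in motifs}

for m in motifs:
    print(m, lists[m])

# The shift relations stated on the slide.
print(lists["bcd"] == shift(lists["abcd"], 1))
print(lists["cd"] == shift(lists["abcd"], 2))
```

Because ab, abc and abcd share one occurrence list, the shorter two can be extended without losing any occurrence, which is why only abcd is maximal.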

Maximal, Redundant, Irredundant Motifs (examples, cont.)

Let s = aaXbaYdZZZaaVbaWcXXXXaaYbdXc

m_1 = aa.b with L_1 = { 1, 11, 22 }
m_2 = aa.ba with L_2 = { 1, 11 }
m_3 = aa.b..c with L_3 = { 11, 22 }

m_1 = aa.b is redundant, since 1) m_1 is a sub-motif of m_2 and of m_3 and 2) L_1 is the union of L_2 and L_3.
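The redundancy argument can be replayed mechanically. The following is a minimal sketch, assuming that '.' denotes a don't-care character and that positions are 1-based. The helpers `occurrences` and `is_prefix_submotif` are hypothetical names, and the sub-motif test is simplified to motifs aligned at their first character. The third motif is spelled aa.b..c here, with two don't cares before the final c, since that is the spelling that matches s at positions 11 and 22.

```python
DONT_CARE = "."


def occurrences(motif: str, text: str) -> list[int]:
    """1-based positions where the motif matches; '.' matches any character."""
    hits = []
    for i in range(len(text) - len(motif) + 1):
        if all(m == DONT_CARE or m == text[i + k] for k, m in enumerate(motif)):
            hits.append(i + 1)
    return hits


def is_prefix_submotif(small: str, big: str) -> bool:
    """True if every solid character of `small` sits at the same offset in `big`."""
    return len(small) <= len(big) and all(
        a == DONT_CARE or a == b for a, b in zip(small, big))


s = "aaXbaYdZZZaaVbaWcXXXXaaYbdXc"
m1, m2, m3 = "aa.b", "aa.ba", "aa.b..c"
L1, L2, L3 = (occurrences(m, s) for m in (m1, m2, m3))

# m1 is redundant: it is a sub-motif of m2 and m3, and L1 = L2 union L3.
redundant = (is_prefix_submotif(m1, m2)
             and is_prefix_submotif(m1, m3)
             and sorted(set(L2) | set(L3)) == L1)
print(L1, L2, L3, redundant)
```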

Controlling Motif Growth: HOW MANY Irredundant Motifs

Recall that a motif is

• maximal in composition if specifying more solid characters implies an alteration to its occurrence list

• maximal in length if making it longer implies an alteration to the cardinality of its occurrence list

A maximal motif such that the motif and its list can be inferred from studying other motifs is redundant.

A motif that occurs at least k times in the textstring is a k-motif.

Theorem: In any textstring x, the number of irredundant 2-motifs is O(|x|).

PROBLEM: How to find irredundant motifs as fast as possible

Suffix Consensus, Suffix Meet

The consensus of suf1 and suf4 is not a motif.

The meet of suf1 and suf4 is a maximal motif.

[Figure: s = suf1 aligned against suf4, over the alphabet {a, b, c}]
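The slide does not spell out the two operations, so the sketch below rests on an assumption: the consensus of two suffixes is taken to be their position-by-position alignment with every mismatch replaced by a don't care, and the meet is the consensus stripped of its leading and trailing don't cares. The function names and the sample string are illustrative choices, not taken from the figure.

```python
DONT_CARE = "."


def suffix(s: str, i: int) -> str:
    """The suffix of s starting at 1-based position i."""
    return s[i - 1:]


def consensus(u: str, v: str) -> str:
    """Align u and v at their first characters; mismatches become don't cares."""
    return "".join(a if a == b else DONT_CARE for a, b in zip(u, v))


def meet(u: str, v: str) -> str:
    """The consensus without its leading and trailing don't cares."""
    return consensus(u, v).strip(DONT_CARE)


s = "aaXbaYdZZZaaVbaWcXXXXaaYbdXc"
c_1_22 = consensus(suffix(s, 1), suffix(s, 22))
m_1_22 = meet(suffix(s, 1), suffix(s, 22))
m_11_22 = meet(suffix(s, 11), suffix(s, 22))
print(c_1_22, m_1_22, m_11_22)
```

Under these assumptions the consensus of suf1 and suf22 ends in don't cares, so it is not itself a motif, while the corresponding meet begins and ends with a solid character.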

Theorem: Every irredundant 2-motif of x is the meet of two suffixes of x.

Data Compression by Textual Substitution

1. Detect repeated patterns
2. Set up a dictionary
3. Use pointers to the dictionary to encode replicas

• Most schemes are NP-complete (Storer, 78)

• Few exceptions (LZ is linear)

LZW

LZW PARADIGM: build a dictionary trie as you scan the input

ROUTINE

• Find the next phrase as the longest matching entry in the trie

• Add to the trie the unit symbol extension of this phrase

Magics:

• It works, no need to send the trie

• Coding and decoding are symmetric
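The routine and its two "magics" can be illustrated with a textbook LZW coder. This is a minimal sketch rather than the code behind the talk: the trie is stored as a dictionary of edges, node ids double as output codes, and the names `lzw_encode` and `lzw_decode` are illustrative. It assumes every input symbol belongs to the given alphabet.

```python
def lzw_encode(text: str, alphabet: str) -> list[int]:
    # Trie edges: (parent node, symbol) -> child node. Root is None.
    trie = {(None, ch): i for i, ch in enumerate(alphabet)}
    codes = []
    node = None
    for ch in text:
        if (node, ch) in trie:
            node = trie[(node, ch)]          # keep extending the current phrase
        else:
            codes.append(node)               # longest matching entry found
            trie[(node, ch)] = len(trie)     # add its unit symbol extension
            node = trie[(None, ch)]          # restart from the mismatching symbol
    if node is not None:
        codes.append(node)
    return codes


def lzw_decode(codes: list[int], alphabet: str) -> str:
    phrases = list(alphabet)                 # phrases[code] is the decoded string
    if not codes:
        return ""
    prev = phrases[codes[0]]
    out = [prev]
    for code in codes[1:]:
        if code < len(phrases):
            cur = phrases[code]
        else:                                # code of the phrase being defined now
            cur = prev + prev[0]
        phrases.append(prev + cur[0])        # same entry the encoder added
        out.append(cur)
        prev = cur
    return "".join(out)


codes = lzw_encode("abababab", "ab")
print(codes, lzw_decode(codes, "ab"))
```

The decoder rebuilds the same dictionary from the codes alone, which is why the trie never has to be transmitted.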

Fast and Lossy is Hard

``All universal lossy coding schemes found to date lack the relative simplicity that imbues Lempel-Ziv codes and arithmetic codes with economic viability. Perhaps as a consequence of the fact that approximate matches abound whereas exact matches are unique, it is inherently much faster to look for an exact match than it is to search from a plethora of approximate matches looking for the best, or even nearly the best, among them. The right way to trade off search effort in a poorly understood environment against the degree to which the product of the search possesses desired criteria has long been a human enigma. This suggests it is unlikely that the ``holy grail'' of implementable universal lossy source coding will be discovered soon.''

T. Berger and J.D. Gibson, ``Lossy Source Coding,'' IEEE Trans. on Inform. Theory, vol. 44, no. 6, pp. 2693--2723, 1998.

Why Fast and Lossy is Hard

Routine: Find the longest prefix of the incoming string that matches a past occurrence within some distortion

PROBLEMS

• Defining the gaps

• Encoding where the gaps are

• Finding the longest match
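To make the cost of the routine concrete, here is a brute-force sketch of the longest-match step. The talk leaves the distortion measure open, so this sketch assumes a budget of at most d mismatched symbols (Hamming distortion) and restricts matches to lie entirely inside the already-seen text. The function name and its parameters are hypothetical.

```python
def longest_match_within(history: str, incoming: str, d: int) -> tuple[int, int]:
    """Return (start, length): the longest prefix of `incoming` that matches
    history[start:start + length] with at most d mismatches.
    Ties are resolved in favour of the earliest start."""
    best = (0, 0)
    for start in range(len(history)):
        mismatches = 0
        length = 0
        for k in range(min(len(incoming), len(history) - start)):
            if history[start + k] != incoming[k]:
                if mismatches == d:
                    break                    # distortion budget exhausted
                mismatches += 1
            length = k + 1
        if length > best[1]:
            best = (start, length)
    return best


exact = longest_match_within("abcdabcd", "abXdabYY", 0)
lossy = longest_match_within("abcdabcd", "abXdabYY", 1)
print(exact, lossy)
```

Every start position has to be tried and extended separately, so this version takes time quadratic in the worst case; that is the gap with respect to the exact-match trie walk of LZW.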


Towards an Online LZW Using Motifs

[Figures not recovered: slides titled ``lzw'' and ``lzw-a'']

Original LZW parse of [example not recovered]

A motif-driven LZW parse of [example not recovered]

• Lossless with resolvers

• Lossy without

Motif LZW, results

[Results not recovered]

Motif Disambiguation

• By guessing

DESCRIPTION OF FARMER OAK -- AN INCIDENT When Farmer Oak smile., the corners .fhis mouth spread till the. were within an unimportant distance .f his ears, his eye. were reduced to chinks, and ...erging wrinkle—red round them, extending upon... countenance li.e the rays in a rudimentary sketch of the rising sun. HisChristian name was Gabriel, and on working days he was a young man of soundjudgment,easy motions, proper dress, and ...eral good character. On Sundays,he was a man of misty views rather given to postponing, and .ampered by his bestclotes and umbrella : upon ... whole, one who felt himself to occupy morally that... middle space of Laodicean neutrality which ... between the Communion people ofthe parish and the drunken section, -- that ... he went to church, but yawnedprivately by the t.ime the cong.egation reached the Nicene creed,- and thoughtof what there would be for dinner when he meant to be listening to the sermon.

DESCRIPTION OF FARMER OAK -- AN INCIDENT When Farmer Oak smiled, the corners of his mouth spread till they were within an unimportant distance of his ears, his eyes were reduced to chinks, and diverging wrinkles appeared round them, extending upon his countenance like the rays in a rudimentary sketch of the rising sun. His Christian name was Gabriel, and on working days he was a young man of sound judgment, easy motions, proper dress, and general good character. On Sundays he was a man of misty views, rather given to postponing, and hampered by his best clothes and umbrella: upon the whole, one who felt himself to occupy morally that vast middle space of Laodicean neutrality which lay between the Communion people of the parish and the drunken section, -- that is, he went to church, but yawned privately by the time the congregation reached the Nicene creed,- and thought of what there would be for dinner when he meant to be listening to the sermon.

Motif Resolution

• By completion (bilateral contexts are better predictors)

• By interpolation at the receiver (images and sounds)

Expected match length within distortion

[Derivation not recovered]

Giving up on ``longest match''

1. Expected length with exact matches

2. Expected length with distortion d

[Formulas not recovered]

…but LZ works because phrases are DISTINCT

• A most crowded parse

• Achieves maximum number of phrases in a parse

• #phrases < n / log n
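The distinctness of phrases is easy to observe on a small incremental parse. The sketch below uses an LZ78-style parse, in which each new phrase is the shortest prefix of the remaining input not yet in the vocabulary; the function name is hypothetical. The last phrase can be cut short by the end of the input and may then repeat an earlier one. The n / log n bound is asymptotic, so the sketch only counts phrases and does not test the bound.

```python
def lz78_phrases(text: str) -> list[str]:
    """Parse text into phrases, each new phrase being one symbol longer
    than some earlier phrase (or a single new symbol)."""
    seen = set()
    phrases = []
    cur = ""
    for ch in text:
        cur += ch
        if cur not in seen:
            seen.add(cur)
            phrases.append(cur)
            cur = ""
    if cur:
        phrases.append(cur)   # leftover at end of input; may repeat a phrase
    return phrases


unary = lz78_phrases("a" * 55)
mixed = lz78_phrases("abababab")
print(len(unary), mixed)
```

On the unary string the phrases have lengths 1, 2, 3 and so on, the slowest possible vocabulary growth, which is why the number of phrases stays small relative to n.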


LZWA is not necessarily better than LZW

x = aaaaaaaaaaaa………a

But compare vocabularies under alphabet compression

LZW versus Lossy LZWA (Comparing Vocabulary Build-ups)

[Figures not recovered]

Conclusions

- Self-correlation motifs give versatile compression schemata for a variety of inputs

- ``Plier la machine'' approach, bridges lossless and lossy

- Linear-time lossy variant with reasonable performance

- Deeper analysis, broad experimentation with fine-tuned variants, and several extensions are needed; some are under way

``Analyze That'', please

Thank you

Main References

• A. Apostolico, ``Pattern Discovery and the Algorithmics of Surprise'', Proceedings of the NATO ASI on Artificial Intelligence and Heuristic Methods for Bioinformatics (P. Frasconi and R. Shamir, eds.), IOS Press, 111--127 (2003).

• A. Apostolico and L. Parida, ``Incremental Paradigms of Motif Discovery'', Journal of Computational Biology 11:1, 15--25 (2004).

• A. Apostolico, M. Comin and L. Parida, ``Motifs in Ziv-Lempel-Welch Clef'', Proceedings of IEEE DCC Data Compression Conference, pp. 72--81 (2004).

• A. Apostolico, ``Fast Gapped Variants for Lempel-Ziv-Welch Compression'', in preparation.