Minimized compact automaton for clumps over degenerate patterns Evgeniia Furletova*, Jan Holub,...
Minimized compact automaton for clumps over degenerate patterns
Evgeniia Furletova*, Jan Holub, Mireille Regnier
November 27, 2015
Institute of mathematical problems in biology, Russia
Collaborators
Mireille Regnier, Ecole Polytechnique, INRIA, France
Jan Holub, Czech Technical University in Prague, Czech Republic
Challenge
Problem. Design an efficient algorithm for computing the P-value of pattern occurrences.
Recognition of functional fragments in biological sequences can be reduced to finding overrepresented occurrences of a pattern.
A measure of overrepresentation is the P-value of pattern occurrences.
P-value of pattern occurrences
The P-value is the probability of finding at least one occurrence of a word from a pattern H in a random sequence of length n generated according to a given probability model.
For a Bernoulli model, the P-value can be approximated by the formula*:
$$\text{P-value} \approx 1 - \frac{1}{(1 - C'(\rho))\,\rho^{n+1}}$$
• C(z): generating function of clumps
• ρ: the root of 1 − z + C(z) = 0 closest to 1
* Regnier M., Fang B., Iakovishina D. Clump Combinatorics, Automata, and Word Asymptotics // Proceedings of the Eleventh Workshop on Analytic Algorithmics and Combinatorics (ANALCO). 2014
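The approximation above can be checked numerically on a toy case. The following is a minimal sketch, not the authors' implementation. It assumes that z marks text length and that, for a single word of length m and probability p with no self-overlap, C(z) reduces to p·z^m, so that 1/(1 − z + p·z^m) is the classical generating function of texts avoiding the word. The function names and the use of Newton iteration started at z = 1 to reach the root closest to 1 are illustrative choices.

```python
def poly_eval(coeffs, z):
    """Evaluate a polynomial given by its coefficients, lowest degree first."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * z + c
    return result


def poly_deriv(coeffs):
    """Coefficients of the derivative, lowest degree first."""
    return [k * c for k, c in enumerate(coeffs)][1:]


def root_near_one(coeffs, tol=1e-14, max_iter=100):
    """Newton iteration for 1 - z + C(z) = 0, started at z = 1 (assumption)."""
    deriv = poly_deriv(coeffs)
    z = 1.0
    for _ in range(max_iter):
        step = (1.0 - z + poly_eval(coeffs, z)) / (-1.0 + poly_eval(deriv, z))
        z -= step
        if abs(step) < tol:
            break
    return z


def p_value_approx(coeffs, n):
    """P-value ~ 1 - 1 / ((1 - C'(rho)) * rho^(n+1))."""
    rho = root_near_one(coeffs)
    c_prime = poly_eval(poly_deriv(coeffs), rho)
    return 1.0 - 1.0 / ((1.0 - c_prime) * rho ** (n + 1))


def p_value_exact_single_word(p, m, n):
    """Exact P-value for one word without self-overlap (toy reference).

    The no-occurrence probabilities satisfy a_i = a_(i-1) - p * a_(i-m).
    """
    a = [1.0] * m  # texts shorter than m cannot contain the word
    for i in range(m, n + 1):
        a.append(a[i - 1] - p * a[i - m])
    return 1.0 - a[n]


# Toy case: the word AC over a uniform DNA alphabet, so p = 1/16 and m = 2.
clump_gf = [0.0, 0.0, 1.0 / 16.0]  # C(z) = z^2 / 16 (assumption, see above)
approx = p_value_approx(clump_gf, 20)
exact = p_value_exact_single_word(1.0 / 16.0, 2, 20)
```

For this toy case the two values agree closely, because the second root of the quadratic is far from 1 and contributes a negligible term.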
Clumps
A k-clump for a pattern H = {h1,…,hr} is a string s such that:
• s consists of k overlapping occurrences of H
• any two consecutive letters of s belong to an occurrence of H
Examples of clumps for the pattern ACATTACA:
• ACATTACA: 1-clump
• ACATTACATTACACATTACA: 3-clump (occurrences of ACATTACA start at positions 1, 6 and 13)
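The clump definition can be illustrated with a short sketch that locates occurrences of the pattern words in a text and merges overlapping ones into maximal clumps. This is a naive scan written for illustration only. The function names are hypothetical, and positions are 0-based with exclusive ends.

```python
def occurrences(text, words):
    """Sorted (start, end) pairs of all occurrences of the words in text."""
    found = []
    for w in words:
        start = text.find(w)
        while start != -1:
            found.append((start, start + len(w)))
            start = text.find(w, start + 1)
    return sorted(found)


def clumps(text, words):
    """Maximal clumps as (start, end, k), where k counts the occurrences.

    Two occurrences are placed in the same clump only if they share at
    least one letter, so that any two consecutive letters of the clump
    lie inside a common occurrence.
    """
    result = []
    for start, end in occurrences(text, words):
        if result and start < result[-1][1]:
            s, e, k = result[-1]
            result[-1] = (s, max(e, end), k + 1)
        else:
            result.append((start, end, 1))
    return result


three_clump = clumps("ACATTACATTACACATTACA", {"ACATTACA"})
two_separate = clumps("ACATTACAGACATTACA", {"ACATTACA"})
```

Here `three_clump` is `[(0, 20, 3)]`, the 3-clump from the slide, while `two_separate` contains two 1-clumps because the occurrences do not overlap.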
Clumps generating function
$$C(z) = p_0 + p_1 z + \dots + p_n z^n$$
p_k: the sum of the probabilities of all k-clumps.
Our goal is to create an efficient method for computing the probabilities of k-clumps.
Degenerate (indeterminate) patterns
A degenerate alphabet Σ' is an alphabet whose letters are subsets of an alphabet Σ. A degenerate pattern is a string over Σ'.
Example: IUPAC alphabet
A = [A]
C = [C]
G = [G]
T = [T]
R = [AG]
Y = [CT]
S = [CG]
…
N = [ACGT]
Examples: IUPAC consensuses
• TATA-box TATA[AT]A[AT]: 4 words of length 7
• Consensus of the transcription factor binding site Antp (Drosophila) ANNNNCATTA: 256 words of length 10
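The word counts in these examples can be reproduced by expanding a degenerate pattern into the set of words it denotes. The sketch below is illustrative only. It assumes patterns are written with IUPAC letters and bracketed letter sets, as on the slide, and the helper names are hypothetical.

```python
from itertools import product

# Standard IUPAC nucleotide codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def parse_degenerate(pattern):
    """Split a pattern such as TATA[AT]A[AT] into one letter set per position."""
    positions = []
    i = 0
    while i < len(pattern):
        if pattern[i] == "[":
            j = pattern.index("]", i)
            positions.append(pattern[i + 1:j])
            i = j + 1
        else:
            positions.append(IUPAC[pattern[i]])
            i += 1
    return positions


def expand(pattern):
    """All words denoted by the degenerate pattern, in lexicographic order."""
    return ["".join(letters) for letters in product(*parse_degenerate(pattern))]


tata_words = expand("TATA[AT]A[AT]")
antp_words = expand("ANNNNCATTA")
```

Explicit expansion grows exponentially with the number of degenerate positions, which is why the slides work with automata rather than word lists.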
Pattern matching (Aho-Corasick) automaton for degenerate pattern H = A[CT]A
[Figure: trie with root 0, an A-edge to state 1, C- and T-edges from state 1 to states 2 and 3, and A-edges from states 2 and 3 to the terminal states 4 and 5.]
Clumps for H = A[CT]A: ACA, ATA, ACACA, ACATA, …
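A pattern matching automaton of this kind can be sketched with the textbook Aho-Corasick construction over the words of H = A[CT]A. This is a generic illustration rather than the authors' code. The state numbering follows insertion order and therefore differs from the numbering on the slide, and the function names are hypothetical.

```python
from collections import deque


def build_aho_corasick(words, alphabet):
    """Return (delta, terminal, fail) for the given word set.

    delta[s][a] is the complete DFA transition, terminal[s] tells whether
    some word ends in state s, and fail[s] is the suffix link of s.
    """
    goto = [{}]
    terminal = [False]
    for w in sorted(words):
        s = 0
        for a in w:
            if a not in goto[s]:
                goto.append({})
                terminal.append(False)
                goto[s][a] = len(goto) - 1
            s = goto[s][a]
        terminal[s] = True

    fail = [0] * len(goto)
    delta = [dict() for _ in goto]
    queue = deque()
    for a in alphabet:
        if a in goto[0]:
            delta[0][a] = goto[0][a]
            queue.append(goto[0][a])
        else:
            delta[0][a] = 0
    # Breadth-first traversal: suffix links always point to shallower states.
    while queue:
        s = queue.popleft()
        terminal[s] = terminal[s] or terminal[fail[s]]
        for a in alphabet:
            if a in goto[s]:
                t = goto[s][a]
                fail[t] = delta[fail[s]][a]
                delta[s][a] = t
                queue.append(t)
            else:
                delta[s][a] = delta[fail[s]][a]
    return delta, terminal, fail


def match_ends(delta, terminal, text):
    """1-based end positions of all pattern occurrences in text."""
    ends = []
    s = 0
    for i, a in enumerate(text, start=1):
        s = delta[s][a]
        if terminal[s]:
            ends.append(i)
    return ends


delta, terminal, fail = build_aho_corasick({"ACA", "ATA"}, "ACGT")
ends_in_clump = match_ends(delta, terminal, "ACATA")
```

The automaton has six states, as on the slide, and reading the clump ACATA reports occurrences ending at positions 3 and 5.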
Overlap walking automaton* for H = A[CT]A
[Figure, two panels. Left: pattern matching automaton for H = A[CT]A, with states 0 to 5 and edges labeled A, C, T, A, A. Right: overlap walking automaton with states 0, 4 and 5; edges labeled ACA and ATA leave state 0, and edges labeled CA and TA link states 4 and 5 to themselves and to each other.]
Clumps: ACA, ATA, ACACA, ACATA, …
* Regnier M., 2014
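The idea of walking along overlaps can be sketched on the explicit word set {ACA, ATA}. This is a simplified illustration under the assumption that states are the words of H and that an edge from u to v carries the part of v that remains after a proper overlap with a suffix of u; it is not the construction from the cited work. For patterns whose words create extra occurrences inside a walk, the number of words on a walk can differ from the number of occurrences in the spelled string.

```python
def overlap_edges(words):
    """Edges (u, v, label): a proper suffix of u equals a proper prefix of v."""
    edges = []
    for u in sorted(words):
        for v in sorted(words):
            for k in range(1, min(len(u), len(v))):
                if u[-k:] == v[:k]:
                    edges.append((u, v, v[k:]))
    return edges


def walks(words, steps):
    """Sorted strings spelled by walks that chain `steps` overlapping words."""
    edges = overlap_edges(words)
    current = [(w, w) for w in sorted(words)]  # (last word, spelled string)
    for _ in range(steps - 1):
        current = [
            (v, spelled + label)
            for last, spelled in current
            for u, v, label in edges
            if u == last
        ]
    return sorted({spelled for _, spelled in current})


edges = overlap_edges({"ACA", "ATA"})
two_word_walks = walks({"ACA", "ATA"}, 2)
```

For this pattern the four overlap edges carry the labels CA and TA, matching the figure, and the two-word walks spell ACACA, ACATA, ATACA and ATATA.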
We propose a minimization of the overlap walking automaton for degenerate patterns.
Minimization of the pattern matching automaton for the degenerate pattern H = [AT][CG][AC]
Minimal pattern matching automaton for the degenerate pattern H = [AT][CG][AC]
[Figure: states 0 to 4; an [AT]-edge from state 0 to state 1, a [CG]-edge from state 1 to state 2, and edges A and C from state 2 to the terminal states 3 and 4.]
This automaton can be constructed in time linear in the number of its states.
R-equivalence
Nodes x and y are R-equivalent (x R~ y) iff x = y or
1. |x| = |y|;
2. suffix_link(x) R~ suffix_link(y).
For degenerate patterns, nodes of the same length have the same paths below them.
Two words are R-equivalent iff they are Nerode-equivalent.
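The R-equivalence classes can be computed directly from the definition on a small example. The sketch below builds the trie nodes of H = [AT][CG][AC] as a set of prefixes and uses a naive suffix-link search, so it does not reflect the linear-time construction mentioned in the slides. Function names are hypothetical.

```python
from itertools import product


def expand_positions(position_sets):
    """All words of a degenerate pattern given as one letter set per position."""
    return ["".join(letters) for letters in product(*position_sets)]


def r_classes(words):
    """Map every trie node (a prefix of some word) to its R-equivalence class.

    Two distinct nodes share a class iff they have the same length and
    their suffix links are R-equivalent.
    """
    nodes = {w[:i] for w in words for i in range(len(w) + 1)}

    def suffix_link(x):
        # Longest proper suffix of x that is a trie node (naive search).
        for i in range(1, len(x) + 1):
            if x[i:] in nodes:
                return x[i:]

    class_of = {"": 0}  # the root forms its own class
    keys = {}
    # Suffix links point to shorter nodes, so process nodes by length.
    for x in sorted(nodes - {""}, key=lambda s: (len(s), s)):
        key = (len(x), class_of[suffix_link(x)])
        if key not in keys:
            keys[key] = len(keys) + 1
        class_of[x] = keys[key]
    return class_of


words = expand_positions(["AT", "CG", "AC"])
class_of = r_classes(words)
num_classes = len(set(class_of.values()))
```

The trie has 15 nodes, which collapse into 5 classes, in line with the five states of the minimal automaton shown for this pattern.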
Minimal overlap walking automaton for H = [AT][CG][AC]
[Figure, two panels. Left: minimal pattern matching automaton for H = [AT][CG][AC], with states 0 to 4 and edges labeled [AT], [CG], A and C. Right: minimal overlap walking automaton with states 0, 3 and 4; edges labeled [AT][CG]A and [AT][CG]C leave state 0, and edges labeled [CG]A and [CG]C leave state 3.]
Clumps: [AT][CG]A, [AT][CG]C, [AT][CG]A[CG]A, [AT][CG]A[CG]C, …
Examples demonstrating efficiency
• H = LXDXLXD[DLE] (amino acid alphabet)
  PatAut: 40841 states and 81681 edges
  R-minimal PatAut: 25 states and 59 edges
  Minimal OWA: 6 states and 45 edges
• H = AXXXXCATTA (DNA alphabet)
  PatAut: 1622 states and 3243 edges
  R-minimal PatAut: 64 states and 140 edges
  Minimal OWA: 2 states and 3 edges
Thank you