Minimized compact automaton for clumps over degenerate patterns

Evgeniia Furletova*, Jan Holub, Mireille Regnier

November 27, 2015

Institute of mathematical problems in biology, Russia


Page 1

Minimized compact automaton for clumps over degenerate patterns

Evgeniia Furletova*, Jan Holub, Mireille Regnier

November 27, 2015

Institute of mathematical problems in biology, Russia

Page 2

Collaborators

Mireille Regnier, Ecole Polytechnique, INRIA, France

Jan Holub, Czech Technical University in Prague, Czech Republic

Page 3

Challenge

Problem. Create an efficient algorithm for computing the P-value of pattern occurrences.

Recognition of functional fragments in biological sequences can be reduced to finding overrepresented occurrences of a pattern.

A measure of overrepresentation is the P-value of pattern occurrences.

Page 4

P-value of pattern occurrences

The P-value is the probability of finding at least one occurrence of a word from a pattern H in a random sequence of length n generated according to a given probability model.

For a Bernoulli model, the P-value can be approximated by the formula*:

\[ P\text{-value} \approx 1 - \frac{1}{\left(1 - C'(\rho)\right)\rho^{\,n+1}}, \]

where

• C(z) – generating function of clumps;
• ρ – the root of 1 – z + C(z) = 0 closest to 1.

* Regnier M., Fang B., Iakovishina D. Clump Combinatorics, Automata, and Word Asymptotics // Proceedings of the Eleventh Workshop on Analytic Algorithmics and Combinatorics (ANALCO). 2014.
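A minimal numerical sketch of this approximation, assuming the clump generating function is available as a truncated power series whose coefficients are passed in as a list. The function name `pvalue_approx`, its parameters, and the use of Newton's iteration started at z = 1 are illustrative assumptions, not the authors' implementation. The sign convention is also assumed: C(z) is taken such that 1/(1 – z + C(z)) generates the probabilities of texts with no occurrence.

```python
def pvalue_approx(clump_coeffs, n, tol=1e-12, max_iter=100):
    """Approximate the P-value as 1 - 1 / ((1 - C'(rho)) * rho**(n + 1)).

    clump_coeffs[i] is the (assumed known) coefficient of z**i in C(z).
    """
    def C(z):
        return sum(c * z**i for i, c in enumerate(clump_coeffs))

    def dC(z):
        return sum(i * c * z**(i - 1) for i, c in enumerate(clump_coeffs) if i > 0)

    # Newton iteration on D(z) = 1 - z + C(z), started at z = 1.
    # Assumption: this converges to the root closest to 1.
    rho = 1.0
    for _ in range(max_iter):
        step = (1.0 - rho + C(rho)) / (-1.0 + dC(rho))
        rho -= step
        if abs(step) < tol:
            break
    return 1.0 - 1.0 / ((1.0 - dC(rho)) * rho ** (n + 1))
```

As a sanity check under these assumptions, for the single-letter pattern H = {A} with letter probability p one has C(z) = pz, and the result coincides with the exact value 1 – (1 – p)^n.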

Page 5

Clumps

A k-clump for a pattern H = {h1,…,hr} is a string s such that:

• s consists of k overlapping occurrences of H;
• any two consecutive letters of s belong to an occurrence of H.

Examples of clumps for pattern ACATTACA

• ACATTACA: 1-clump

• ACATTACATTACACATTACA: 3-clump (three overlapping occurrences of ACATTACA)
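The definition can be illustrated with a short sketch that scans a text, finds all occurrences of the pattern words, and merges occurrences sharing at least one letter into clumps. The helper name `find_clumps` is hypothetical, and the brute-force scan assumes a small explicit word set with words of length at least 2.

```python
def find_clumps(text, words):
    """Return (clump_string, k) pairs for maximal runs of overlapping occurrences."""
    # All occurrences as (start, end) intervals, end exclusive.
    occurrences = sorted(
        (i, i + len(w))
        for w in set(words)
        for i in range(len(text) - len(w) + 1)
        if text.startswith(w, i)
    )
    clumps = []
    for start, end in occurrences:
        if clumps and start < clumps[-1][1]:
            # Shares at least one letter with the current clump: extend it.
            clumps[-1][1] = max(clumps[-1][1], end)
            clumps[-1][2] += 1
        else:
            clumps.append([start, end, 1])
    return [(text[s:e], k) for s, e, k in clumps]
```

Applied to the text ACATTACATTACACATTACA with the single word ACATTACA, this finds occurrences at offsets 0, 5 and 12 and reports one 3-clump.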

Page 6

Clumps generating function

Our goal is to create an efficient method for computing the probabilities of k-clumps.

\[ C(z) = p_0 + p_1 z + \dots + p_n z^n, \]

where p_k is the sum of probabilities of all k-clumps.

Page 7

Degenerate (indeterminate) patterns

A degenerate alphabet Σ’ is an alphabet whose letters are subsets of an alphabet Σ. A degenerate pattern is a string over Σ’.

Example: IUPAC alphabet

A = [A], C = [C], G = [G], T = [T], R = [AG], Y = [CT], S = [CG], …, N = [ACGT]

Examples: IUPAC consensuses

TATA-box: TATA[AT]A[AT] – 4 words of length 7

Consensus of the transcription factor binding site Antp (Drosophila): ANNNNCATTA – 256 words of length 10
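The word counts in these examples can be reproduced by expanding a degenerate pattern into its set of words. The sketch below is illustrative only: the helper names `parse_pattern` and `expand` are hypothetical, and the table uses the standard IUPAC nucleotide codes, accepting both bracket notation and single-letter codes.

```python
from itertools import product

# Standard IUPAC nucleotide codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def parse_pattern(pattern):
    """Turn 'TATA[AT]A[AT]' or 'ANNNNCATTA' into a list of letter sets."""
    positions, i = [], 0
    while i < len(pattern):
        if pattern[i] == "[":
            j = pattern.index("]", i)
            positions.append(frozenset(pattern[i + 1:j]))
            i = j + 1
        else:
            positions.append(frozenset(IUPAC[pattern[i]]))
            i += 1
    return positions

def expand(pattern):
    """List every word matched by the degenerate pattern."""
    letter_sets = (sorted(p) for p in parse_pattern(pattern))
    return ["".join(word) for word in product(*letter_sets)]
```

For instance, `len(expand("TATA[AT]A[AT]"))` gives 4 and `len(expand("ANNNNCATTA"))` gives 256, matching the counts above.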

Page 8

Pattern matching (Aho-Corasick) automaton for degenerate pattern H = A[CT]A

[Figure: automaton with states 0–5; edge labels A, C, T, A, A]

Page 9

Pattern matching (Aho-Corasick) automaton for degenerate pattern H = A[CT]A

[Figure: automaton with states 0–5; edge labels A, C, T, A, A]

Clumps: ACA, ATA, ACACA, ACATA, …
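For reference, a textbook Aho-Corasick construction (trie plus suffix links) over the explicit word set of a degenerate pattern can be sketched as follows. The function name `build_aho_corasick` is hypothetical, and the state numbers follow insertion order, so they need not match the numbering in the figure.

```python
from collections import deque

def build_aho_corasick(words):
    """Return (goto, fail, depth) for the trie of `words` with suffix links."""
    goto, fail, depth = [{}], [0], [0]
    for w in words:
        node = 0
        for ch in w:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                depth.append(depth[node] + 1)
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
    # Breadth-first pass: the suffix link of a node is the longest proper
    # suffix of its word that is also a node of the trie.
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            queue.append(child)
    return goto, fail, depth
```

For H = A[CT]A, that is the words ACA and ATA, this yields 6 states, and the suffix links of both ACA and ATA point to the state for A.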

Page 10

Overlap walking automaton* for H = A[CT]A

[Figure, left panel: pattern matching automaton for H = A[CT]A, with states 0–5 and edge labels A, C, T, A, A]

[Figure, right panel: overlap walking automaton, with states 0, 4, 5 and edge labels ACA, ATA, CA, TA]

Clumps: ACA, ATA, ACACA, ACATA, …

* Regnier M., 2014
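The edge labels in the figure can be reproduced with a naive construction over the explicit word set: one edge from the start state per word, plus one edge per proper overlap between a suffix of one word and a prefix of another, labeled by the letters that extend the clump. This quadratic sketch is only an illustration; the name `overlap_walking_automaton` is hypothetical and it is not the construction proposed in the talk.

```python
def overlap_walking_automaton(words):
    """Naive construction: states are 0 (start) and the words themselves.

    Returns a list of edges (source, label, target).
    """
    edges = [(0, w, w) for w in words]
    for u in words:
        for v in words:
            # Every proper overlap: a suffix of u that is a prefix of v.
            for k in range(1, min(len(u), len(v))):
                if u[-k:] == v[:k]:
                    edges.append((u, v[k:], v))
    return edges
```

For the words ACA and ATA the only overlap is the single letter A, which gives the labels CA and TA between the two word states.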

Page 11

We propose a minimization of the overlap walking automaton for degenerate patterns.

Page 12

Minimization of the pattern matching automaton for the degenerate pattern H = [AT][CG][AC]

Page 13

Minimal pattern matching automaton for the degenerate pattern H = [AT][CG][AC]

[Figure: automaton with states 0–4; edge labels [AT], [CG], [A], [C]]

This automaton can be constructed in time linear in the number of its states.

Page 14

R-equivalence

Nodes x and y are R-equivalent (x ~R y) iff x = y, or both of the following hold:

1. |x| = |y|;
2. suffix_link(x) ~R suffix_link(y).

For degenerate patterns, nodes of the same length have the same paths below them.

Two words are R-equivalent iff they are Nerode-equivalent.
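The recursive definition can be turned into a small sketch that assigns a class to every trie node (every prefix of a pattern word) from its length and the class of its suffix link. The function name `r_classes` and the brute-force suffix-link search are illustrative assumptions; a linear-time construction would work on the automaton directly.

```python
def r_classes(words):
    """Map every trie node (prefix of a word) to its R-equivalence class id."""
    prefixes = {w[:i] for w in words for i in range(len(w) + 1)}

    def suffix_link(x):
        # Longest proper suffix of x that is itself a prefix of some word.
        for i in range(1, len(x) + 1):
            if x[i:] in prefixes:
                return x[i:]

    class_of = {"": 0}   # the root is equivalent only to itself
    keys = {}
    # Shorter nodes first, so the class of a suffix link is already known.
    for x in sorted(prefixes - {""}, key=len):
        key = (len(x), class_of[suffix_link(x)])
        class_of[x] = keys.setdefault(key, len(keys) + 1)
    return class_of
```

For H = [AT][CG][AC] the 15 trie nodes fall into 5 classes: the root, the nodes of length 1, the nodes of length 2, the words ending in A, and the words ending in C.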

Page 15

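Minimal overlap walking automaton for H = [AT][CG][AC]

Clumps: [AT][CG]A, [AT][CG]C, [AT][CG]A[CG]A, [AT][CG]A[CG]C, …

[Figure, left panel: minimal pattern matching automaton for H = [AT][CG][AC], with states 0–4 and edge labels [AT], [CG], A, C]

[Figure, right panel: minimal overlap walking automaton, with states 0, 3, 4 and edge labels [AT][CG]A, [AT][CG]C, [CG]A, [CG]C]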
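The overlaps behind these edge labels can be found directly on the degenerate pattern by intersecting letter sets position by position, without expanding the pattern into words. The sketch below is an illustration under that assumption; the name `self_overlaps` is hypothetical.

```python
def self_overlaps(positions):
    """positions: list of letter sets, e.g. [set("AT"), set("CG"), set("AC")].

    Returns {k: overlap} for every k such that the last k positions are
    compatible with the first k positions; overlap[i] is the set of letters
    allowed at the i-th shared position.
    """
    m = len(positions)
    result = {}
    for k in range(1, m):
        common = [positions[m - k + i] & positions[i] for i in range(k)]
        if all(common):   # every shared position admits at least one letter
            result[k] = common
    return result
```

For H = [AT][CG][AC] the only overlap has length 1 and forces the shared letter to be A, which is consistent with the clumps listed above: a clump can be extended by [CG]A or [CG]C only after an occurrence ending in A.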

Page 16

Examples demonstrating efficiency

• H = LXDXLXD[DLE] (amino acid alphabet)
  PatAut: 40841 states and 81681 edges
  R-minimal PatAut: 25 states and 59 edges
  Minimal OWA: 6 states and 45 edges

• H = AXXXXCATTA (DNA alphabet)
  PatAut: 1622 states and 3243 edges
  R-minimal PatAut: 64 states and 140 edges
  Minimal OWA: 2 states and 3 edges
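The size of the non-minimized pattern matching automaton grows with the number of distinct prefixes of the pattern words, which can be counted from the sizes of the letter sets alone. The helper `trie_state_count` below is a hypothetical sketch, and it assumes X stands for any letter of the alphabet (4 letters for DNA).

```python
def trie_state_count(class_sizes):
    """Number of trie nodes = number of distinct prefixes, root included."""
    total, prefixes = 1, 1
    for size in class_sizes:
        prefixes *= size      # distinct prefixes of this length
        total += prefixes
    return total
```

For H = AXXXXCATTA over the DNA alphabet, `trie_state_count([1, 4, 4, 4, 4, 1, 1, 1, 1, 1])` gives 1622, the PatAut state count quoted above.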

Page 17

Thank you