Property Matching and Weighted Matching

Amihood Amir, Eran Chencinski, Costas Iliopoulos, Tsvi Kopelowitz and Hui Zhang

Page 1: Property Matching and Weighted Matching

Property Matching and Weighted Matching

Amihood Amir, Eran Chencinski, Costas Iliopoulos, Tsvi Kopelowitz and Hui Zhang

Page 2: Property Matching and Weighted Matching

Results

Weighted Matching

Property Matching

Pattern Matching

General Reduction

Property Indexing

Page 3: Property Matching and Weighted Matching

Property Matching

Def: A property $\pi$ of a string $T = t_1, \ldots, t_n$ is a set of intervals $\pi = \{(s_1, f_1), (s_2, f_2), \ldots, (s_t, f_t)\}$ s.t. $s_i, f_i \in \{1, \ldots, n\}$ and $s_i \le f_i$.

Property Matching Problem

Given a text T with property $\pi$ and a pattern P, find all locations where P matches T and is fully contained in an interval in $\pi$.

Page 4: Property Matching and Weighted Matching

Property Matching - Example

Property Swap Matching Problem

A A A D B B A DB D B A D B D

A D B

Page 5: Property Matching and Weighted Matching

Property Matching

Solving Property Matching Problem

• Solve regular pattern matching problem

• Eliminate results not in property interval

• Eliminating results can be done in linear time

• If regular problem takes Ω(n) time => property matching time = regular problem time
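The two steps above (match, then eliminate) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `property_match` is hypothetical, indices are 0-based and intervals are inclusive (the slides use 1-based locations), and `str.find` merely stands in for whatever regular pattern matching algorithm is used.

```python
def property_match(text, pattern, intervals):
    """Return 0-based starts where pattern occurs in text inside one interval.

    intervals: iterable of inclusive 0-based (s, f) pairs (the property).
    """
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []

    # reach[i] = largest right endpoint f over all intervals with s <= i,
    # or -1 if no interval starts at or before i. Built in O(n + #intervals).
    reach = [-1] * n
    for s, f in intervals:
        reach[s] = max(reach[s], f)
    for i in range(1, n):
        reach[i] = max(reach[i], reach[i - 1])

    matches = []
    i = text.find(pattern)  # stand-in for any regular matcher
    while i != -1:
        # The occurrence covers [i, i + m - 1]; keep it only if some interval
        # that starts at or before i also reaches its last character.
        if reach[i] >= i + m - 1:
            matches.append(i)
        i = text.find(pattern, i + 1)
    return matches


print(property_match("AAADBBADBD", "ADB", [(0, 3), (6, 9)]))
```

The elimination step is a single array lookup per candidate, which is why it adds only linear time on top of the regular matcher.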

Page 6: Property Matching and Weighted Matching

Property Indexing Problem

• Preprocess T s.t. given a P find occurrences of P in T s.t. P is contained in a property interval

• Time: proportional to |P| and tocc

• Our solution: query time O(|P| log|Σ| + tocc), preprocessing O(n log|Σ| + n log log n)

Page 7: Property Matching and Weighted Matching

Weighted Sequence

Def 1: A weighted sequence is a sequence of sets of pairs $\langle s_j, \pi_i(s_j) \rangle$, where $s_j \in \Sigma$ and $\pi_i(s_j)$ is the probability of having symbol $s_j$ at location i.

Example (one set of pairs per location, separated by |):

<A,1/2><B,3/8><C,1/8> | <A,1/3><B,1/3><D,1/3> | <A,1/4><C,3/4> | <D,1> | <B,1/2><C,1/2> | <B,1/9><C,8/9>

Page 8: Property Matching and Weighted Matching

Weighted Sequence

Def 2: Given prob. ε, $P = p_1, \ldots, p_m$ occurs at location i of weighted text T w.p. at least ε if:

$$\prod_{j=1}^{m} \pi_{i+j-1}(p_j) \ge \varepsilon$$

where $\pi_k(\sigma)$ denotes the probability of symbol $\sigma$ at location k of T.
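As a small illustration of Def 2, the sketch below multiplies the per-location probabilities of the pattern's symbols. The encoding (one dict per location, 0-based locations) and the names `occurrence_probability` and `occurs` are assumptions made for this example; the text `T` is the six-location example sequence from the slides.

```python
from fractions import Fraction as F

# Hypothetical encoding: one dict per location, symbol -> probability.
T = [
    {"A": F(1, 2), "B": F(3, 8), "C": F(1, 8)},
    {"A": F(1, 3), "B": F(1, 3), "D": F(1, 3)},
    {"A": F(1, 4), "C": F(3, 4)},
    {"D": F(1)},
    {"B": F(1, 2), "C": F(1, 2)},
    {"B": F(1, 9), "C": F(8, 9)},
]


def occurrence_probability(text, pattern, i):
    """Probability that pattern occurs at 0-based location i of text."""
    if i < 0 or i + len(pattern) > len(text):
        return F(0)
    prob = F(1)
    for j, symbol in enumerate(pattern):
        # A symbol that is absent at a location has probability 0 there.
        prob *= text[i + j].get(symbol, F(0))
    return prob


def occurs(text, pattern, i, eps):
    """Def 2: pattern occurs at location i with probability at least eps."""
    return occurrence_probability(text, pattern, i) >= eps


# A(1/4) * D(1) * C(1/2) * C(8/9) at 0-based location 2
print(occurrence_probability(T, "ADCC", 2))
```

Exact fractions are used instead of floats so that comparisons against ε are not affected by rounding.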

Page 9: Property Matching and Weighted Matching

Weighted Sequence

<A,1/2><B,3/8><C,1/8> | <A,1/3><B,1/3><D,1/3> | <A,1/4><C,3/4> | <D,1> | <B,1/2><C,1/2> | <B,1/9><C,8/9>

A D C C

Page 10: Property Matching and Weighted Matching

Goal

• Weighted Matching problems = Pattern Matching problems with weighted text.

• Goal: Find general reduction for solving weighted matching problems using regular pattern matching algorithms.

Page 11: Property Matching and Weighted Matching

Naive Algorithm

Algorithm A

1. Find all possible patterns appearing in weighted text.

2. Concatenate all patterns to create new text.

3. Run regular pattern matching algorithm on new regular text.

4. Check each pattern found for prob. ≥ ε.
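A literal-minded sketch of Algorithm A is given below, mainly to make the blow-up visible. It is an illustration under assumptions: the dict-per-location encoding and the name `naive_weighted_match` are hypothetical, locations are 0-based, and instead of physically concatenating the enumerated strings it searches each one separately, which is equivalent to concatenating them with separators.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical encoding of the example text: symbol -> probability per location.
T = [
    {"A": F(1, 2), "B": F(3, 8), "C": F(1, 8)},
    {"A": F(1, 3), "B": F(1, 3), "D": F(1, 3)},
    {"A": F(1, 4), "C": F(3, 4)},
    {"D": F(1)},
    {"B": F(1, 2), "C": F(1, 2)},
    {"B": F(1, 9), "C": F(8, 9)},
]


def naive_weighted_match(text, pattern, eps):
    """Return (sorted 0-based match locations, number of strings enumerated)."""
    hits = set()
    candidates = 0
    # Step 1: every way of choosing one (symbol, probability) pair per location.
    for choice in product(*(loc.items() for loc in text)):
        candidates += 1
        string = "".join(sym for sym, _ in choice)
        # Step 3: regular matching on one enumerated string.
        i = string.find(pattern)
        while i != -1:
            # Step 4: check the probability of this occurrence.
            prob = F(1)
            for _, p in choice[i:i + len(pattern)]:
                prob *= p
            if prob >= eps:
                hits.add(i)
            i = string.find(pattern, i + 1)
    return sorted(hits), candidates


print(naive_weighted_match(T, "DC", F(1, 2)))
```

Even this six-location text spells 3 · 3 · 2 · 1 · 2 · 2 = 72 strings; the count multiplies with every non-solid location.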

Page 12: Property Matching and Weighted Matching

Naive Algorithm

<A,1/2><B,3/8><C,1/8> | <A,1/3><B,1/3><D,1/3> | <A,1/4><C,3/4> | <D,1> | <B,1/2><C,1/2> | <B,1/9><C,8/9>

Strings spelled by the weighted text, enumerated one by one:

AAADBB, AAADBC, AAADCB, AAADCC, AACDBB, AACDBC, AACDCB, AACDCC, ABADBB, …

Page 13: Property Matching and Weighted Matching

Naive Algorithm

• Clearly this algorithm is inefficient and can be exponential even for |Σ| = 2.

• Notice that there is a lot of waste:

– Many patterns share same substrings.

– Given ε, we can ignore patterns w.p. < ε.

Page 14: Property Matching and Weighted Matching

Maximal Factor

Def 3: Given ε and a weighted text T, a string X is a maximal factor of T at location i if:

(a) X appears at location i w.p. ≥ ε

(b) if we extend X with 1 character to the right or left, the probability drops below ε.
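Def 3 can be checked directly for a given string and location. The sketch below is an illustration under assumptions: the dict-per-location encoding, 0-based locations, and the names `prob_at` and `is_maximal_factor` are hypothetical, and a factor touching the start or end of the text is treated as non-extendable on that side. The value ε = 1/8 is chosen here only for the example.

```python
from fractions import Fraction as F

T = [
    {"A": F(1, 2), "B": F(3, 8), "C": F(1, 8)},
    {"A": F(1, 3), "B": F(1, 3), "D": F(1, 3)},
    {"A": F(1, 4), "C": F(3, 4)},
    {"D": F(1)},
    {"B": F(1, 2), "C": F(1, 2)},
    {"B": F(1, 9), "C": F(8, 9)},
]


def prob_at(text, x, i):
    """Probability that string x appears at 0-based location i."""
    if i < 0 or i + len(x) > len(text):
        return F(0)
    prob = F(1)
    for j, sym in enumerate(x):
        prob *= text[i + j].get(sym, F(0))
    return prob


def is_maximal_factor(text, x, i, eps):
    """Check conditions (a) and (b) of Def 3 for x at location i."""
    p = prob_at(text, x, i)
    if p < eps:                      # condition (a) fails
        return False
    right = i + len(x)
    # Condition (b), right side: no single character keeps probability >= eps.
    if right < len(text) and any(p * q >= eps for q in text[right].values()):
        return False
    # Condition (b), left side.
    if i > 0 and any(p * q >= eps for q in text[i - 1].values()):
        return False
    return True


# ACDB at location 1 has probability 1/3 * 3/4 * 1 * 1/2 = 1/8.
print(is_maximal_factor(T, "ACDB", 1, F(1, 8)))
```

With ε = 1/8 the string ACDB cannot be extended: the best left extension gives 1/16 and the best right extension gives 1/9, both below ε.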

Page 15: Property Matching and Weighted Matching

Maximal Factor

<A,1/2><B,3/8><C,1/8> | <A,1/3><B,1/3><D,1/3> | <A,1/4><C,3/4> | <D,1> | <B,1/2><C,1/2> | <B,1/9><C,8/9>

A C D B

Page 16: Property Matching and Weighted Matching

Algorithm B

1. Find all maximal factors in text.

2. Concatenate factors to create new text.

3. Run regular pattern matching algorithm on new regular text.

Note: A pattern appearing in new text has prob. of appearance ≥ ε.

Page 17: Property Matching and Weighted Matching

Total Length of Maximal Factors

What is total length of all maximal factors?

Consider the following case:

<A,1-δ><B,δ> | <A,1-δ><B,δ> | <C,1> | <A,1-δ><B,δ> | <A,1-δ><B,δ> | <C,1>

such that $(1-\delta)^{n/3} = \varepsilon$.

There are n/3 maximal factors of length 2n/3, so the total length of all maximal factors is $\Omega(n^2)$.
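As a quick check of the arithmetic, taking the two counts stated above at face value (n/3 factors, each of length 2n/3):

```latex
\frac{n}{3} \cdot \frac{2n}{3} \;=\; \frac{2n^{2}}{9} \;=\; \Omega(n^{2})
```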

Page 18: Property Matching and Weighted Matching

Classifying Text Locations

Given ε, we classify location i of weighted text into 3 categories:

• Solid positions: one character w.p. exactly 1.

• Leading positions: at least one character w.p. greater than 1-ε (and less than 1).

• Branching positions: all characters have probability of appearance at most 1-ε.
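The three categories translate into a short classifier. This is a sketch under assumptions: the dict-per-location encoding and the name `classify` are hypothetical, each dict is assumed to list only symbols with positive probability, and ε = 1/3 is picked only for the example.

```python
from fractions import Fraction as F

# Example text, encoded as symbol -> probability per location.
T = [
    {"A": F(1, 2), "B": F(3, 8), "C": F(1, 8)},
    {"A": F(1, 3), "B": F(1, 3), "D": F(1, 3)},
    {"A": F(1, 4), "C": F(3, 4)},
    {"D": F(1)},
    {"B": F(1, 3), "C": F(2, 3)},
    {"B": F(1, 9), "C": F(8, 9)},
]


def classify(location, eps):
    """Classify one location as 'solid', 'leading' or 'branching'."""
    top = max(location.values())
    if top == 1:
        return "solid"        # one character with probability exactly 1
    if top > 1 - eps:
        return "leading"      # some character above 1 - eps (and below 1)
    return "branching"        # every character at most 1 - eps


print([classify(loc, F(1, 3)) for loc in T])
```

Note that the boundary matters: with ε = 1/3 the fifth location, whose best character has probability exactly 2/3 = 1-ε, is branching rather than leading.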

Page 19: Property Matching and Weighted Matching

Classifying Text Locations

<A,1/2><B,3/8><C,1/8> | <A,1/3><B,1/3><D,1/3> | <A,1/4><C,3/4> | <D,1> | <B,1/3><C,2/3> | <B,1/9><C,8/9>

If ε ≤ 1/2, at most 1 “eligible” character at leading position

Page 20: Property Matching and Weighted Matching

LST Transformation

Def 4: The Leading to Solid Transformation of weighted text $T = t_1, \ldots, t_n$, $LST(T) = t'_1, \ldots, t'_n$, is:

$$t'_i = \begin{cases} \langle \sigma, 1 \rangle & \text{if } i \text{ is a leading position with leading character } \sigma \\ t_i & \text{otherwise} \end{cases}$$

where leading character has prob. of app. ≥ max{1-ε, ε}.
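A minimal sketch of the transformation, assuming ε ≤ 1/2 so that a leading position has a unique leading character. The dict-per-location encoding and the name `lst` are hypothetical, and ε = 1/3 is chosen only for the example.

```python
from fractions import Fraction as F

T = [
    {"A": F(1, 2), "B": F(3, 8), "C": F(1, 8)},
    {"A": F(1, 3), "B": F(1, 3), "D": F(1, 3)},
    {"A": F(1, 4), "C": F(3, 4)},
    {"D": F(1)},
    {"B": F(1, 3), "C": F(2, 3)},
    {"B": F(1, 9), "C": F(8, 9)},
]


def lst(text, eps):
    """Leading to Solid Transformation: make every leading position solid."""
    if not 0 < eps <= F(1, 2):
        raise ValueError("this sketch assumes 0 < eps <= 1/2")
    out = []
    for loc in text:
        sym, top = max(loc.items(), key=lambda kv: kv[1])
        if 1 - eps < top < 1:
            # Leading position: keep only the leading character, with prob. 1.
            out.append({sym: F(1)})
        else:
            # Solid and branching positions are copied unchanged.
            out.append(dict(loc))
    return out


for location in lst(T, F(1, 3)):
    print(location)
```

The input text is not modified; the result is a new weighted text in which the third and sixth locations have become solid `C` positions.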

Page 21: Property Matching and Weighted Matching

LST Transformation

T: <A,1/2><B,3/8><C,1/8> | <A,1/3><B,1/3><D,1/3> | <A,1/4><C,3/4> | <D,1> | <B,1/3><C,2/3> | <B,1/9><C,8/9>

LST(T): <A,1/2><B,3/8><C,1/8> | <A,1/3><B,1/3><D,1/3> | <C,1> | <D,1> | <B,1/3><C,2/3> | <C,1>

Page 22: Property Matching and Weighted Matching

Extended Maximal Factor

Def 5: X is an extended maximal factor of T if X is a maximal factor of LST(T).

T: <A,1-δ><B,δ> | <A,1-δ><B,δ> | <C,1> | <A,1-δ><B,δ> | <A,1-δ><B,δ> | <C,1>

LST(T): <A,1> | <A,1> | <C,1> | <A,1> | <A,1> | <C,1>

Page 23: Property Matching and Weighted Matching

Lemma 1

Lemma 1: Total length of all extended maximal factors is at most $O\left(n \cdot (1/\varepsilon)^2 \log(1/\varepsilon)\right)$.

Corollary: For constant ε, total length of all extended maximal factors is linear.

Page 24: Property Matching and Weighted Matching

Lemma 1

Why can we assume constant ε?

• In practice: want patterns that appear with noticeable probabilities e.g. 90%, 50% or 20%.

• Finding patterns w.p. at least 20% => 1/ε = 5.

• Smaller percentage = smaller ε, rarely in practice.

Page 25: Property Matching and Weighted Matching

Proof of Lemma 1

Case 1: ε > 1/2, search patterns w.p. > 50%.

Obv: At each location at most 1 char w.p. > 50%. Total length of all factors is ≤ n.

For rest of proof we assume ε ≤ 1/2.

Page 26: Property Matching and Weighted Matching

Proof of Lemma 1

Claim 1: An (extended) maximal factor passes by at most $O((1/\varepsilon) \log(1/\varepsilon))$ branching positions.

Proof: Denote $l_b$ = max. # of branching positions passed. In a branching position all characters have prob. of appearance ≤ 1-ε, so each branching position passed multiplies the factor's probability by at most 1-ε, while the factor must keep probability ≥ ε.
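A sketch of the calculation behind the bound, assuming each branching position contributes a factor of at most $1-\varepsilon$ and the probability of the whole factor must stay at least $\varepsilon$:

```latex
(1-\varepsilon)^{l_b} \;\ge\; \varepsilon
\quad\Longrightarrow\quad
l_b \;\le\; \frac{\ln(1/\varepsilon)}{\ln\bigl(1/(1-\varepsilon)\bigr)}
\;\le\; \frac{1}{\varepsilon}\,\ln\frac{1}{\varepsilon}
```

The last step uses $\ln\bigl(1/(1-\varepsilon)\bigr) = -\ln(1-\varepsilon) \ge \varepsilon$ for $0 < \varepsilon < 1$.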

Page 27: Property Matching and Weighted Matching

Proof of Lemma 1

Claim 2: At most $1/\varepsilon$ extended maximal factors start at each location.

Intuition:

<A_1,ε><A_2,ε>…<A_{1/ε},ε> | <B,1> | <C,1>

<A_1,1/2><A_2,1/2> | <C,1> | <B_1,2ε><B_2,2ε>…<B_{1/(2ε)},2ε>

Page 28: Property Matching and Weighted Matching

Proof of Lemma 1

Claim 1: An (extended) maximal factor passes by at most $O((1/\varepsilon) \log(1/\varepsilon))$ branching positions.

Claim 2: At most $1/\varepsilon$ extended maximal factors start at each location.

Corollary: Each location is in at most $O((1/\varepsilon)^2 \log(1/\varepsilon))$ extended maximal factors.

Page 29: Property Matching and Weighted Matching

Proof of Lemma 1

There are $l_b$ starting locations, and from each location there are ≤ $1/\varepsilon$ extended maximal factors.

Corollary: Each location is in at most $O((1/\varepsilon)^2 \log(1/\varepsilon))$ extended maximal factors.

Page 30: Property Matching and Weighted Matching

Finding Extended Maximal Factors

Algorithm for finding extended maximal factors:

1. Transform T to LST(T)

2. Find all maximal factors in LST(T) by:

(a) At each starting location try to extend until the prob. drops below ε.

(b) Backtrack to previous branching position and try to extend the factor and so on ...

Run time: linear in the output length.
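Step 2 can be sketched as a depth-first search, where returning from a recursive call plays the role of backtracking to the previous branching position. This is an illustration, not the authors' algorithm: the name `maximal_factors` and the dict-per-location encoding are hypothetical, locations are 0-based, the input `LST_T` is assumed to be an already transformed text (the example text with ε = 1/3), and no claim is made that this simple version meets the stated running time.

```python
from fractions import Fraction as F

# Assumed input: LST(T) for the example text with eps = 1/3.
LST_T = [
    {"A": F(1, 2), "B": F(3, 8), "C": F(1, 8)},
    {"A": F(1, 3), "B": F(1, 3), "D": F(1, 3)},
    {"C": F(1)},
    {"D": F(1)},
    {"B": F(1, 3), "C": F(2, 3)},
    {"C": F(1)},
]


def maximal_factors(text, eps):
    """Return (start, string) for every maximal factor of text."""
    n = len(text)
    factors = []

    def extend(start, pos, prefix, prob):
        extended = False
        if pos < n:
            for sym, p in text[pos].items():
                if prob * p >= eps:          # (a) keep extending to the right
                    extended = True
                    extend(start, pos + 1, prefix + sym, prob * p)
        if not extended and prefix:
            # Right-maximal; report it only if it is also left-maximal.
            left_possible = start > 0 and any(
                prob * p >= eps for p in text[start - 1].values()
            )
            if not left_possible:
                factors.append((start, prefix))
        # (b) returning here backtracks to the previous branching choice.

    for i in range(n):
        extend(i, i, "", F(1))
    return factors


for start, factor in maximal_factors(LST_T, F(1, 3)):
    print(start, factor)
```

The recursion depth equals the factor length, so a long text would call for an explicit stack instead.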

Page 31: Property Matching and Weighted Matching

Framework for Solving Weighted Matching Problems

Solving Weighted Matching Problems:

1. Find all extended maximal factors of T.

2. Concatenate factors (add $'s between) to get T'.

3. Compute property π by extending probabilities until below ε.

4. Run property algorithm on text T' with π.
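Steps 2 to 4 can be sketched for exact matching as follows. Everything here is an assumption made for illustration: the names `build_property_text` and `weighted_match`, the 0-based dict-per-location encoding, and the list `FACTORS`, which is taken as already computed (the extended maximal factors of the example text for ε = 1/3). The property is computed with a simple quadratic loop per factor, `str.find` stands in for the regular matcher, and the pattern is assumed to be non-empty and free of `$`.

```python
from fractions import Fraction as F

T = [
    {"A": F(1, 2), "B": F(3, 8), "C": F(1, 8)},
    {"A": F(1, 3), "B": F(1, 3), "D": F(1, 3)},
    {"A": F(1, 4), "C": F(3, 4)},
    {"D": F(1)},
    {"B": F(1, 3), "C": F(2, 3)},
    {"B": F(1, 9), "C": F(8, 9)},
]
# Assumed precomputed: (0-based start, string) of each extended maximal factor.
FACTORS = [(0, "A"), (0, "B"), (1, "ACD"), (1, "BCD"), (1, "DCD"),
           (2, "CDBC"), (2, "CDCC")]


def build_property_text(text, factors, eps):
    """Return (T', property intervals on T', origin of each T' character)."""
    chars, origin, intervals = [], [], []
    for start, x in factors:
        if chars:                         # step 2: '$' between factors
            chars.append("$")
            origin.append(None)
        offset = len(chars)
        # True probabilities come from the original text, not from LST(T).
        probs = [text[start + j].get(s, F(0)) for j, s in enumerate(x)]
        for k in range(len(x)):           # step 3: extend until below eps
            prob, end = F(1), k
            while end < len(x) and prob * probs[end] >= eps:
                prob *= probs[end]
                end += 1
            if end > k:
                intervals.append((offset + k, offset + end - 1))
        chars.extend(x)
        origin.extend(range(start, start + len(x)))
    return "".join(chars), intervals, origin


def weighted_match(text, factors, pattern, eps):
    """Step 4: property matching on T', mapped back to locations of text."""
    t_prime, intervals, origin = build_property_text(text, factors, eps)
    reach = [-1] * len(t_prime)           # furthest interval end per start
    for s, f in intervals:
        reach[s] = max(reach[s], f)
    for i in range(1, len(reach)):
        reach[i] = max(reach[i], reach[i - 1])
    found = set()
    i = t_prime.find(pattern)
    while i != -1:
        if reach[i] >= i + len(pattern) - 1:
            found.add(origin[i])
        i = t_prime.find(pattern, i + 1)
    return sorted(found)


print(weighted_match(T, FACTORS, "DB", F(1, 3)))
print(weighted_match(T, FACTORS, "DBC", F(1, 3)))
```

The second query shows why the property is needed: `DBC` does appear in T' because the transformation made the last location solid, but its true probability in T is 8/27 < 1/3, so the property interval rejects it.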

Page 32: Property Matching and Weighted Matching

Conclusions

• Our framework yields:

– Solutions to unsolved weighted matching problems (scaled, swapped, param. matching, indexing)

– Efficient solutions to others (exact and approx.)

• For constant ε:

– Weighted matching problems can be solved in same running times as regular pattern matching

– Weighted indexing can be solved in same times except for O(n log log n) preprocessing