Real time pattern matching Porat Benny Porat Ely Bar-Ilan University.
Real time pattern matching
Benny Porat, Ely Porat
Bar-Ilan University
Pattern Matching
Given a text T and a pattern P, the problem is to find all the substrings of T that are equal to P.
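As a baseline for the problem statement above, here is a minimal offline sketch that checks every alignment directly. The function name `find_all` is only an illustration and does not come from the slides.

```python
def find_all(text, pattern):
    """Return every start index i such that text[i:i+m] equals pattern."""
    m = len(pattern)
    # Try each alignment of the pattern against the text: O(n*m) time.
    return [i for i in range(len(text) - m + 1) if text[i:i + m] == pattern]


print(find_all("abababa", "aba"))
```

This version needs the whole text in memory, which is exactly what the online models below try to avoid.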
Online pattern matching
We get the text character by character, and after each character we must report whether the last m characters are equal to P.
Outline
Motivation
Presentation of 3 online models
Space lower bound
A black box algorithm
Exact and approximate pattern matching in the streaming model
Motivation…
Monitoring internet traffic
Motivation…
Stock market
Motivation…
Espionage
Motivation…
Viruses and malware
3 online models
First model: read-only memory m (for saving the pattern), working memory O(m)
Second model: read-only memory m (for saving the pattern), working memory O(poly(log m))
Third model: read-only memory 0 (we can't save the pattern), working memory O(poly(log m))
Space lower bound (deterministic)
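Assume an algorithm A that uses o(m) space to solve the online pattern matching problem.
Alice holds a string S = s1, s2, s3, …, sm. She runs A with S as the pattern and sends the state of A to Bob.
Bob runs over all the strings Q = q1, q2, …, qm and inserts each Q as the text for A.
A reports a match exactly when Q = S, so Bob can recover all of S from the o(m)-space state that Alice sent, a contradiction.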
A black box for online approximate pattern matching
Raphaël Clifford, Benny Porat, Ely Porat
CPM 2008
Black box for the First model
First model: read-only memory m (for saving the pattern), working memory O(m)
Problem definition
There are many offline pattern matching algorithms.
We want a black box algorithm that takes most offline pattern matching algorithms and converts them to pseudo real time.
Pseudo real time: take the best running time of the offline algorithm and divide it by n. This bounds the time per character.
Not amortized!
Result
For example, we can apply our algorithm to the following problems:
Hamming norm
K-mismatch
Matching under L2
Matching under L1
Online convolution
…
Time per character, in slide order: $O(\log^2 m)$, $O(\sqrt{m \log m})$, $O(\sqrt{k \log k}\,\log m)$, $O(\log^2 m)$, $O(\sqrt{m \log m})$, $O(\log^2 m)$
Exact And Approximate Pattern Matching In The Streaming Model
Benny Porat, Ely Porat
FOCS 2009
Solution for the third model
Third model: read-only memory 0 (we can't save the pattern), working memory O(poly(log m))
Pattern Matching
Pattern matching up to k mistakes
It's not minor!
Cache works much faster than RAM. Now it can fit!
Antivirus on routers
Researchers thought that there was a lower bound and that it couldn't be done.
Randomized algorithm (RK)
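$\phi(T_{i+1,i+m}) = t_{i+1} r^{m-1} + t_{i+2} r^{m-2} + \dots + t_{i+m}$
Pattern: $p_{m-1}, \dots, p_2, p_1, p_0$
Text: $t_1, t_2, t_3, \dots, t_{i+1}, t_{i+2}, \dots, t_m, \dots, t_n$
$\phi(P) = p_0 + p_1 r + p_2 r^2 + \dots + p_{m-1} r^{m-1}$
$\phi(T_{i,i+m-1}) = t_i r^{m-1} + t_{i+1} r^{m-2} + \dots + t_{i+m-1}$
How can I calculate $\phi(T_{i+1,i+m})$ from $\phi(T_{i,i+m-1})$ without remembering $t_i$?
($t_i$ leaves the window and $t_{i+m}$ enters.)
All the calculations are in $F_q$.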
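The rolling update behind the fingerprint can be sketched as follows. This is the classic sliding-window version, which does remember the window (and so needs O(m) memory); it is shown only to make the update rule concrete. The function name, the choice of modulus `q`, and the extra verification step are assumptions of this sketch, not taken from the slides.

```python
import random


def karp_rabin(text, pattern, q=(1 << 61) - 1):
    """Find occurrences of pattern in text with a rolling fingerprint mod q."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    r = random.randrange(2, q - 1)   # random base, all arithmetic is mod q
    top = pow(r, m - 1, q)           # weight of the character leaving the window
    fp = ft = 0
    for j in range(m):               # fingerprints of the pattern and first window
        fp = (fp * r + ord(pattern[j])) % q
        ft = (ft * r + ord(text[j])) % q
    matches = []
    for i in range(n - m + 1):
        # Equal fingerprints are verified directly, so no false positives here.
        if fp == ft and text[i:i + m] == pattern:
            matches.append(i)
        if i + m < n:
            # Drop text[i] (this is where t_i is needed), shift, add text[i+m].
            ft = ((ft - ord(text[i]) * top) * r + ord(text[i + m])) % q
    return matches


print(karp_rabin("abracadabra", "abra"))
```

The subtraction of `ord(text[i]) * top` is the step that requires the old character, which is the obstacle the question on the slide points at.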
Streaming pattern matching
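P = z…: the pattern starts with z, and there are no more z's in the pattern.
We keep only the signature of the pattern.
In the text T, start signing whenever a z arrives. A new z restarts the signature, because two occurrences of P cannot overlap.
After m characters, compare the running signature with the signature of the pattern.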
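A minimal sketch of this first case, assuming a polynomial signature modulo a prime `q`. The class name, the `feed` method, and the choice of `q` are hypothetical and only for illustration. After construction the object keeps only the first character, the length m, and the signature of the pattern, not the pattern itself.

```python
import random


class UniqueFirstCharMatcher:
    """Streaming matcher for a pattern whose first character occurs only once in it."""

    def __init__(self, pattern, q=(1 << 61) - 1):
        if not pattern or pattern[0] in pattern[1:]:
            raise ValueError("first character must be unique in the pattern")
        self.z, self.m, self.q = pattern[0], len(pattern), q
        self.r = random.randrange(2, q - 1)   # random base in F_q
        self.target = 0                       # signature of the pattern
        for c in pattern:
            self.target = (self.target * self.r + ord(c)) % q
        self.sig, self.count = 0, 0           # running signature, characters signed

    def feed(self, c):
        """Consume one text character; return True if an occurrence seems to end here."""
        if c == self.z:                       # a z restarts the signature
            self.sig, self.count = ord(c) % self.q, 1
        elif self.count > 0:                  # keep signing the current candidate
            self.sig = (self.sig * self.r + ord(c)) % self.q
            self.count += 1
        else:                                 # no candidate is open
            return False
        if self.count == self.m:              # candidate complete: compare signatures
            hit = self.sig == self.target
            self.count = 0
            return hit
        return False


matcher = UniqueFirstCharMatcher("zab")
print([i for i, c in enumerate("zazabzzab") if matcher.feed(c)])
```

Because only signatures are compared, a reported match can be wrong with small probability; that is the price of the randomized approach.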
No z
P = U…: there is a prefix U such that U appears only once in the pattern.
In the text T, start signing at each occurrence of U, and compare with the signature of the pattern.
|U| <= m/2: seek U in recursion.
No small U
Look at the first m/2 characters of the pattern. They appear again somewhere in the pattern.
Option 1: P = v v v v v v v v followed by a prefix of v
Option 2: P = v v v v w, where w isn't a prefix of v and v isn't a prefix of w
|v| <= m/2
Solving this case
Option 2: P = v v v v w, |v| <= m/2
Search in recursion for v, and count how many times you found it.
Sign on w: after the run v v v v in the text T, start signing and compare with the signature of w.
Solving this case - continue
Using O(log m) signatures and counters in the worst case
Time = O(log m) in the worst case
Figure: a run v v v of length > m/2 followed by a remainder of length < m/2, where signing starts.
Pattern matching up to k mistakes
1-mismatch
Pattern matching up to k mistakes
Chinese Remainder Theorem
Let n and m be coprime.
If a mod n = b mod n and a mod m = b mod m,
then a mod nm = b mod nm.
1-mismatch
Pattern: p1, p2, p3, …, pm
mod 2: p1,p3,p5 … and p2,p4,p6 …
mod 3: p1,p4,p7 …, p2,p5,p8 … and p3,p6,p9 …
Choose primes $q_1, q_2, q_3, \dots, q_l$ such that $\prod_{i=1}^{l} q_i \ge m$.
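The split into subpatterns can be sketched as below. Both helper names are hypothetical; taking the smallest primes first is an assumption of this sketch, since the slide only requires that their product is at least m. Indices are 0-based here, so the slide's p1,p3,p5… is `pattern[0::2]`.

```python
def first_primes_covering(m):
    """Smallest primes q_1 < q_2 < ... whose product is at least m."""
    primes, product, candidate = [], 1, 2
    while product < m:
        # candidate is prime if no smaller prime found so far divides it
        if all(candidate % p for p in primes):
            primes.append(candidate)
            product *= candidate
        candidate += 1
    return primes


def residue_subpatterns(pattern, q):
    """Group the pattern positions by their index mod q."""
    return [pattern[r::q] for r in range(q)]


print(first_primes_covering(100))
print(residue_subpatterns("abcdefg", 3))
```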
1-mismatch
mod 2: subpatterns p1,p3,p5 … and p2,p4,p6 … are matched against the text subsequences t1,t3,t5 … and t2,t4,t6 …
mod 3: subpatterns p1,p4,p7 …, p2,p5,p8 … and p3,p6,p9 …
Overall number of subpatterns: the sum of all the primes.
Problem
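mod 2: subpatterns p1,p3,p5 … and p2,p4,p6 …; text subsequences t1,t3,t5 … and t2,t4,t6 …
Depending on the alignment, the subpattern p1,p3,p5 … may have to be compared with t2,t4,t6 … instead of t1,t3,t5 …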
When do we compare?
For each $q_i$ we start to compare for each alignment $0 \le \sigma < q_i$.
Space complexity
For each $q_i$ we run our algorithm $q_i$ times, once for each alignment.
For each alignment we run it again $q_i$ times, once for each shift.
Overall: $\sum_{i=1}^{l} q_i^2 \log m \in O\left(\frac{\log^4 m}{\log\log m}\right)$
Time complexity
Each character goes to just one alignment for each shift.
Overall: $\sum_{i=1}^{l} q_i \log m \in O\left(\frac{\log^3 m}{\log\log m}\right)$
1-mismatch
Lemma 1: There is exactly one mismatch if and only if there is exactly one subpattern in each group that does not match.
Proof idea: C.R.T.
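The lemma can be illustrated offline with the sketch below. It compares the subpatterns directly instead of by signature, and the function name and explicit `primes` argument are assumptions for illustration. If every group has exactly one bad residue class, the Chinese Remainder Theorem pins down the single mismatch position, provided the product of the primes is at least the pattern length.

```python
def locate_single_mismatch(pattern, window, primes):
    """Return the position of the only mismatch, or None if there is not exactly one.

    Assumes len(pattern) == len(window) and that the product of primes >= len(pattern).
    """
    residues = []
    for q in primes:
        # Residue classes mod q whose subpattern differs from the text subsequence.
        bad = [r for r in range(q) if pattern[r::q] != window[r::q]]
        if len(bad) != 1:
            return None          # zero mismatches, or at least two
        residues.append(bad[0])
    # By the C.R.T. exactly one position below len(pattern) has these residues.
    for j in range(len(pattern)):
        if all(j % q == r for q, r in zip(primes, residues)):
            return j
    return None


print(locate_single_mismatch("abcdefghij", "abcdefgXij", [2, 3, 5]))
```

Two mismatches at different positions below m must differ modulo at least one of the primes, which is why some group then shows two bad subpatterns.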
Pattern matching up to k mistakes
Group testing/ Random selector…
A black box for online approximate pattern matching
Raphaël Clifford, Benny Porat, Ely Porat
CPM 2008
The idea
We split the pattern into log(m) consecutive subpatterns.
P = p1, p2, p3, …, pm-3, pm-2, pm-1, pm
P1 = pm
P2 = pm-2, pm-1
P4 = pm-6, pm-5, pm-4, pm-3
…
Pm/2 = p1, p2, p3, …, pm/2
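The split can be sketched as follows, cutting pieces of length 1, 2, 4, … from the end of the pattern. The function name is hypothetical, and letting the final piece take whatever characters remain is an assumption of this sketch for lengths that do not divide evenly.

```python
def split_pattern(pattern):
    """Cut the pattern from its end into pieces of length 1, 2, 4, ..."""
    parts = []
    end, length = len(pattern), 1
    while end > 0:
        start = max(0, end - length)   # the last piece may be shorter
        parts.append(pattern[start:end])
        end = start
        length *= 2
    return parts


print(split_pattern("abcdefgh"))
```

The number of pieces is O(log m), and joining them in reverse order gives back the pattern.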
Bring it online
Let us look at a subpattern of length m’, Pm’.
When we get to the i’th character of the text, where is Pm’ aligned?
Conclusion 1: We need to know DIFF(Pm’, T(i-m’, i)) only at position i+m’ of the text.
Figure: the pattern … Pm’ … pm-2, pm-1, pm aligned against the text around ti.
The idea…
For each subpattern of length m’, we partition the text into overlapping substrings of length 2m’.
Figure: the text cut into blocks of length m’, covered by overlapping partitions of length 2m’.
The idea…
For each subpattern of length m’ we run the offline algorithm on each partition of the text separately.
This ensures that we get the differences on time.
If i = 2lm’ or i = 2lm’+m’ for some l, run the offline algorithm on the last 2m’ characters of the text, ending at ti.
We then get all the differences for this section.
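A minimal sketch of the partition step, using a naive Hamming-distance routine as a stand-in for whatever offline algorithm is plugged into the black box. Both function names are assumptions for illustration. Windows of length 2m’ start every m’ characters, so each alignment of the subpattern falls completely inside some window.

```python
def hamming_offline(pattern, window):
    """Stand-in offline algorithm: Hamming distance at every alignment in window."""
    m = len(pattern)
    return [sum(a != b for a, b in zip(pattern, window[s:s + m]))
            for s in range(len(window) - m + 1)]


def diffs_by_windows(sub, text):
    """Run the offline routine on overlapping windows of length 2m' of the text."""
    mp = len(sub)
    out = {}
    # A window ends at every position that is a multiple of m' (from 2m' on).
    for end in range(2 * mp, len(text) + 1, mp):
        start = end - 2 * mp
        for s, d in enumerate(hamming_offline(sub, text[start:end])):
            out[start + s] = d        # DIFF for the alignment starting at start+s
    return out


diffs = diffs_by_windows("ab", "abbaabab")
print([diffs[s] for s in sorted(diffs)])
```

In this sketch the text length is a multiple of m’; a trailing partial block would simply wait for the next window boundary.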
Running Time
T(n,m) = nT(m): the running time of the offline algorithm.
For each subpattern of length m’ we get $\frac{n}{m'} - 1$ overlapping partitions.
Total time for each subpattern: $O\left(\frac{n}{m'} \cdot 2m' \cdot T(m')\right) = O(nT(m'))$
Total time: $O\left(\sum_{j=1}^{\log m} T(n, 2^{j-1})\right)$
The problem
We saw that the overall time is good. But:
Figure: text positions 2tm’, 2tm’+m’ and ti = 2(t+1)m’, with Pm/2 (m’ = m/2) aligned against a partition of length 2m’ = m.
We must wait until the run of the offline algorithm on Pm/2 and the last m characters finishes before we can return the answer for ti. => (m/2)T(m) time!
The solution
We split the text into overlapping partitions of length 1.5m’.
Figure: the text cut into blocks of length m’, covered by overlapping partitions of length 1.5m’.
The solution…
The latest we will get DIFF(Pm’, T(i-m’, i)) is at index i+m’/2.
By Conclusion 1, we can wait m’/2 more characters before we need this difference.
Conclusion 1: We need to know DIFF(Pm’, T(i-m’, i)) only at position i+m’ of the text.
Spreading the work
So we can spread the work over the next m’/2 characters.
Figure: the text in pieces of m’/2 characters. While the next pieces arrive we work on partition P1, then on P2, then on P3.
The work on P1 is finished by the time we need to know the difference of P1.
Spreading the work…
Overall, we can spread the work for a specific subpattern equally among all the characters of the text.
All that is left is to check that the running time does not change.
Running Time
T(n,m) = nT(m): the running time of the offline algorithm.
For each subpattern of length m’ we now get $\frac{n}{m'/2}$ overlapping partitions.
Total time for each subpattern: $O\left(\frac{n}{m'} \cdot 2m' \cdot T(m')\right) = O(nT(m'))$
Total time for all the text: $O\left(\sum_{j=1}^{\log m} T(n, 2^{j-1})\right)$
No change!
Running Time…
By spreading the work we get a total running time for each character of
$O\left(\sum_{j=1}^{\log m} T(n, 2^{j-1}) / n\right)$
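As a worked instance of this bound, assume (only for illustration) an offline algorithm with $T(n,m) = O(n \log m)$, for example one based on fast convolution. Substituting into the per-character bound gives:

```latex
\sum_{j=1}^{\log m} \frac{T(n, 2^{j-1})}{n}
  = \sum_{j=1}^{\log m} O\!\left(\log 2^{j-1}\right)
  = O\!\left(\sum_{j=1}^{\log m} j\right)
  = O\!\left(\log^2 m\right)
```

So under this assumption each text character costs $O(\log^2 m)$ time in the worst case, not just on average.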
conclusion
We give a space lower bound for deterministic online pattern matching.
We give a black box algorithm that can adapt any offline algorithm into an online algorithm, using only O(m) space and taking $O\left(\sum_{j=1}^{\log m} T(n, 2^{j-1}) / n\right)$ time per character.