Pattern Matching in the streaming model Ely Porat Google inc & Bar-Ilan University.
Problem definition - Pattern Matching
Given a text T of length n and a pattern P of length m, the problem is to find all the substrings of T that are equal to P.
Motivation…
• Viruses and malware
Software solutions: Snort: 73.5Mb, ClamAV: 1.48Gb
Using TCAMs: Snort: 680Kb, ClamAV: 25Mb
Our solution (software): Snort: 51Kb, ClamAV: 216Kb
Streaming model
250 BPS 250 BPS
We can't store the whole input
In our case we seek an algorithm that requires poly(log m) space
Related work
• Karp-Rabin: Randomized Algorithm for exact pattern matching
• Clifford, Porat, and Porat: A black box algorithm for online approximate pattern matching
– Almost any pattern matching algorithm can be converted to run online.
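Karp-Rabin Algorithm
Pattern: $p_0 p_1 p_2 p_3 \dots p_{m-1}$
Text: $t_0 t_1 t_2 \dots t_i t_{i+1} \dots t_{i+m-1} t_{i+m} \dots t_n$
Signature of the pattern: $p_0 r^{m-1} + p_1 r^{m-2} + p_2 r^{m-3} + \dots + p_{m-1} \bmod q$
$S_i = t_i r^{m-1} + t_{i+1} r^{m-2} + \dots + t_{i+m-1} \bmod q$
$S_{i+1} = t_{i+1} r^{m-1} + \dots + t_{i+m-1} r + t_{i+m} \bmod q$
$S_{i+1} = S_i r + t_{i+m} - t_i r^m \bmod q$
Requires O(m) memory
r is chosen at random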
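The rolling update above can be sketched in Python as follows. This is a minimal illustration, not the speaker's code: the function name `karp_rabin`, the choice of the Mersenne prime 2^61 - 1 as q, and the use of `ord` to map characters to numbers are all assumptions. The deque holding the last m characters is where the O(m) memory goes, since the update needs t_i when the window slides.

```python
import random
from collections import deque

Q = (1 << 61) - 1  # assumed modulus q (a Mersenne prime); the slides do not fix q


def karp_rabin(text, pattern, r=None, q=Q):
    """Return start positions where the window signature equals the pattern signature."""
    m = len(pattern)
    if m == 0 or m > len(text):
        return []
    if r is None:
        r = random.randrange(2, q - 1)  # r is chosen at random
    r_m = pow(r, m, q)                  # r^m, needed to remove the oldest character

    # Signature of the pattern: p_0 r^(m-1) + ... + p_(m-1) mod q (Horner's rule).
    sig_p = 0
    for c in pattern:
        sig_p = (sig_p * r + ord(c)) % q

    window = deque()  # the last m characters: this is the O(m) memory
    s = 0
    matches = []
    for pos, c in enumerate(text):
        window.append(ord(c))
        s = (s * r + ord(c)) % q                      # S_i * r + t_(i+m)
        if len(window) > m:
            s = (s - window.popleft() * r_m) % q      # ... - t_i * r^m
        if len(window) == m and s == sig_p:
            matches.append(pos - m + 1)               # equal signatures: match w.h.p.
    return matches
```

Equal signatures indicate a match only with high probability, so a reported position can in principle be a false positive.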
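The idea - Simple case
The pattern starts with z, and there are no more z's in the pattern.
[Figure: the pattern P, which begins with z, and the text T; at each z in T the algorithm starts signing, and the resulting signature is compared with the signature of P.]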
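A minimal Python sketch of this simple case, under stated assumptions: the name `stream_match_unique_first`, the fixed values of q and r, and the Karp-Rabin style signature are illustrative choices, not taken from the slides. Because z occurs only at the start of the pattern, a new z in the text cancels the attempt in progress, so one signature and one counter are enough.

```python
Q = (1 << 61) - 1   # assumed prime modulus q
R = 1_000_003       # fixed here for reproducibility; the slides choose r at random


def signature(s, r=R, q=Q):
    """p_0 r^(m-1) + p_1 r^(m-2) + ... + p_(m-1) mod q."""
    sig = 0
    for c in s:
        sig = (sig * r + ord(c)) % q
    return sig


def stream_match_unique_first(stream, pattern, r=R, q=Q):
    """Yield start positions of pattern, assuming pattern[0] occurs nowhere else in it."""
    z, m = pattern[0], len(pattern)
    assert z not in pattern[1:], "simple case only: first character must be unique"
    sig_p = signature(pattern, r, q)
    sig, length = 0, 0            # length == 0 means "not signing"
    for pos, c in enumerate(stream):
        if c == z:
            sig, length = ord(c) % q, 1          # start signing at every z
        elif length:
            sig, length = (sig * r + ord(c)) % q, length + 1
        if length == m:
            if sig == sig_p:                      # equal signatures: match w.h.p.
                yield pos - m + 1
            length = 0                            # stop signing until the next z
```

For example, `list(stream_match_unique_first("xzabzabczab", "zab"))` walks the text once and keeps only `sig` and `length` between characters.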
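Case 1
There is a prefix U such that U appears only once in the pattern.
U has length at most m/2; seek it in recursion.
[Figure: the pattern P with its prefix U, and the text T; where U is found in T the algorithm starts signing, and the resulting signature is compared with the signature of P.]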
Case 2: No small U
Look at the first m/2 characters, W. They appear again somewhere in the pattern.
Option 1: P = v v v v v v v v followed by a prefix of v
Option 2: P = v v v v w, where w isn't a prefix of v and v isn't a prefix of w
|v| <= m/2
Solving case 2
Option 2: P = v v v v w, |v| <= m/2
Search in recursion for v, and count how many times you found it.
Sign on w.
[Figure: the text T with a run v v v v, marked "Start signing" and "Signature".]
Solving case 2 - continue
Using O(log m) signatures and counters in the worst case.
[Figure: a run v v v labelled > m/2, followed by a signed part labelled < m/2.]
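Rothschild signature 07
Pattern: $p_0 p_1 p_2 p_3 \dots p_{m-1}$
Karp-Rabin signature: $p_0 r^{m-1} + p_1 r^{m-2} + p_2 r^{m-3} + \dots + p_{m-1} \bmod q$
Rothschild signature: $p_0 + p_1 r + p_2 r^2 + \dots + p_{m-1} r^{m-1} \bmod q$
Text: $t_0 t_1 t_2 t_3 \dots t_i$
$S_i = \sum_{j=0}^{i} t_j r^j \bmod q$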
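The prefix signatures $S_i$ can be maintained one character at a time by keeping the current power of r next to the running sum. The sketch below assumes symbols are already integers; the generator name `prefix_signatures` is hypothetical.

```python
def prefix_signatures(symbols, r, q):
    """Yield S_i = sum_{j=0..i} t_j * r^j mod q for every prefix of the stream."""
    s, power = 0, 1                 # power holds r^i mod q for the next symbol
    for t in symbols:
        s = (s + t * power) % q
        power = (power * r) % q
        yield s
```

With q = 7 and r = 3, the symbols 0, 1, 1, 0, 1, 1, 1 give the prefix signatures 0, 3, 5, 5, 2, 0, 1. Only the pair `(s, power)` has to be stored between characters.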
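Forward signatures
There is a prefix U such that U appears only once in the pattern.
U has length at most m/2; seek it in recursion.
When U is found ending at position i, calculate $X = S_i + Sig \cdot r^{i+1} \bmod q$, where Sig is the signature of the part of the pattern that follows U, and remember X for this position.
At the position where the pattern would end, check whether the text signature is equal to X.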
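The check works because appending a block to a prefix adds that block's signature shifted by $r^{i+1}$. A small Python sketch of the calculation follows; the helper names `sig` and `expected_prefix_signature` are assumptions, and in a streaming setting $r^{i+1}$ would be kept incrementally rather than recomputed with `pow`.

```python
def sig(symbols, r, q):
    """Sig(x_0 ... x_k) = sum_j x_j * r^j mod q."""
    s, power = 0, 1
    for x in symbols:
        s = (s + x * power) % q
        power = power * r % q
    return s


def expected_prefix_signature(s_i, i, sig_rest, r, q):
    """X = S_i + Sig * r^(i+1) mod q: the prefix signature expected once the rest has been read."""
    return (s_i + sig_rest * pow(r, i + 1, q)) % q
```

With q = 7 and r = 3, take U = 0, 1, 1 ending at i = 2, so $S_2 = 5$. If the rest to verify is 0, 1, 1, 1, whose signature is 4, then X = 5 + 4 · 3^3 mod 7 = 1, which is the signature of 0, 1, 1, 0, 1, 1, 1.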
Example: q=7, r=3
P: 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1
Shorter prefixes of P: 0, 1, 1, 0, 1, 1, 1 and 0, 1, 1
[Figure: the text T, the powers $r^i \bmod q$, and the signatures maintained at Level 1, Level 2 and Level 3.]
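Worst case - time
Amortized O(1), but what about the worst case?
At text position $t_i$ there can be several remembered values $X_1, X_2, \dots, X_{\log m}$ to check.
Check using a hash table.
What if $X_1 = X_2 = \dots = X_{\log m}$?
We can work with a lazy approach without a blowup in memory.
Time: O(1)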
Multi-Pattern search (dictionary matching)
• Given a set of patterns D={P1,P2,P3,…,Pd}
– The patterns can be of different lengths
• We want to report whenever one of the patterns appears.
• Our algorithm will require $O(\sum_{i=1}^{d} \log|P_i|)$ memory, and O(log d) time per text character.
Multi-Pattern search (dictionary matching)
• Denote $M = \max_i |P_i|$
• Our algorithm will have 2 cases:
– Case 1: d>M
– Case 2: d<M
Case 1: d>M
• In this case we can allocate an array of size M+1
$t_0 t_1 t_2 t_3 \dots t_{l-M} t_{l-M+1} \dots t_l$
$S_{l-M} S_{l-M+1} \dots S_l$
$S_i = \sum_{j=0}^{i} t_j r^j \bmod q$
It is easy to maintain such a sliding window in O(1) time and O(M) memory.
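One way to keep the last M+1 prefix signatures is a bounded queue, sketched below. The class name `SignatureWindow` and its methods are hypothetical. A `deque` is used for brevity; a fixed-size list used as a ring buffer would give strict O(1) indexing at every offset.

```python
from collections import deque


class SignatureWindow:
    """Keeps S_(l-M) ... S_l, where S_i = sum_{j<=i} t_j r^j mod q."""

    def __init__(self, M, r, q):
        self.r, self.q = r, q
        self.window = deque(maxlen=M + 1)  # oldest entry is dropped automatically
        self.s = 0        # S_l
        self.power = 1    # r^(l+1) mod q, the weight of the next character
        self.l = -1       # index of the last character read

    def push(self, t):
        """Read one more text symbol t (an integer) and record the new S_l."""
        self.s = (self.s + t * self.power) % self.q
        self.power = self.power * self.r % self.q
        self.l += 1
        self.window.append(self.s)

    def get(self, i):
        """Return S_i for an index i that is still inside the window."""
        offset = i - (self.l - len(self.window) + 1)
        if not 0 <= offset < len(self.window):
            raise IndexError("S_i is outside the sliding window")
        return self.window[offset]
```

Each `push` does a constant number of arithmetic operations, and the structure never stores more than M+1 signatures.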
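Case 1: d>M - continue
$Sig(x_0 x_1 x_2 \dots x_i) = \sum_{j=0}^{i} x_j r^j \bmod q$
For each $P_i$ in D: ($P_i = a_0 a_1 a_2 \dots a_{m_i-1}$)
    $e = m_i$
    while $e \ne 0$:
        find j s.t. $2^j \le e$ and $2^{j+1} > e$
        $e = e - 2^j$
        if $e \ne 0$: HashTable($Sig(a_e a_{e+1} \dots a_{m_i-1})$)
    HashTable($Sig(a_0 a_1 \dots a_{m_i-1})$, $match_i$)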
Example
$P_i = a_0 a_1 a_2 \dots a_{38}$
We will store in the hash table:
$Sig(a_7 a_8 \dots a_{38})$
$Sig(a_3 a_4 \dots a_{38})$
$Sig(a_1 a_2 \dots a_{38})$
$Sig(a_0 a_1 \dots a_{38})$, $match_i$
We will store at most log |Pi| points
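The choice of stored suffixes can be reproduced with a few lines of Python. This is a sketch under assumptions: `anchor_offsets`, `sig` and `build_table` are hypothetical names, the hash table is a plain `dict`, and characters are mapped to numbers with `ord`.

```python
Q = (1 << 61) - 1   # assumed prime modulus q
R = 1_000_003       # fixed here for reproducibility; the slides choose r at random


def anchor_offsets(m):
    """Start offsets e of the proper suffixes stored for a pattern of length m."""
    offsets, e = [], m
    while e != 0:
        j = e.bit_length() - 1      # largest j with 2^j <= e
        e -= 1 << j
        if e != 0:
            offsets.append(e)
    return offsets


def sig(s, r=R, q=Q):
    """Sig(x_0 ... x_k) = sum_j x_j * r^j mod q."""
    total, power = 0, 1
    for ch in s:
        total = (total + ord(ch) * power) % q
        power = power * r % q
    return total


def build_table(patterns, r=R, q=Q):
    """Map suffix signatures to None and whole-pattern signatures to the pattern index."""
    table = {}
    for idx, p in enumerate(patterns):
        for e in anchor_offsets(len(p)):
            table.setdefault(sig(p[e:], r, q), None)
        table[sig(p, r, q)] = idx   # plays the role of match_i
    return table
```

For a pattern of length 39, `anchor_offsets(39)` returns the offsets 7, 3 and 1, which are the three suffixes listed in the example; the number of offsets is at most the number of 1 bits in the length, hence at most log |Pi|.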
Case 1: d>M
Notice that it takes O(1) time to calculate $Sig(t_i t_{i+1} \dots t_l)$ from the sliding window $S_{l-M} S_{l-M+1} \dots S_l$:
$Sig(t_i t_{i+1} \dots t_l) = \frac{S_l - S_{i-1}}{r^i} \bmod q$
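In modular arithmetic the division by $r^i$ is a multiplication by the inverse of $r^i$ modulo q, which exists when q is prime and r is not a multiple of q. A small Python sketch follows (function names are hypothetical; `pow` with a negative exponent and a modulus needs Python 3.8 or later). A streaming implementation would maintain the inverse powers incrementally instead of calling `pow` each time.

```python
def prefix_signatures(symbols, r, q):
    """Return the list S_0, S_1, ... with S_i = sum_{j<=i} t_j r^j mod q."""
    out, s, power = [], 0, 1
    for t in symbols:
        s = (s + t * power) % q
        power = power * r % q
        out.append(s)
    return out


def substring_signature(S, i, l, r, q):
    """Sig(t_i ... t_l) = (S_l - S_(i-1)) / r^i mod q, taking S_(-1) = 0."""
    before = S[i - 1] if i > 0 else 0
    inv = pow(r, -i, q)             # modular inverse of r^i
    return (S[l] - before) * inv % q
```

With q = 7, r = 3 and the text 0, 1, 1, 0, 1, 1, 1, the call `substring_signature(S, 3, 6, 3, 7)` gives 4, the same value as the signature of 0, 1, 1, 1 computed from scratch.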
Case 1: d>M - continue
We will do a binary search over the sliding window $S_{l-M} S_{l-M+1} \dots S_l$.
Position $l-2^j$: is $\frac{S_l - S_{l-2^j-1}}{r^{l-2^j}}$ in the HashTable? No.
Position $l-2^{j-1}$: is $\frac{S_l - S_{l-2^{j-1}-1}}{r^{l-2^{j-1}}}$ in the HashTable? Yes.
Position $l-2^{j-1}-2^{j-2}$: is $\frac{S_l - S_{l-2^{j-1}-2^{j-2}-1}}{r^{l-2^{j-1}-2^{j-2}}}$ in the HashTable?
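The descent over power-of-two lengths can be illustrated for a single pattern. The sketch below is deliberately simplified and makes several assumptions: all names are hypothetical, the prefix signatures of the whole text are kept in a list instead of a window of size M+1, suffix lengths are counted so that a suffix of length L starts at position l-L+1, and the interaction between stored suffixes of several different patterns is not handled.

```python
Q = (1 << 61) - 1   # assumed prime modulus q
R = 1_000_003       # fixed here for reproducibility; the slides choose r at random


def sig(s, r=R, q=Q):
    """Sig(x_0 ... x_k) = sum_j x_j * r^j mod q."""
    total, power = 0, 1
    for ch in s:
        total = (total + ord(ch) * power) % q
        power = power * r % q
    return total


def prefix_signatures(text, r=R, q=Q):
    out, total, power = [], 0, 1
    for ch in text:
        total = (total + ord(ch) * power) % q
        power = power * r % q
        out.append(total)
    return out


def suffix_signature(S, length, l, r=R, q=Q):
    """Signature of the `length` characters ending at position l."""
    i = l - length + 1
    before = S[i - 1] if i > 0 else 0
    return (S[l] - before) * pow(r, -i, q) % q


def build_table(pattern, r=R, q=Q):
    """Stored suffix signatures map to False, the whole pattern maps to True."""
    table, e = {}, len(pattern)
    while e != 0:
        e -= 1 << (e.bit_length() - 1)
        if e != 0:
            table[sig(pattern[e:], r, q)] = False
    table[sig(pattern, r, q)] = True
    return table


def pattern_ends_at(S, l, table, m, r=R, q=Q):
    """Try lengths built from decreasing powers of two; keep a length on 'Yes'."""
    length, hit = 0, False
    for j in range(m.bit_length() - 1, -1, -1):
        cand = length + (1 << j)
        if cand > m or cand > l + 1:
            continue                              # candidate does not fit
        key = suffix_signature(S, cand, l, r, q)
        if key in table:                          # "Yes": extend from this length
            length, hit = cand, table[key]
    return hit
```

For the pattern "abracadabra" (length 11 = 8 + 2 + 1) the table holds the signatures of its suffixes of lengths 8, 10 and 11, and the descent at a position where the pattern ends answers "Yes" at exactly those three lengths. Each position costs at most log m hash table lookups.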
Case 2: d<M
• In this case we will split our dictionary D into 2 dictionaries:
– D1 – all the patterns shorter than d. On this dictionary we will run case 1.
– D2 – all the patterns longer than d. We need only to deal with this case.
Case 2: d<M - continue
For each $P_i$ in D2:
$P_i = a_0 a_1 a_2 \dots a_{d-1} a_d \dots a_m$
$SP_i = Sig(a_0 a_1 \dots a_{d-1})$
Store $SP_i$ in the hash table.
Case 2: d<M - continue
If $P_i$ contains a periodic prefix of length more than d:
$P_i = u u u u u u v \dots a_m$
$SP_i$ is seen repeatedly along the periodic prefix; w.h.p. the signature seen at v won't be $SP_i$.
We store as well the number of times we need to see $SP_i$.
We will start a process which seeks $P_i$ only after seeing $SP_i$ enough times. Therefore the number of characters between 2 processes for $P_i$ is at least d.
Case 2: d<M - continue
• We run the algorithm from the beginning of the lecture.
• Amortized, it takes O(1/d) time per pattern per text character.
• Overall it takes O(1) amortized time per text character.
• With the lazy approach we get O(1) time in the worst case.
Open problems
• Multi-pattern search: case 2 takes O(1) time, however case 1 takes O(log d)
– Improve case 1 to be O(1)
– With a heuristic, almost all dictionaries take O(1) time, and O(1) space per pattern.
• Lower bound
– We believe that the single pattern search lower bound is Ω(log m log δ)
• Find more clients
• Find a place for sabbatical (~1/1/2012-30/9/2013)
Important things:
• Upcoming events:
– ICALP2011GT (July 3rd, one day before ICALP)
   • We will have some support for students
– Workshop on Sparsity and Computation, U. Mich., Aug 1-4
   • We will have some support for students
– IMA: Group Testing Designs, Algorithms, and Applications to Biology, Feb 13-17
– Stringology 2012