Real time pattern matching Porat Benny Porat Ely Bar-Ilan University.


Page 1: Real time pattern matching Porat Benny Porat Ely Bar-Ilan University.

Real time pattern matching

Benny Porat, Ely Porat

Bar-Ilan University

Page 2: Real time pattern matching Porat Benny Porat Ely Bar-Ilan University.

Pattern Matching

Given a text T and a pattern P, the problem is to find all the substrings of T that are equal to P.


Page 3: Real time pattern matching Porat Benny Porat Ely Bar-Ilan University.

Online pattern matching

We get the text character by character
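The online setting can be sketched in Python as a matcher that consumes one character per call. This is only an illustration of the interface, not the method of the talk: it swaps in the classic Knuth-Morris-Pratt automaton, whose per-character cost is amortized rather than worst-case, and the names `OnlineMatcher`, `feed` and `find_all` are hypothetical. A non-empty pattern is assumed.

```python
def failure_table(pattern):
    """KMP failure function: fail[i] is the length of the longest proper
    border (prefix that is also a suffix) of pattern[:i + 1]."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


class OnlineMatcher:
    """Consumes the text one character at a time and reports each match
    as soon as its last character arrives."""

    def __init__(self, pattern):
        self.pattern = pattern
        self.fail = failure_table(pattern)
        self.state = 0      # length of the longest pattern prefix ending here
        self.position = 0   # number of text characters seen so far

    def feed(self, ch):
        """Return the 0-based start of a match ending at ch, or None."""
        while self.state and ch != self.pattern[self.state]:
            self.state = self.fail[self.state - 1]
        if ch == self.pattern[self.state]:
            self.state += 1
        self.position += 1
        if self.state == len(self.pattern):
            self.state = self.fail[self.state - 1]
            return self.position - len(self.pattern)
        return None


def find_all(text, pattern):
    """Feed the text character by character and collect all match starts."""
    matcher = OnlineMatcher(pattern)
    return [start for ch in text if (start := matcher.feed(ch)) is not None]
```

For instance, `find_all("abababa", "aba")` reports the overlapping occurrences starting at 0, 2 and 4.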


Page 4: Real time pattern matching Porat Benny Porat Ely Bar-Ilan University.

Outline

Motivation

Presentation of 3 online models

Space lower bound

A black box algorithm

Exact and approximate pattern matching in the streaming model

Page 5: Real time pattern matching Porat Benny Porat Ely Bar-Ilan University.

Motivation…

Monitoring internet traffic

Page 6: Real time pattern matching Porat Benny Porat Ely Bar-Ilan University.

Motivation…

Stock market

Page 7: Real time pattern matching Porat Benny Porat Ely Bar-Ilan University.

Motivation…

Espionage

Page 8: Real time pattern matching Porat Benny Porat Ely Bar-Ilan University.

Motivation…

Viruses and malware

Page 9: Real time pattern matching Porat Benny Porat Ely Bar-Ilan University.

3 online models

Model    Read-only memory                 Working memory
First    m, for saving the pattern        O(m)
Second   m, for saving the pattern        O(poly(log m))
Third    0, we can’t save the pattern     O(poly(log m))

Page 10: Real time pattern matching Porat Benny Porat Ely Bar-Ilan University.

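Space lower bound (deterministic)

Assume an algorithm A that uses o(m) space for solving the online pattern matching problem.

Alice holds a string S = s1, s2, s3, …, sm. She gives S to A as the pattern and sends the resulting state of A to Bob.

Bob runs over all the strings Q = q1, q2, …, qm and feeds each Q to A as the text.

A reports a match exactly when Q = S. So Bob recovers all of S from a message of only o(m) space, which is impossible for an arbitrary S.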

Page 11: Real time pattern matching Porat Benny Porat Ely Bar-Ilan University.

A black box for online approximate pattern matching

Raphaël Clifford, Benny Porat, Ely Porat

CPM 2008

Page 12: Real time pattern matching Porat Benny Porat Ely Bar-Ilan University.

Black box for the first model

Model    Read-only memory                 Working memory
First    m, for saving the pattern        O(m)

Page 13: Real time pattern matching Porat Benny Porat Ely Bar-Ilan University.

Problem definition

There are many offline pattern matching algorithms.

We want a black box algorithm that takes most offline pattern matching algorithms and converts them to pseudo real time.

Pseudo real time: take the best running time of the offline algorithm and divide it by n. This bounds the time per character.

Not amortized!

Page 14: Real time pattern matching Porat Benny Porat Ely Bar-Ilan University.

Results

For example, we can apply our algorithm to the following problems (time per character):

Hamming norm:        $O(\sqrt{m \log m})$
k-mismatch:          $O(\sqrt{k \log k}\,\log m)$
Matching under L2:   $O(\log^2 m)$
Matching under L1:   $O(\sqrt{m \log m})$
Online convolution:  $O(\log^2 m)$

Page 15: Real time pattern matching Porat Benny Porat Ely Bar-Ilan University.

Exact And Approximate Pattern Matching In The Streaming Model

Benny Porat, Ely Porat

FOCS 2009

Page 16: Real time pattern matching Porat Benny Porat Ely Bar-Ilan University.

Solution for the third model

Model    Read-only memory                 Working memory
Third    0, we can’t save the pattern     O(poly(log m))

Pattern matching

Pattern matching up to k mistakes

Page 17: Real time pattern matching Porat Benny Porat Ely Bar-Ilan University.

It’s not minor!

Cache works much faster than RAM. Now it can fit!

Antivirus on routers

Researchers thought that there was a lower bound and that it couldn't be done.

Page 18: Real time pattern matching Porat Benny Porat Ely Bar-Ilan University.

Randomized algorithm (RK)

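Pattern: p_{m-1}, …, p_2, p_1, p_0

Text: t_1, t_2, t_3, …, t_{i+1}, t_{i+2}, …, t_{i+m}, …, t_n

$\varphi(P) = p_0 + p_1 r + p_2 r^2 + \dots + p_{m-1} r^{m-1}$

$\varphi(T_{i,\,i+m-1}) = t_i r^{m-1} + t_{i+1} r^{m-2} + \dots + t_{i+m-1}$

$\varphi(T_{i+1,\,i+m}) = t_{i+1} r^{m-1} + t_{i+2} r^{m-2} + \dots + t_{i+m}$

How can we calculate $\varphi(T_{i+1,\,i+m})$ from $\varphi(T_{i,\,i+m-1})$ without remembering $t_i$?

All the calculations are in $F_q$.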
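A minimal Python sketch of such a fingerprint follows, under stated assumptions: the field size `Q`, the fixed base `R` and all function names are illustrative choices, not values from the talk. `slide` is the textbook window update, which does need the outgoing character. `suffix_fingerprint` shows the algebraic identity phi(uv) = phi(u) * r^|v| + phi(v), which lets a substring's fingerprint be derived from two stored prefix fingerprints instead of from the characters themselves.

```python
Q = (1 << 61) - 1   # field size: the Mersenne prime 2^61 - 1 (an assumption)
R = 1_000_003       # fixed here for reproducibility; normally drawn at random from F_q


def fingerprint(s):
    """phi(s) = s[0]*R^(k-1) + s[1]*R^(k-2) + ... + s[k-1] (mod Q), via Horner's rule."""
    h = 0
    for ch in s:
        h = (h * R + ord(ch)) % Q
    return h


def slide(h, outgoing, incoming, m):
    """Window update for a length-m window: drop `outgoing`, append `incoming`.

    Note that it needs the outgoing character, i.e. it remembers t_i.
    """
    h = (h - ord(outgoing) * pow(R, m - 1, Q)) % Q
    return (h * R + ord(incoming)) % Q


def suffix_fingerprint(h_whole, h_prefix, suffix_len):
    """phi(v) from phi(uv) and phi(u), using phi(uv) = phi(u) * R^|v| + phi(v)."""
    return (h_whole - h_prefix * pow(R, suffix_len, Q)) % Q
```

Sliding the window "abc" by one character with `slide(fingerprint("abc"), "a", "d", 3)` gives the same value as `fingerprint("bcd")`.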

Page 19: Real time pattern matching Porat Benny Porat Ely Bar-Ilan University.

Streaming pattern matching

The pattern starts with z, and there are no more z's in the pattern.

[Figure: pattern P beginning with z, above the text T; at each z in T, "start signing" marks the beginning of a signature.]
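The simplest case, a pattern whose first character occurs nowhere else in it, can be sketched in Python as follows. This is an illustration under assumptions rather than the authors' code: the signature is taken to be a polynomial fingerprint with hypothetical constants `Q` and `R`, and the function name is invented. Because a new z rules out any candidate that started earlier, a single running signature is enough.

```python
Q = (1 << 61) - 1   # assumed field size (a Mersenne prime)
R = 1_000_003       # assumed base; normally chosen at random


def fingerprint(s):
    """Polynomial signature of s modulo Q."""
    h = 0
    for ch in s:
        h = (h * R + ord(ch)) % Q
    return h


def stream_match_unique_first(text_stream, pattern):
    """Yield 0-based starts of (probable) matches.

    Assumes pattern[0] occurs nowhere else in the pattern. While streaming,
    only z, m, the pattern signature and one running signature are kept.
    """
    m = len(pattern)
    z = pattern[0]
    if z in pattern[1:]:
        raise ValueError("sketch only covers the unique-first-character case")
    target = fingerprint(pattern)
    start = None   # text index where the current signature began
    sig = 0
    for i, ch in enumerate(text_stream):
        if ch == z:
            # A candidate that began earlier would need a second z inside
            # the pattern, so it can be dropped: start signing afresh.
            start, sig = i, 0
        if start is not None:
            sig = (sig * R + ord(ch)) % Q
            if i - start + 1 == m:
                if sig == target:   # equal signatures: report a match
                    yield start
                start = None
```

As with any fingerprint method, equal signatures indicate a match only with high probability when the base is chosen at random.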

Page 20: Real time pattern matching Porat Benny Porat Ely Bar-Ilan University.

No z

There is a prefix U such that U appears only once in the pattern.

[Figure: pattern P beginning with U, above the text T; at each U in T, "start signing" marks the beginning of a signature.]

|U| ≤ m/2: seek U in recursion.

Page 21: Real time pattern matching Porat Benny Porat Ely Bar-Ilan University.

No small U

Look at the first m/2 characters. They appear again somewhere.

Option 1: P = v v v v v v v v, followed by a prefix of v

Option 2: P = v v v v w, where w isn't a prefix of v and v isn't a prefix of w

|v| ≤ m/2

Page 22: Real time pattern matching Porat Benny Porat Ely Bar-Ilan University.

Solving this case

Option 2: P = v v v v w, |v| ≤ m/2

Search in recursion for v, and count how many times you found it.

Sign on w.

[Figure: text T containing the run v v v v, with "start signing" and "signature" markers.]

Page 23: Real time pattern matching Porat Benny Porat Ely Bar-Ilan University.

Solving this case - continue

Using O(log m) signatures and counters in the worst case.

Time = O(log m) in the worst case.

[Figure: Option 2 again, with a run v v v in the text, lengths labelled ">m/2" and "<m/2", and "start signing" and "signature" markers.]

Page 24: Real time pattern matching Porat Benny Porat Ely Bar-Ilan University.

Pattern Matching up to k mistakes

1-mismatch

Pattern Matching up to k mistakes

Page 25: Real time pattern matching Porat Benny Porat Ely Bar-Ilan University.

Chinese Remainder Theorem

Let n and m be two coprime integers. Then

$a \bmod n = b \bmod n$ and $a \bmod m = b \bmod m$

if and only if

$a \bmod nm = b \bmod nm$
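The statement can be checked by brute force on small values. The Python sketch below uses a hypothetical helper name and simply compares both sides of the equivalence for every pair below a limit; it also shows that the equivalence can fail when the moduli are not coprime.

```python
def congruence_equivalence(n, m, limit):
    """True if, for all 0 <= a, b < limit, agreeing modulo n and modulo m
    is the same as agreeing modulo n*m."""
    for a in range(limit):
        for b in range(limit):
            agree_both = (a % n == b % n) and (a % m == b % m)
            agree_product = a % (n * m) == b % (n * m)
            if agree_both != agree_product:
                return False
    return True
```

With the coprime pair 3 and 5 the check passes for all values below 60, while for 4 and 6 it fails: 0 and 12 agree modulo 4 and modulo 6 but not modulo 24.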

Page 26: Real time pattern matching Porat Benny Porat Ely Bar-Ilan University.

1-mismatch

Pattern: p_1, p_2, p_3, …, p_m

mod 2: p_1, p_3, p_5, … and p_2, p_4, p_6, …

mod 3: p_1, p_4, p_7, … and p_2, p_5, p_8, … and p_3, p_6, p_9, …

Take $q_1, q_2, q_3, \dots, q_l$ such that $\prod_{i=1}^{l} q_i \ge m$.

Page 27: Real time pattern matching Porat Benny Porat Ely Bar-Ilan University.

1-mismatch

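[Figure: the text is split in the same way as the pattern.]

mod 2: p_1, p_3, p_5, … against t_1, t_3, t_5, … and p_2, p_4, p_6, … against t_2, t_4, t_6, …

mod 3: p_1, p_4, p_7, … and p_2, p_5, p_8, … and p_3, p_6, p_9, …

Overall: the sum of all the primes.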

Page 28: Real time pattern matching Porat Benny Porat Ely Bar-Ilan University.

1-mismatch

[Figure: the mod 2 subpatterns p_1, p_3, p_5, … and p_2, p_4, p_6, … compared against the text subsequences t_1, t_3, t_5, … and t_2, t_4, t_6, ….]

Page 29: Real time pattern matching Porat Benny Porat Ely Bar-Ilan University.

Problem

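[Figure: the mod 2 subpatterns p_1, p_3, p_5, … and p_2, p_4, p_6, … against the text subsequences t_1, t_3, t_5, … and t_2, t_4, t_6, …; in another alignment p_1, p_3, p_5, … falls on t_2, t_4, t_6, ….]

When do we compare?

For each $q_i$ we start comparing at each alignment $0 \le \sigma < q_i$.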

Page 30: Real time pattern matching Porat Benny Porat Ely Bar-Ilan University.

Space complexity

For each $q_i$ we run our algorithm $q_i$ times, once for each alignment.

For each alignment we run it again $q_i$ times, once for each shift.

Overall:

$\sum_{i=1}^{l} q_i^2 \log m \in O\!\left(\frac{\log^4 m}{\log\log m}\right)$

Page 31: Real time pattern matching Porat Benny Porat Ely Bar-Ilan University.

Time complexity

Each character goes to just one alignment for each shift.

Overall:

$\sum_{i=1}^{l} q_i \log m \in O\!\left(\frac{\log^3 m}{\log\log m}\right)$

Page 32: Real time pattern matching Porat Benny Porat Ely Bar-Ilan University.

1-mismatch

Lemma 1: There is exactly one mismatch if and only if, in each group, there is exactly one subpattern that does not match.

Proof idea: C.R.T.
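The lemma can be illustrated offline with a short Python sketch. It is a simplification under assumptions: subpatterns are compared directly as strings, whereas a streaming algorithm would compare signatures; indices are 0-based; and the helper names are hypothetical. For each prime q the pattern and the aligned text window are split by position modulo q, and if exactly one class disagrees for every prime, the residues pin down the mismatch position by the Chinese Remainder Theorem.

```python
def primes_with_product_at_least(m):
    """Smallest primes 2, 3, 5, ... whose product is at least max(m, 2)."""
    primes, product, candidate = [], 1, 2
    while product < max(m, 2):
        if all(candidate % p for p in primes):
            primes.append(candidate)
            product *= candidate
        candidate += 1
    return primes


def mismatching_classes(pattern, window, q):
    """Residues r modulo q whose subpattern differs from the text's."""
    return [r for r in range(q) if pattern[r::q] != window[r::q]]


def locate_single_mismatch(pattern, window):
    """Return the 0-based position of the mismatch if pattern and window
    (equal lengths) differ in exactly one position, else None."""
    m = len(pattern)
    primes = primes_with_product_at_least(m)
    residues = []
    for q in primes:
        bad = mismatching_classes(pattern, window, q)
        if len(bad) != 1:   # zero mismatches, or at least two
            return None
        residues.append(bad[0])
    # CRT: exactly one x < m has the recorded residue for every prime.
    for x in range(m):
        if all(x % q == r for q, r in zip(primes, residues)):
            return x
    return None
```

For a pattern of length 8 the primes are 2, 3 and 5; a single mismatch at position 3 shows up in class 1 modulo 2, class 0 modulo 3 and class 3 modulo 5, which together identify position 3.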

Page 33: Real time pattern matching Porat Benny Porat Ely Bar-Ilan University.

Pattern Matching up to k mistakes

Group testing/ Random selector…

Page 34: Real time pattern matching Porat Benny Porat Ely Bar-Ilan University.

A black box for online approximate pattern matching

Raphaël Clifford, Benny Porat, Ely Porat

CPM 2008

Page 35: Real time pattern matching Porat Benny Porat Ely Bar-Ilan University.

The idea

We split the pattern into log(m) consecutive subpatterns.

Pattern: p_1, p_2, p_3, …, p_{m-3}, p_{m-2}, p_{m-1}, p_m

P_{m/2} = p_1, p_2, p_3, …, p_{m/2}
…
P_4 = p_{m-6}, p_{m-5}, p_{m-4}, p_{m-3}
P_2 = p_{m-2}, p_{m-1}
P_1 = p_m
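One way to produce such a split is sketched below in Python. The exact boundaries are an assumption, since the slide only indicates them roughly: each piece takes half of what remains (rounded up), so a pattern whose length is a power of two yields pieces of length m/2, m/4, …, 2, 1 and one final piece of length 1. The function name is hypothetical.

```python
def split_pattern(pattern):
    """Split the pattern into consecutive pieces of roughly halving length."""
    pieces = []
    start = 0
    while start < len(pattern):
        length = (len(pattern) - start + 1) // 2   # ceil(remaining / 2)
        pieces.append(pattern[start:start + length])
        start += length
    return pieces
```

For the 8-character pattern "abcdefgh" this gives the pieces "abcd", "ef", "g" and "h".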

Page 36: Real time pattern matching Porat Benny Porat Ely Bar-Ilan University.

Bring it online

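Let's look at a subpattern of length m’, P_{m’}.

When we get to the i’th character of the text, where is P_{m’} aligned?

Conclusion 1: We need to know DIFF(P_{m’}, T(i-m’, i)) only at position i+m’ of the text.

[Figure: the pattern p_m, p_{m-1}, p_{m-2}, …, P_{m’}, … of length m aligned against the text at t_i, with offsets m’ and m’-1 marked.]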

Page 37: Real time pattern matching Porat Benny Porat Ely Bar-Ilan University.

The idea…

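For each subpattern of length m’, we partition the text into overlapping substrings of length 2m’.

[Figure: the text divided into consecutive segments of length m’; each block of length 2m’ covers two adjacent segments, so neighbouring blocks overlap.]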

Page 38: Real time pattern matching Porat Benny Porat Ely Bar-Ilan University.

The idea…

For each subpattern of length m’ we run the offline algorithm on each partition of the text separately.

This ensures that we get the difference on time.

If i = 2lm’ or i = 2lm’+m’ for some l, run the offline algorithm on the last 2m’ characters.

We will get all the differences for this section.

[Figure: the text at t_i, with the last block of length 2m’ and a segment of length m’ marked.]
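The block decomposition can be sketched in Python as follows. The sketch rests on assumptions: a naive Hamming-distance routine stands in for "the offline algorithm", the whole text is available at once (the online version would trigger each run when the block's last character arrives), and the function names are hypothetical. It shows that running the offline routine on overlapping blocks of length 2m’, stepping by m’, recovers the distance at every alignment.

```python
def offline_hamming(pattern, text):
    """Stand-in offline algorithm: Hamming distance between the pattern
    and the text at every alignment."""
    m = len(pattern)
    return [sum(a != b for a, b in zip(pattern, text[s:s + m]))
            for s in range(len(text) - m + 1)]


def blockwise_distances(pattern, text):
    """Same output as offline_hamming, computed block by block.

    Blocks have length 2m (shorter at the end of the text) and start at
    multiples of m, so every alignment lies inside some block.
    """
    m = len(pattern)
    results = {}
    for block_start in range(0, len(text) - m + 1, m):
        block = text[block_start:block_start + 2 * m]
        for offset, dist in enumerate(offline_hamming(pattern, block)):
            results[block_start + offset] = dist
    return [results[s] for s in range(len(text) - m + 1)]
```

Each block costs one offline run on 2m’ characters, and there are about n/m’ blocks, which is where the per-subpattern total of O(n T(m’)) comes from.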

Page 39: Real time pattern matching Porat Benny Porat Ely Bar-Ilan University.

Running time

T(n,m) = nT(m) is the running time of the offline algorithm.

For each subpattern of length m’ we get $\frac{n}{m'} - 1$ overlapping partitions.

Total time for each subpattern:

$O\!\left(\frac{n}{m'} \cdot 2m' \cdot T(m')\right) = O(n\,T(m'))$

Total time:

$O\!\left(\sum_{j=1}^{\log m} T(n, 2^{j-1})\right)$

Page 40: Real time pattern matching Porat Benny Porat Ely Bar-Ilan University.

The problem

We saw that overall the time is good. But:

[Figure: for m’ = m/2, two overlapping blocks of length 2m’ = m ending at text positions 2tm’+m’ and 2(t+1)m’, with P_{m/2} aligned near t_i.]

We must wait until the run of the offline algorithm on P_{m/2} and the last m characters finishes before we can return the answer. ⇒ (m/2)T(m) time!

Page 41: Real time pattern matching Porat Benny Porat Ely Bar-Ilan University.

The solution

We split the text into partitions of length 1.5m’.

[Figure: the text divided into segments of length m’, covered by overlapping blocks of length 1.5m’.]

Page 42: Real time pattern matching Porat Benny Porat Ely Bar-Ilan University.

The solution…

The latest we will get DIFF(P_{m’}, T(i-m’, i)) is at index i+m’/2.

And by Conclusion 1, we can wait another m’/2 characters before we need this difference.

Conclusion 1: We need to know DIFF(P_{m’}, T(i-m’, i)) only at position i+m’ of the text.

Page 43: Real time pattern matching Porat Benny Porat Ely Bar-Ilan University.

Spreading the work

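So, we can spread the work over the next m’/2 characters.

[Figure: the text divided into segments of length m’/2, with blocks labelled P1, P2 and P3; "work on P1", "work on P2" and "work on P3" occupy successive segments, and a marker shows where the difference of P1 is needed.]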

Page 44: Real time pattern matching Porat Benny Porat Ely Bar-Ilan University.

Spreading the work…

Overall, we can spread the work for a specific subpattern evenly among all the characters of the text.

All that is left is to check that the running time does not change.

Page 45: Real time pattern matching Porat Benny Porat Ely Bar-Ilan University.

Running time

T(n,m) = nT(m) is the running time of the offline algorithm.

For each subpattern of length m’ we now get $\frac{n}{m'/2}$ overlapping partitions.

Total time for each subpattern:

$O\!\left(\frac{n}{m'} \cdot 2m' \cdot T(m')\right) = O(n\,T(m'))$

Total time for all the text:

$O\!\left(\sum_{j=1}^{\log m} T(n, 2^{j-1})\right)$

No change!

Page 46: Real time pattern matching Porat Benny Porat Ely Bar-Ilan University.

Running Time…

By spreading the work we get a total running time for each character of

$O\!\left(\sum_{j=1}^{\log m} T(n, 2^{j-1}) / n\right)$
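As a worked instance of this per-character bound, assume (purely for illustration) an offline algorithm with total running time $T(n,m) = c\,n \log m$ for some constant $c$, with logarithms taken base 2. Substituting into the sum gives:

```latex
\begin{align*}
\sum_{j=1}^{\log m} \frac{T(n, 2^{j-1})}{n}
  &= \sum_{j=1}^{\log m} c \log 2^{j-1}
   = c \sum_{j=1}^{\log m} (j-1) \\
  &= c \cdot \frac{(\log m)(\log m - 1)}{2}
   = O(\log^2 m).
\end{align*}
```

So under this assumption the black box costs one extra logarithmic factor per character compared with the offline time divided by n.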

Page 47: Real time pattern matching Porat Benny Porat Ely Bar-Ilan University.

Conclusion

We give a space lower bound for deterministic online pattern matching.

We give a black box algorithm that can adapt any offline algorithm into an online algorithm, using only O(m) space and taking

$O\!\left(\sum_{j=1}^{\log m} T(n, 2^{j-1}) / n\right)$

time per character.