Shift-And Approach to Pattern Matching in LZW Compressed Text

Takuya KIDA, Masayuki TAKEDA, Ayumi SHINOHARA, Setsuo ARIKAWA
Department of Informatics, Kyushu University, Japan


Page 1: Shift-And Approach  to Pattern Matching in LZW Compressed Text

Shift-And Approach to Pattern Matching in LZW Compressed Text

Takuya KIDA, Masayuki TAKEDA, Ayumi SHINOHARA, Setsuo ARIKAWA

Department of Informatics, Kyushu University, Japan

Page 2: Shift-And Approach  to Pattern Matching in LZW Compressed Text

Motivation

E-mail, address book, schedule, dictionary, phone numbers, memo, electronic book, database

Storage capacity is limited! I want to store as much information as possible! I want to do pattern matching as fast as possible!

...Yes! Data compression!

...but a suffix trie is very large...

Page 3: Shift-And Approach  to Pattern Matching in LZW Compressed Text

Our goal

[Figure: the usual route decompresses the compressed text into the original text and then runs a pattern matching machine on it. Our goal is a new machine that runs on the compressed text directly.]

Page 4: Shift-And Approach  to Pattern Matching in LZW Compressed Text

Previous research

year  researchers                          compression method
1988  Eilam-Tzoreff and Vishkin            run-length
1992  Amir, Landau, and Vishkin            two-dimensional run-length
1992  Amir and Benson                      two-dimensional run-length
1994  Amir, Benson, and Farach             two-dimensional run-length
1994  Manber                               original compression scheme
1995  Farach and Thorup                    LZ77
1996  Amir, Benson, and Farach             LZW
1996  Gasieniec, et al.                    LZ77
1997  Karpinski, Rytter, and Shinohara     straight-line programs
1997  Miyazaki, Shinohara, and Takeda      straight-line programs
1997  Takeda                               finite state encoding
1998  Shibata                              byte pair encoding
1998  Fukamachi, Shinohara, and Takeda     Huffman encoding
1998  Kida, et al.                         LZW (AC automaton, DCC’98)

Page 5: Shift-And Approach  to Pattern Matching in LZW Compressed Text

Recent research

year  researchers                                  compression method
1998  de Moura, Navarro, Ziviani, and Baeza-Yates  word based encoding
1999  Kida, Takeda, Shinohara, and Arikawa         LZW (Shift-And algorithm, CPM’99)
1999  Shibata, et al.                              byte pair encoding
1999  Kida, et al.                                 dictionary based methods (collage system, SPIRE’99)
1999  Navarro and Raffinot                         LZ family (CPM’99)
1999  Shibata, Takeda, Shinohara, and Arikawa      antidictionaries (CPM’99)

Page 6: Shift-And Approach  to Pattern Matching in LZW Compressed Text

Our main results

The new algorithm scans a compressed text in O(n+r) time using O(|D|) space, and reports all occurrences of the pattern after O(m+|Σ|) time and O(|Σ|) space preprocessing.
The algorithm is about 1.3 times faster than our previous one, which simulates the AC automaton.
The algorithm is about 1.5 times faster than decompression followed by a simple search using the Shift-And algorithm.

|D| : size of the dictionary trie
n : compressed text length
m : pattern length
r : number of pattern occurrences
|Σ| : alphabet size

Page 7: Shift-And Approach  to Pattern Matching in LZW Compressed Text

Lempel-Ziv-Welch Compression

how to compress and decompress

Page 8: Shift-And Approach  to Pattern Matching in LZW Compressed Text

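Lempel-Ziv-Welch (LZW) compression

Original text:    a b ab ab ba b c aba bc abab
Compressed text:  1 2 4  4  5  2 3 6   9  11

Dictionary trie (node number: string spelled by the path from the root, node 0):
1: a, 2: b, 3: c, 4: ab, 5: ba, 6: aba, 7: abb, 8: bab, 9: bc, 10: ca, 11: abab, 12: bca

Example: the code 6 stands for the string aba.

O(|D|) = O(n)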
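The compression shown on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `lzw_compress`, the convention that single characters get the codes 1, 2, 3, ... in alphabet order, and the dictionary-of-edges representation of the trie are assumptions made for the example, and the text is assumed to use only characters of the given alphabet.

```python
def lzw_compress(text, alphabet):
    """Return the LZW code sequence of text (hypothetical helper, for illustration)."""
    # Dictionary trie as a map (parent node, character) -> child node; node 0 is the root.
    trie = {(0, ch): i for i, ch in enumerate(alphabet, start=1)}
    next_code = len(alphabet) + 1
    codes = []
    node = 0
    for ch in text:
        child = trie.get((node, ch))
        if child is not None:
            node = child                  # keep extending the current phrase
        else:
            codes.append(node)            # emit the longest phrase in the dictionary
            trie[(node, ch)] = next_code  # add phrase + ch as a new trie node
            next_code += 1
            node = trie[(0, ch)]          # the next phrase starts with ch
    if node != 0:
        codes.append(node)                # emit the final phrase
    return codes


print(lzw_compress("abababbabcababcabab", "abc"))
```

For the slide's text a b ab ab ba b c aba bc abab, the sketch yields the code sequence 1 2 4 4 5 2 3 6 9 11.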

Page 9: Shift-And Approach  to Pattern Matching in LZW Compressed Text

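How to compress a text

Original text:    a b ab ab ba b c aba bc abab
Compressed text:  1 2 4  4  5  2 3 6   9  11

[Animation: the dictionary trie grows while the text is scanned. Starting from the root 0 with children 1 (a), 2 (b), 3 (c), the nodes are added in the order 4 (edge b), 5 (a), 6 (a), 7 (b), 8 (b), 9 (c), 10 (a), 11 (b), 12 (a).]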

Page 10: Shift-And Approach  to Pattern Matching in LZW Compressed Text

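How to decompress a compressed text

Compressed text:  1 2 4  4  5  2 3 6   9  11
Original text:    a b ab ab ba b c aba bc abab

[Animation: the same dictionary trie (nodes 0 to 12) is rebuilt while the compressed text is scanned.]

Slide annotations: O(n) time, O(N) time (n: compressed text length; N is not defined on the slide, presumably the original text length).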
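As a companion to this slide, here is a hedged Python sketch of LZW decompression. The name `lzw_decompress` and the code numbering (single characters are 1, 2, 3, ... in alphabet order) are assumptions for the example. For brevity the sketch stores every dictionary string explicitly instead of a trie, so it does not meet the space bound of a trie-based decoder.

```python
def lzw_decompress(codes, alphabet):
    """Rebuild the original text from LZW codes (hypothetical helper, for illustration)."""
    strings = {i: ch for i, ch in enumerate(alphabet, start=1)}
    next_code = len(alphabet) + 1
    out = []
    prev = None
    for code in codes:
        if code in strings:
            cur = strings[code]
        else:
            # The code refers to the entry being defined in this very step:
            # it must be the previous phrase followed by its own first character.
            cur = prev + prev[0]
        if prev is not None:
            # Dictionary update: previous phrase + first character of the current one.
            strings[next_code] = prev + cur[0]
            next_code += 1
        out.append(cur)
        prev = cur
    return "".join(out)


print(lzw_decompress([1, 2, 4, 4, 5, 2, 3, 6, 9, 11], "abc"))
```

Applied to the slide's code sequence, the sketch reproduces the text a b ab ab ba b c aba bc abab (printed without the spaces).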

Page 11: Shift-And Approach  to Pattern Matching in LZW Compressed Text

Compressed Pattern Matching in LZW Compressed Text

with Shift-And approach

Page 12: Shift-And Approach  to Pattern Matching in LZW Compressed Text

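Shift-And approach to pattern matching
(Baeza-Yates and Gonnet [1992], Wu and Manber [1992])

pattern: aabac
text:    aabaacaabacab

mask bits (bit i is 1 if and only if Pattern[i] is that character):
a  11010
b  00100
c  00001

One step: shift the state by one bit, set the first bit, and AND the result with the mask bits of the next text character. For example, from state 10000 on reading a: 11000 & 11010 = 11000.

position  character  state
1         a          10000
2         a          11000
3         b          00100
4         a          10010
5         a          11000
6         c          00000
7         a          10000
8         a          11000
9         b          00100
10        a          10010
11        c          00001   Pattern was found!
12        a          10000
13        b          00000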
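The state sequence on this slide can be reproduced with a short Python sketch of the Shift-And method. The function names `shift_and` and `show` are made up for the example. One representation choice differs from the slide: the code keeps pattern position 1 in the least significant bit, and `show` prints position 1 first so that the output reads like the slide's bit strings.

```python
def shift_and(text, pattern):
    """Return (end positions of matches, state after each character); positions are 1-based."""
    m = len(pattern)
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)   # bit i-1 stands for pattern position i
    accept = 1 << (m - 1)                         # bit of pattern position m
    state = 0
    ends, trace = [], []
    for pos, ch in enumerate(text, start=1):
        # Shift, set the first bit (OR), then AND with the mask of the character read.
        state = ((state << 1) | 1) & masks.get(ch, 0)
        trace.append(state)
        if state & accept:
            ends.append(pos)
    return ends, trace


def show(state, m):
    """Format a state with pattern position 1 leftmost, as on the slide."""
    return "".join("1" if (state >> i) & 1 else "0" for i in range(m))


ends, trace = shift_and("aabaacaabacab", "aabac")
print(ends)
print([show(s, 5) for s in trace])
```

On the slide's example the only match ends at text position 11, where the state is 00001.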

Page 13: Shift-And Approach  to Pattern Matching in LZW Compressed Text

Properties of Shift-And approach

Simple, but very fast when the pattern length m is not greater than the word length of typical computers (32 or 64).
Assuming m ≤ 32 (or 64) and that bit-shift operations and bitwise logical operations on integers can be performed in constant time, it runs in O(n) time for a text of length n.
This method has many variations:
generalized pattern matching
pattern matching with k mismatches
pattern matching for multiple patterns

Page 14: Shift-And Approach  to Pattern Matching in LZW Compressed Text

Basic idea of our algorithm

pattern: aabac
text:    aabaacaabacab
mask bits: a 11010, b 00100, c 00001

[Figure: the Shift-And states over the text: 10000 11000 00100 10010 11000 00000 10000 11000 00100 10010 00001 10000 00000. Each code of the compressed text stands for a string of several characters, so the state should jump over the whole string at once. Jump! Jump!]

Can each jump be done in O(1) time?

Page 15: Shift-And Approach  to Pattern Matching in LZW Compressed Text

Basic idea of our algorithm

[Figure: the same example (pattern: aabac, text: aabaacaabacab), with the state 00001 marked "Pattern was found!".]

We need a mechanism for reporting all pattern occurrences.

Page 16: Shift-And Approach  to Pattern Matching in LZW Compressed Text

Technical details

Lemma 1 (Realization of ‘Jump’)
The state transition function can be realized in O(|D|+m) time using O(|D|) space, and returns its value in O(1) time.

Lemma 2 (Realization of ‘Output’)
The procedure which enumerates the pattern occurrences can be realized in O(|D|+m) time using O(|D|) space, and runs in O(r) time.

|D| : size of the dictionary trie
m : pattern length
r : number of pattern occurrences

Page 17: Shift-And Approach  to Pattern Matching in LZW Compressed Text

Overview of the algorithm

Input. Pattern P and LZW compressed text u1, u2, …, un.
Output. All occurrences of the pattern.

Construct mask bits from P.
Initialize the dictionary trie, $\hat{M}$, U, and V;
l := 0; S := ∅;
for i := 1 to n do begin
    for each d ∈ Output(S, ui) do
        report ‘pattern occurs at position l+d’;
    S := $\hat{f}$(S, ui);    /* Jump the state! */
    l := l + |ui|;           /* increment the offset */
    Update the dictionary trie, $\hat{M}$, U, and V;
end

Page 18: Shift-And Approach  to Pattern Matching in LZW Compressed Text

Detail of our Algorithm

Realization of Jump and Output

Page 19: Shift-And Approach  to Pattern Matching in LZW Compressed Text

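Detail of ‘Jump’

For $a \in \Sigma$, $u \in \Sigma^*$, and $S \subseteq \{1, \dots, m\}$:

$f(S, a) := ((S \oplus 1) \cup \{1\}) \cap M(a)$   (bit shift, OR, AND)
$M(a) := \{\, 1 \le i \le m \mid \mathrm{Pattern}[i] = a \,\}$

where $S \oplus k = \{\, i + k \mid i \in S \,\}$.

Example (pattern: aabac):
mask bits: M(a) = {1,2,4} = 11010, M(b) = {3} = 00100, M(c) = {5} = 00001
state S = {1,3} = 10100
state transition f(S, a): (S ⊕ 1) ∪ {1} = {1,2,4} = 11010, and 11010 & 11010 = 11010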

Page 20: Shift-And Approach  to Pattern Matching in LZW Compressed Text

Detail of ‘Jump’

$f(S, a) := ((S \oplus 1) \cup \{1\}) \cap M(a)$
$M(a) := \{\, 1 \le i \le m \mid \mathrm{Pattern}[i] = a \,\}$

For $a \in \Sigma$, $u \in \Sigma^*$, and $S \subseteq \{1, \dots, m\}$, define recursively

$\hat{f}(S, \varepsilon) := S$
$\hat{f}(S, ua) := f(\hat{f}(S, u), a)$
$\hat{M}(u) := \hat{f}(\{1, \dots, m\}, u)$

Then

$\hat{f}(S, u) = ((S \oplus |u|) \cup \{1, \dots, |u|\}) \cap \hat{M}(u)$,

which can be evaluated in O(1) time.
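To make the jump identity concrete, the following Python sketch writes the definitions with plain sets and compares the one-step jump against the character-by-character transition. The function names `f`, `f_hat`, `m_hat`, and `jump` are chosen for the example, and the shift of a set S by k is taken to mean {i + k | i in S}. The sketch favors readability over speed and does not aim for the O(1) bound.

```python
def f(S, a, pattern):
    """One Shift-And step on a set of pattern positions (1-based)."""
    m = len(pattern)
    M_a = {i for i in range(1, m + 1) if pattern[i - 1] == a}
    return ({i + 1 for i in S} | {1}) & M_a


def f_hat(S, u, pattern):
    """Extension of f to a string u, applied one character at a time."""
    for a in u:
        S = f(S, a, pattern)
    return S


def m_hat(u, pattern):
    """Mhat(u): the result of f_hat started from the full set {1, ..., m}."""
    return f_hat(set(range(1, len(pattern) + 1)), u, pattern)


def jump(S, u, pattern):
    """The claimed one-step form: shift by |u|, add {1, ..., |u|}, intersect with Mhat(u)."""
    shifted = {i + len(u) for i in S}
    return (shifted | set(range(1, len(u) + 1))) & m_hat(u, pattern)


print(sorted(jump({1}, "aba", "aabac")), sorted(f_hat({1}, "aba", "aabac")))
```

For the pattern aabac, the string aba, and the state {1}, both routes give {1, 4}, that is, 10010 in the slides' bit notation.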

Page 21: Shift-And Approach  to Pattern Matching in LZW Compressed Text

Move of $\hat{f}(S, u)$

pattern: aabac
text:    aabaacaabacab

[Animation: a jump over u = aba, for which $\hat{M}(u)$ = 10010. From the state S = 10000, shifting by |u| = 3 and setting the first three bits (111) gives 11110, and 11110 & 10010 = 10010.]

Page 22: Shift-And Approach  to Pattern Matching in LZW Compressed Text

Move of $\hat{f}(S, u)$

pattern: aabac
text:    aabaacaabacab

[Animation: a further jump in the same example, over a string u with $\hat{M}(u)$ = 00001. The AND with $\hat{M}(u)$ leaves the state 00001.]

Page 23: Shift-And Approach  to Pattern Matching in LZW Compressed Text

How to calculate $\hat{M}(u)$

$\hat{M}(ua) = \hat{f}(\{1, \dots, m\}, ua)$
$= f(\hat{f}(\{1, \dots, m\}, u), a)$
$= f(\hat{M}(u), a)$
$= ((\hat{M}(u) \oplus 1) \cup \{1\}) \cap M(a)$

[Figure: in the dictionary trie D, node ua is the child of node u along the edge labeled a; $\hat{M}(ua)$ is computed from $\hat{M}(u)$ in O(1) time.]

total: O(|D|) time and space
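The recurrence can be applied to a whole dictionary trie in one pass, since every LZW node is created after its parent. The Python sketch below is an illustration under assumptions: the names `mhat_table` and `show` are made up, the trie is given as hypothetical `parent` and `label` arrays, and the example arrays encode the trie of the earlier LZW example (1: a, 2: b, 3: c, 4: ab, 5: ba, 6: aba, 7: abb, 8: bab, 9: bc, 10: ca, 11: abab, 12: bca).

```python
def mhat_table(parent, label, pattern):
    """Compute Mhat for every trie node as a bitmask; node 0 is the root."""
    m = len(pattern)
    mask = {}
    for i, ch in enumerate(pattern):
        mask[ch] = mask.get(ch, 0) | (1 << i)     # bit i-1 stands for pattern position i
    mhat = [(1 << m) - 1]                         # Mhat(empty string) = {1, ..., m}
    for node in range(1, len(parent)):
        # Mhat(ua) = ((Mhat(u) shifted by 1) united with {1}) intersected with M(a)
        mhat.append(((mhat[parent[node]] << 1) | 1) & mask.get(label[node], 0))
    return mhat


def show(state, m):
    """Format a bitmask with pattern position 1 leftmost, as on the slides."""
    return "".join("1" if (state >> i) & 1 else "0" for i in range(m))


parent = [0, 0, 0, 0, 1, 2, 4, 4, 5, 2, 3, 6, 9]
label = [None, "a", "b", "c", "b", "a", "a", "b", "b", "c", "a", "b", "a"]
table = mhat_table(parent, label, "aabac")
print(show(table[6], 5))
```

Each node costs one shift, one OR, and one AND, which is where the O(|D|) total comes from when m fits in a machine word. Node 6 spells aba, and the printed value agrees with the 10010 shown for aba on the jump slides.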

Page 24: Shift-And Approach  to Pattern Matching in LZW Compressed Text

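How to enumerate the occurrences

$\mathrm{Output}(S, u) = \{\, 1 \le j \le |u| \mid m \in \hat{f}(S, u[1..j]) \,\}$, for $S \in 2^{\{1, \dots, m\}}$ and $u \in D$

[Figure: the state S, which stands for the prefixes of the pattern matched so far (the longest is the length-i prefix for the largest i ∈ S), followed by the string u. Two pattern occurrences end at positions 2 and 11 of u, so Output(S, u) = {2, 11}.]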

Page 25: Shift-And Approach  to Pattern Matching in LZW Compressed Text

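Realization of Output(S, u)

Two subsets U and V:

$U(u) := \{\, 1 \le i \le |u| \mid i < m \text{ and } u[1..i] = \mathrm{Pattern}[m-i+1..m] \,\}$
$V(u) := \{\, 1 \le i \le |u| \mid i \ge m \text{ and } u[i-m+1..i] = \mathrm{Pattern} \,\}$

$\mathrm{Output}(S, u) = ((m \ominus S) \cap U(u)) \cup V(u)$, where $m \ominus S = \{\, m - i \mid i \in S \,\}$

The first part, $(m \ominus S) \cap U(u)$, is dependent on S; the second part, $V(u)$, is independent of S.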
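The split of Output(S, u) into a part that depends on S and a part that does not can be checked with a small Python sketch using plain sets. All function names are made up for the example. The sketch reads Output(S, u) as the set of positions j of u at which a Shift-And run started in state S reaches pattern position m, and reads "m minus S" as {m - i | i in S}; both readings are assumptions about notation that the extracted slide text leaves incomplete. U and V are computed naively from their definitions.

```python
def output_direct(S, u, pattern):
    """Positions j of u where a run started in state S has pattern position m set."""
    m = len(pattern)
    result = set()
    for j, a in enumerate(u, start=1):
        M_a = {i for i in range(1, m + 1) if pattern[i - 1] == a}
        S = ({i + 1 for i in S} | {1}) & M_a
        if m in S:
            result.add(j)
    return result


def U(u, pattern):
    """Prefixes of u (shorter than the pattern) that are suffixes of the pattern."""
    m = len(pattern)
    return {i for i in range(1, len(u) + 1) if i < m and u[:i] == pattern[m - i:]}


def V(u, pattern):
    """End positions of complete pattern occurrences inside u."""
    m = len(pattern)
    return {i for i in range(1, len(u) + 1) if i >= m and u[i - m:i] == pattern}


def output_split(S, u, pattern):
    """The decomposed form: (m minus S) intersected with U(u), united with V(u)."""
    m = len(pattern)
    return ({m - i for i in S} & U(u, pattern)) | V(u, pattern)


print(output_direct({1, 4}, "ca", "aabac"), output_split({1, 4}, "ca", "aabac"))
```

With the pattern aabac, the state {1, 4}, and u = ca, both functions return {1}: the pattern's last character completes an occurrence at the first character of u.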

Page 26: Shift-And Approach  to Pattern Matching in LZW Compressed Text

How to calculate U(u) and V(u)

if $|ua| < m$ and $m \in \hat{M}(ua)$ then $U(ua) = U(u) \cup \{|ua|\}$
else $U(ua) = U(u)$;

V(u) can be handled in the same way as in [DCC’98].

[Figure: in the dictionary trie D, node ua is the child of node u along the edge labeled a; U(ua) and V(ua) are computed from U(u) and V(u) in O(1) time.]

total: O(|D|) time and space

Page 27: Shift-And Approach  to Pattern Matching in LZW Compressed Text

-- Is this really practical? --

But... Is it really fast?

Uhmm....

Page 28: Shift-And Approach  to Pattern Matching in LZW Compressed Text

Experimental Comparisons

◆ Method 1: Compressed text → Decompress! → Shift-And
◆ Method 2: Compressed text → Our previous algorithm (DCC’98)
◆ Method 3: Compressed text → Our new algorithm

Page 29: Shift-And Approach  to Pattern Matching in LZW Compressed Text

Experimental Comparisons

Original text: "The Brown corpus", 6.8 Mbytes
Compressed text: 3.4 Mbytes, produced with compress (UNIX command)

Language: C (with gcc compiler)
Machine: Sun SPARCstation 20 with remote disk storage
File transfer rate: 0.96 Mbyte/sec

Page 30: Shift-And Approach  to Pattern Matching in LZW Compressed Text

Experimental results

Method                             elapsed time (s)   CPU time (s)
Shift-And with decompression       8.16               7.52
Our previous algorithm (DCC’98)    7.31               6.57
New algorithm                      6.05               5.15

Elapsed time: CPU time + file I/O time.
The new algorithm is 1.3 times faster than our previous algorithm and 1.5 times faster than Shift-And with decompression.

Page 31: Shift-And Approach  to Pattern Matching in LZW Compressed Text

Experimental results

Method                             elapsed time (s)   CPU time (s)
Shift-And in original text         9.36               3.09
Shift-And with decompression       8.16               7.52
Our previous algorithm (DCC’98)    7.31               6.57
New algorithm                      6.05               5.15

Page 32: Shift-And Approach  to Pattern Matching in LZW Compressed Text

Conclusion

The proposed algorithm scans an LZW compressed text in O(n+r) time using O(|D|) space, and reports all occurrences of the pattern after O(m+|Σ|) time and O(|Σ|) space preprocessing.

We implemented the algorithm, and showed that it is approximately 1.3 times faster than our previous algorithm.

Our new algorithm has several extensions:
generalized pattern matching
pattern matching with k mismatches
pattern matching for multiple patterns