Faster Approximate String Matching over Compressed Text
By Gonzalo Navarro*, Takuya Kida†, Masayuki Takeda†,
Ayumi Shinohara†, and Setsuo Arikawa†
* Dept. of Computer Science, University of Chile
† Dept. of Informatics, Kyushu University
Contents
• Introduction
  – Motivation
  – Related works and our goal
• Our search approach on LZ78/LZW
  – Basic idea
  – Filtration technique
  – Multiple pattern matching algorithms on compressed text
• Experimental results
• Conclusion
Motivation
• Compressed pattern matching
  – Let sleeping files lie.
  – Reduce space, reduce searching time.
[Figure: conventional approach. The compressed file on secondary disk storage is transferred to memory, decompressed in memory, and then searched.]
Motivation
• Compressed pattern matching
  – Let sleeping files lie.
  – Reduce space, reduce searching time.
[Figure: compressed pattern matching. The compressed file is transferred from secondary disk storage to memory and searched directly, without decompression.]
Related Works (1)
| Year | Researcher | Compression |
|---|---|---|
| 1988 | Eilam-Tzoreff and Vishkin | run-length |
| 1992 | Amir, Landau, and Vishkin | two-dimensional run-length |
| 1992 | Amir and Benson | two-dimensional run-length |
| 1994 | Amir, Benson, and Farach | two-dimensional run-length |
| 1994 | Manber | original compression scheme |
| 1995 | Farach and Thorup | LZ77 |
| 1996 | Amir, Benson, and Farach | LZW |
| 1996 | Gąsieniec, et al. | LZ77 |
| 1997 | Karpinski, Rytter, and Shinohara | straight-line programs |
| 1997 | Miyazaki, Shinohara, and Takeda | straight-line programs |
| 1997 | Takeda | finite state encoding |
| 1998 | Shibata, et al. | byte pair encoding |
| 1998 | Miyazaki, et al. | Huffman encoding |
| 1998 | Kida, et al. | LZ78/LZW |
| 1998 | Moura, et al. | word-based encoding |
Related Works (2)
| Year | Researcher | Compression |
|---|---|---|
| 1999 | Shibata, et al. | antidictionary based |
| 1999 | Kida, et al. | LZ78/LZW |
| 1999 | Navarro and Raffinot | LZ family, hybrid LZ |
| 1999 | Kida, et al. | dictionary-based methods (collage system) |
| 1999 | Gąsieniec and Rytter | LZW |
| 2000 | Shibata, et al. | collage systems |
| 2000 | Kärkkäinen, Navarro, and Ukkonen | LZ family |
| 2000 | Matsumoto, et al. | simple collage systems |
| 2000 | Navarro and Tarhio | LZ family |
| 2000 | Klein and Shapira | LZSS variant |
| 2001 | Klein and Shapira | Huffman encoding |
Approximate String Matching
• Edit distance ed(P, P’)
  – Insertions, deletions and replacements
• Report all occurrences of any string P’ such that ed(P, P’) ≤ k, for a given pattern P and error threshold k.
• Survey paper: G. Navarro. A guided tour to approximate string matching. ACM Computing Surveys, 2000.
Example (k = 2).
Text: ACCCTGTTTAGATCACGGCACTACTGTAAAC
Pattern: TAAATCACGGCATACT
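To make the definition concrete, here is a minimal Python sketch of the edit distance using the textbook dynamic programming recurrence. The function name `edit_distance` and the choice of the text window `text[8:25]` are assumptions made for illustration; the slide does not say which text area it highlights.

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions and replacements turning a into b."""
    prev = list(range(len(b) + 1))      # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]                       # deleting the first i characters of a
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # replace or match
        prev = cur
    return prev[-1]


pattern = "TAAATCACGGCATACT"
text = "ACCCTGTTTAGATCACGGCACTACTGTAAAC"
window = text[8:25]                     # hypothetical candidate area
print(window, edit_distance(pattern, window))
```

With k = 2, this window counts as an occurrence because its distance to the pattern does not exceed k.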
Previous Results
• J. Kärkkäinen, G. Navarro, and E. Ukkonen. Approximate string matching over Ziv-Lempel compressed text. In Proc. CPM2000.
  – Dynamic programming technique
  – O(mkn + R) worst case, O(k^2 n + R) average case
• T. Matsumoto, T. Kida, M. Takeda, A. Shinohara, and S. Arikawa. Bit-parallel approach to approximate string matching in compressed texts. In Proc. SPIRE2000.
  – Bit-parallel technique
  – O(mk^3 n/w) worst case
Our Search Approach on LZ78/LZW
• Introduction
  – Motivation
  – Related works and our goal
• Our search approach on LZ78/LZW
  – Basic idea
  – Multiple pattern matching algorithms on compressed text
• Experimental results
• Conclusion
Basic Idea
• Filtration technique (Wu and Manber, 1992)
  – Split the pattern into k+1 pieces of (nearly) equal length
  – Any occurrence with at most k errors must contain at least one piece unaltered
  – Find the pattern pieces with multiple pattern matching
  – Verify each candidate text area directly (we have chosen Myers’ algorithm)
Example (k = 2).
Text: ACCCTGTTTAGATCACGGCACTACTGTAAAC
Pattern: TAAATCACGGCATACT
Pattern pieces: TAAAT, CACGG, CATACT
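The filtration steps can be sketched on uncompressed text as follows. This is an illustration under stated assumptions, not the authors' implementation: the pieces are located with `str.find` rather than a multiple pattern matcher over LZ78/LZW, and candidate areas are verified with Sellers' plain dynamic programming instead of Myers' bit-parallel algorithm. All function names are hypothetical.

```python
def split_pattern(pattern, k):
    """Split pattern into k+1 contiguous pieces; returns (offset, piece) pairs."""
    base, extra = divmod(len(pattern), k + 1)
    pieces, start = [], 0
    for i in range(k + 1):
        length = base + (1 if i >= k + 1 - extra else 0)  # last pieces get the remainder
        pieces.append((start, pattern[start:start + length]))
        start += length
    return pieces


def sellers_ends(pattern, text, k):
    """End positions (exclusive) of substrings of text within distance k of pattern."""
    m = len(pattern)
    col = list(range(m + 1))
    ends = []
    for j, c in enumerate(text, 1):
        diag, col[0] = col[0], 0          # a match may start at any text position
        for i in range(1, m + 1):
            cur = min(col[i] + 1, col[i - 1] + 1, diag + (pattern[i - 1] != c))
            diag, col[i] = col[i], cur
        if col[m] <= k:
            ends.append(j)
    return ends


def find_all(piece, text):
    """Yield every (possibly overlapping) start position of piece in text."""
    pos = text.find(piece)
    while pos != -1:
        yield pos
        pos = text.find(piece, pos + 1)


def filtered_search(pattern, text, k):
    """Filtration: exact search for pieces, then verification of candidate areas."""
    m, ends = len(pattern), set()
    for offset, piece in split_pattern(pattern, k):
        for pos in find_all(piece, text):
            lo = max(0, pos - offset - k)              # candidate area around the piece
            hi = min(len(text), pos - offset + m + k)
            ends.update(lo + e for e in sellers_ends(pattern, text[lo:hi], k))
    return sorted(ends)


pattern = "TAAATCACGGCATACT"
text = "ACCCTGTTTAGATCACGGCACTACTGTAAAC"
print([piece for _, piece in split_pattern(pattern, 2)])
print(filtered_search(pattern, text, 2))
```

Only the piece CACGG occurs exactly in this text, so a single candidate area is verified instead of the whole text.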
Why LZ78/LZW?
• We have already developed a multiple pattern matching algorithm on LZW.
• Easy to decompress locally.
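One way to see why local decompression is easy: in LZ78 every phrase is an earlier phrase plus one character, so a single phrase can be expanded by following back references without decoding the rest of the text. The sketch below assumes a simple list of `(reference, character)` pairs as the compressed form; the names `lz78_compress` and `expand_phrase` are hypothetical, and real LZ78/LZW coders use a more compact encoding.

```python
def lz78_compress(text):
    """Parse text into LZ78 blocks (ref, char); ref 0 denotes the empty phrase."""
    trie = {}                 # (parent phrase id, char) -> phrase id
    blocks, node = [], 0
    for c in text:
        nxt = trie.get((node, c))
        if nxt is not None:
            node = nxt                        # keep extending the current phrase
        else:
            blocks.append((node, c))          # new phrase = known phrase + c
            trie[(node, c)] = len(blocks)     # phrase ids are 1-based block indices
            node = 0
    if node:
        blocks.append((node, ""))             # flush an unfinished final phrase
    return blocks


def expand_phrase(blocks, i):
    """Expand phrase i (1-based) by walking its back references only."""
    chars = []
    while i:
        ref, c = blocks[i - 1]
        chars.append(c)
        i = ref
    return "".join(reversed(chars))


blocks = lz78_compress("CTTAATTAAGCCCCCTGCTAAGCT")
print(blocks)
print(expand_phrase(blocks, 9))   # expands one phrase without touching the others
```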
Multiple Pattern Matching Algorithms on Compressed Text
• Aho-Corasick technique
• Boyer-Moore technique
• Bit parallel technique
Aho-Corasick Technique
• T. Kida, M. Takeda, A. Shinohara, M. Miyazaki, and S. Arikawa, Multiple pattern matching in LZW compressed text. In Proc. DCC’98.
• Simulate the AC machine
• Running over LZW directly
• O(m^2 + n + R) time, O(m^2 + n) space
Aho-Corasick Technique
[Figure: the Aho-Corasick machine (states 0 to 6, goto function and failure function) for the patterns TTAA and AA, run over the compressed text b1 b2 ... b7, whose original text is ...CTTAATTAAGCCCCCTGCTAAGCT. The figure shows the state transition made for each block and the pattern occurrences found (TTAA and AA; AA).]
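The idea of running the AC machine block by block can be sketched as follows. This is a simplified illustration, not the DCC’98 algorithm: it memoizes, for each (state, phrase) pair, the state reached and the occurrences found inside the phrase, so it does not achieve the stated time and space bounds. The block list is a hypothetical LZ78-style parse of the slide's original text, and every function name is an assumption.

```python
from collections import deque


def build_ac(patterns):
    """Build goto, failure and output functions of the Aho-Corasick machine."""
    goto, fail, out = [{}], [0], [set()]
    for p in patterns:
        s = 0
        for c in p:
            if c not in goto[s]:
                goto.append({}); fail.append(0); out.append(set())
                goto[s][c] = len(goto) - 1
            s = goto[s][c]
        out[s].add(p)
    queue = deque(goto[0].values())
    while queue:                              # breadth-first failure computation
        s = queue.popleft()
        for c, t in goto[s].items():
            f = fail[s]
            while f and c not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(c, 0)
            out[t] |= out[fail[t]]
            queue.append(t)
    return goto, fail, out


def step(goto, fail, s, c):
    """One ordinary AC transition from state s on character c."""
    while s and c not in goto[s]:
        s = fail[s]
    return goto[s].get(c, 0)


def search_blocks(blocks, patterns):
    """Report (end position, pattern) pairs while reading whole blocks at a time."""
    goto, fail, out = build_ac(patterns)
    length = [0]                              # length[i] = length of phrase i
    for ref, c in blocks:
        length.append(length[ref] + len(c))
    memo = {}                                 # (state, phrase) -> (state, hits)

    def jump(s, i):
        if i == 0:
            return s, []
        if (s, i) not in memo:
            ref, c = blocks[i - 1]
            t, hits = jump(s, ref)            # reuse the result for the prefix phrase
            if c:
                t = step(goto, fail, t, c)
                hits = hits + [(length[i], p) for p in sorted(out[t])]
            memo[(s, i)] = (t, hits)
        return memo[(s, i)]

    state, pos, result = 0, 0, []
    for i in range(1, len(blocks) + 1):
        state, hits = jump(state, i)
        result += [(pos + off, p) for off, p in hits]
        pos += length[i]
    return result


def decode(blocks):
    """Plain decompression, used here only to check the result."""
    phrases = [""]
    for ref, c in blocks:
        phrases.append(phrases[ref] + c)
    return "".join(phrases[1:])


# Hypothetical parse of CTTAATTAAGCCCCCTGCTAAGCT into (reference, character) blocks.
blocks = [(0, "C"), (0, "T"), (2, "A"), (0, "A"), (2, "T"), (4, "A"), (0, "G"),
          (1, "C"), (8, "C"), (2, "G"), (1, "T"), (6, "G"), (11, "")]
print(search_blocks(blocks, ["TTAA", "AA"]))
```

End positions are exclusive offsets into the original text, so each occurrence of a pattern p ending at position e covers the characters from e - len(p) up to e - 1.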
Boyer-Moore Technique
• G. Navarro and J. Tarhio, Boyer-Moore string matching over Ziv-Lempel compressed text. In Proc. CPM2000.
• Y. Shibata, T. Matsumoto, M. Takeda, A. Shinohara, and S. Arikawa, A Boyer-Moore type algorithm for compressed pattern matching. In Proc. CPM2000.
• T. Kida et al., Multiple Pattern Matching Algorithms on Collage System. In Proc. CPM2001, to appear.
Boyer-Moore Technique
1. Find all occurrences that end in the focused block.
2. Calculate the maximum safe shift.
3. Move the focus according to that shift.
[Figure: the compressed text b1 b2 ... b7 over the original text ...CTTAATTAAGCCCCCTGCTAAGCT, with the pattern occurrences.]
Bit Parallel Technique
• G. Navarro and M. Raffinot, A general practical approach to pattern matching over Ziv-Lempel compressed text. In Proc. CPM’99.
• T. Kida, M. Takeda, A. Shinohara, M. Miyazaki, and S. Arikawa, Shift-And approach to pattern matching in LZW compressed text. In Proc. CPM’99.
Bit Parallel Technique
Compressed text: ... bi-1 bi bi+1 ...
Focused phrase: AAGTTAACTTAAGCCGTT
Pattern: TTAA
Bit vectors:
(i) Pattern suffixes := 110000000000000000
(ii) Occurrences inside block bi := 000000100001000000
(iii) Pattern prefixes := 000000000000000011
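The three bit vectors for the focused phrase can be reproduced with a short sketch. It uses the classic Shift-And algorithm for the occurrences inside the phrase and direct string comparisons for the suffix and prefix vectors; this is an assumption made for clarity, not how the cited papers compute or store these vectors, and the function names are hypothetical.

```python
def shift_and(pattern, text):
    """Classic Shift-And: end positions (exclusive) of exact occurrences."""
    m = len(pattern)
    mask = {}
    for i, c in enumerate(pattern):
        mask[c] = mask.get(c, 0) | (1 << i)   # bit i set where pattern[i] == c
    state, ends = 0, []
    for j, c in enumerate(text, 1):
        state = ((state << 1) | 1) & mask.get(c, 0)
        if state & (1 << (m - 1)):            # highest bit set: full match
            ends.append(j)
    return ends


def phrase_vectors(pattern, phrase):
    """Return the vectors (i), (ii), (iii) as strings, leftmost bit = position 0."""
    n, m = len(phrase), len(pattern)
    ends = set(shift_and(pattern, phrase))
    # (i) positions where a proper suffix of the pattern ends, starting at the phrase start
    suffixes = "".join("1" if j + 1 < m and pattern.endswith(phrase[:j + 1]) else "0"
                       for j in range(n))
    # (ii) positions where a full occurrence of the pattern ends
    inside = "".join("1" if j + 1 in ends else "0" for j in range(n))
    # (iii) positions where a proper prefix of the pattern starts and runs to the phrase end
    prefixes = "".join("1" if n - j < m and pattern.startswith(phrase[j:]) else "0"
                       for j in range(n))
    return suffixes, inside, prefixes


for vector in phrase_vectors("TTAA", "AAGTTAACTTAAGCCGTT"):
    print(vector)
```

Vectors (i) and (iii) are what allow occurrences that cross block boundaries to be detected by combining a phrase with its neighbours.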
Experimental Results
• Introduction
  – Motivation
  – Related works and our goal
• Our search approach on LZ78/LZW
  – Basic idea
  – Multiple pattern matching algorithms on compressed text
• Experimental results
• Conclusion
Experimental Results
• Machine: Intel Pentium III at 550 MHz with 64 MB of RAM, running Linux
• Texts: 10 MB of Wall Street Journal (WSJ) articles and 10 MB of DNA data
• Compression: WSJ was compressed to 42.59% of its original size and DNA to 27.71%
Conclusion
• We applied the filtration technique to compressed texts.
• We implemented two new multiple pattern matching algorithms on compressed text.
  – Boyer-Moore type and bit-parallel type.
• We showed that this is a practical solution for approximate pattern matching on compressed text.
  – 10-30 times faster than previous solutions.
  – Up to 3 times faster than decompressing plus searching.