北海道大学 Hokkaido University
Lecture on Information Knowledge Network
"Information retrieval and pattern matching"
Laboratory of Information Knowledge Network, Division of Computer Science,
Graduate School of Information Science and Technology, Hokkaido University
Takuya KIDA
Lecture on Information Knowledge Network, 2011/11/22

The 6th: Pattern matching on compressed text
– About data compression
– Motivation and aim of this study
– Pattern matching on Huffman encoded text
– Pattern matching on LZW compressed text
– Unified framework: Collage system
– Aspect of speeding-up of pattern matching by text compression: BPE compression

About data compression
Lossless compression
– Entropy encoding: Huffman encoding, Arithmetic encoding
– Non-universal encoding: Run-length
– Universal encoding
    Dictionary-based: LZ77, LZ78, LZW
    Sort-based: BWT
    Grammar-based: Sequitur, BPE
    Statistical: PPM
Lossy compression (used for image and voice data)
– JPEG, MPEG, MP3
※ Reference: I. H. Witten, A. Moffat, T. C. Bell: Managing Gigabytes: Compressing and Indexing Documents and Images, Morgan Kaufmann Pub, 1999.

Aim of this study
[Figure: Ordinarily, the compressed text is decompressed into the original text, which is then scanned by an ordinary pattern matching machine. The aim of this study is a pattern matching machine for compressed texts that scans the compressed text directly.]

Example of application
– E-mails
– Directories
– Schedule tables
– E2J/J2E dictionaries
– Business cards
– Short memos
– E-books
– KOJIEN
– Personal databases
We want to pack as much data as possible into a small computer such as a mobile phone or a PDA!
Because memory is limited, constructing an extra index structure is not a good solution!
However, we still want to retrieve at high speed!
※ The photos show a Sharp mi110 and a Toshiba V601T.

Difficulty of PM on compressed texts
[Figure: document files (a long passage of sample text) and the corresponding compressed document files, shown as a bit string:]
011110000111100111111101011010001010101001111010001011100110101111011000111011111101001101011111001101001110011011000001111110101101011111111100000101001001010011010
1. The starting position of each codeword is invisible.
2. The representation of each string is not unique.

Our goal
– Search-without-decompress method
– Search-on-the-fly method
– Decompress-then-search method
Goal: Do pattern matching faster than the above!
※ The illustration in the figure above was drawn by Prof. Masayuki Takeda.

Lempel-Ziv-Welch (LZW) compression
Text T: a b ab ab ba b c aba bc abab
Compressed text E(T): 1 2 4 4 5 2 3 6 9 11
※ LZW is used for the UNIX compress command, the GIF image format, and so on.
T. A. Welch: A technique for high performance data compression, IEEE Comput., Vol.17, pp.8-19, 1984.
Let D be the set of strings entered in the dictionary trie.
D = {a, b, c, ab, ba, bc, ca, aba, abb, bab, bca, abab}
|D| = O(compressed text length)
D is constructed adaptively.
Dictionary trie (node number: string spelled out from the root)
  0: ε (root)
  1: a    2: b    3: c
  4: ab   5: ba   9: bc   10: ca
  6: aba  7: abb  8: bab  12: bca
  11: abab
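The LZW example on this slide can be reproduced with a short sketch. This is only an illustration under assumptions: the function name `lzw_compress`, the choice to number codes from 1 (so that they agree with the trie node numbers on the slide), and the use of a Python dict instead of an explicit trie are not from the lecture.

```python
def lzw_compress(text, alphabet):
    """Return (codes, dictionary) for an LZW parse of `text`.

    Codes start at 1 so that they match the node numbers on the slide.
    A dict keyed by phrase stands in for the dictionary trie (assumption).
    """
    dictionary = {ch: i + 1 for i, ch in enumerate(alphabet)}
    codes = []
    current = ""
    for ch in text:
        if current + ch in dictionary:
            current += ch                                   # extend the current phrase
        else:
            codes.append(dictionary[current])               # emit the longest known phrase
            dictionary[current + ch] = len(dictionary) + 1  # enter a new node into D
            current = ch
    if current:
        codes.append(dictionary[current])                   # flush the last phrase
    return codes, dictionary


# T = a b ab ab ba b c aba bc abab, written without the phrase boundaries.
codes, dictionary = lzw_compress("abababbabcababcabab", "abc")
print(codes)
print(sorted(dictionary, key=dictionary.get))
```

The printed code sequence is E(T), and the phrases listed in code order are the strings of D, which shows how D grows by one entry per emitted code.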

Move of Aho-Corasick PM machine
AC machine for pattern set Π = {aba, ababb, abca, bb}
[Figure: AC machine with states 0 to 9.
  goto function: 0 -a-> 1 -b-> 2 -a-> 3 -b-> 4 -b-> 5;  2 -c-> 6 -a-> 7;  0 -b-> 8 -b-> 9
  output: state 3 {aba}, state 5 {ababb, bb}, state 7 {abca}, state 9 {bb}
  The legend distinguishes the goto function, the failure function, and the output { }.]
Text:          a b a b a b b a
Current state: 0 1 2 3 4 3 4 5 1
Output: aba, ababb
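To make the state sequence on this slide concrete, here is a minimal sketch of an Aho-Corasick machine. The function names (`build_ac`, `step`, `run`) and the table layout are assumptions for illustration; states are numbered in insertion order, which happens to agree with the numbering on the slide when the patterns are inserted in the order aba, ababb, abca, bb.

```python
from collections import deque


def build_ac(patterns):
    """Build the goto, failure and output tables of an AC machine."""
    goto, output = [{}], [set()]
    for p in patterns:
        state = 0
        for ch in p:
            if ch not in goto[state]:
                goto.append({})
                output.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].add(p)
    failure = [0] * len(goto)
    queue = deque(goto[0].values())          # depth-1 states fail to the root
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = failure[s]
            while f and ch not in goto[f]:
                f = failure[f]
            failure[t] = goto[f].get(ch, 0)
            output[t] |= output[failure[t]]  # inherit outputs along the failure link
    return goto, failure, output


def step(goto, failure, state, ch):
    """One transition of the machine (follow failure links, then goto)."""
    while state and ch not in goto[state]:
        state = failure[state]
    return goto[state].get(ch, 0)


def run(patterns, text):
    """Return the visited states and the (end position, pattern) occurrences."""
    goto, failure, output = build_ac(patterns)
    state, states, hits = 0, [0], []
    for pos, ch in enumerate(text, start=1):
        state = step(goto, failure, state, ch)
        states.append(state)
        hits.extend((pos, p) for p in sorted(output[state]))
    return states, hits


states, hits = run(["aba", "ababb", "abca", "bb"], "abababba")
print(states)
print(hits)
```

Positions in `hits` are 1-based end positions of the occurrences in the text.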

Idea for doing pattern matching on LZW texts
To simulate the move of the AC machine on LZW compressed texts.
[Figure: the AC machine for Π = {aba, ababb, abca, bb}, run on the compressed text one phrase at a time.]
Text:          a b ab ab ba
Comp. text:    1 2 4 4 5
Current state: 0 1 2 4 4 1
Output: aba, ababb
T. Kida, M. Takeda, A. Shinohara, M. Miyazaki, and S. Arikawa: Multiple pattern matching in LZW compressed text, Proc. Data Compression Conference, pp. 103-112, IEEE Computer Society, Mar. 1998.

Core functions: Jump & Output
Can we compute the two functions Jump and Output efficiently?
– function Jump(q, u):
    simulates the consecutive transitions caused by string u in O(1) time.
    The domain is Q×D. Returns the state number of the AC machine.
– function Output(q, u):
    reports the occurrences within the string obtained by concatenating the string corresponding to state q and string u in O(r) time.
    The domain is Q×D. Returns the set of pattern IDs.
A naïve implementation needs O(m|D|) space.
It can be realized in O(m² + |D|) space!

function Jump
Let δ(q, u) be the (extended) state transition function ※1 of the AC machine.
Jump(q, u) = δ(q, u)   if u is a factor of some pattern,   [O(m³) space]
             δ(ε, u)   otherwise.                          [O(|D|) space]
Space-saving version:
Jump(q, u) = Ancestor(N'1(q, u'), |u'| − |u|)   if u is a factor of some pattern,   [O(m²) space ※2]
             δ(ε, u)                            otherwise.                          [O(|D|) space]
※1 δ(q, u) returns the state reached from state q after making the transitions caused by string u.
※2 u' is the string corresponding to the nearest ancestor node of u that is also explicit on the generalized suffix trie for Π.

function Output
ũ: the longest prefix of u that is also a suffix of a pattern.
A(u) = { 〈i, p〉 | p ∈ Π, |ũ| < i ≤ |u|, |p| ≤ i, and u[i − |p| + 1...i] = p }
Output(q, u) = Output(q, ũ) ∪ A(u)
Space: O(m²) for Output(q, ũ), O(|D|) for A(u).
Note that state q corresponds to a prefix of some pattern.
[Figure: occurrences p1 and p2 in the concatenation of the string for state q and u. Occurrences that end within ũ are covered by Output(q, ũ); occurrences that end after ũ lie entirely inside u and are covered by A(u).]

Pseudo code of Kida, et al. [1998]'s algorithm
PMonLZW(E(T) = u1 u2 … un, Π: pattern set)
  Construct the AC machine and the generalized suffix trie for Π;
  Initialize the dictionary trie for E(T);
  Preprocess Jump(q, u) and Output(q, u) for any q ∈ Q and u ∈ {factors of a pattern π ∈ Π};
  l ← 0;
  q ← q0;
  for i ← 1…n do
    for each 〈d, π〉 ∈ Output(q, ui) do
      report pattern π occurs at position l + d;
    end of for
    q ← Jump(q, ui);
    l ← l + |ui|;
    Update the dictionary trie;
      /* enter the string for node ui+1 into D */
    Update variables for Jump(q, ui+1) and Output(q, ui+1);
      /* compute δ(ε, ui+1), A(ui+1), ui+1', and |ui+1| by using its parent info. */
  end of for

The result of Kida, et al. [1998]
The original idea is from
– A. Amir, G. Benson, and M. Farach: Let sleeping files lie: Pattern matching in Z-compressed files, J. Computer and System Sciences, Vol.52, pp.299-307, 1996.
  It simulates KMP on LZW compressed texts.
By simulating the Aho-Corasick (AC) pattern matching machine, we can do multiple pattern matching.
It takes O(m² + |D|) time and space for preprocessing.
It scans compressed texts in O(n + r) time with O(m² + |D|) space for multiple patterns, and reports all the occurrences.
※ This first appeared in "T. Kida, M. Takeda, A. Shinohara, M. Miyazaki, and S. Arikawa: Multiple pattern matching in LZW compressed text, Proc. Data Compression Conference, pp. 103-112, IEEE Computer Society, Mar. 1998." The journal version appeared in "T. Kida, M. Takeda, A. Shinohara, M. Miyazaki, and S. Arikawa: Multiple Pattern Matching in LZW Compressed Text, Journal of Discrete Algorithms, 1(1), pp. 133-158, Hermes Science Publishing, Dec. 2000."

Idea for applying bit-parallel technique
[Figure: Shift-And state vectors for pattern P := aabac while scanning text T := aabaacaabacab. The state after each text character is a 5-bit vector (10000, 11000, 00100, 10010, 11000, 00000, 10000, 11000, 00100, 10010, 00001, ...). On the LZW compressed text the state is updated once per phrase by using mask bits ("Jump!") instead of once per character.]
T. Kida, M. Takeda, A. Shinohara, and S. Arikawa: Shift-And approach to pattern matching in LZW compressed text, Proc. CPM'99, LNCS1645, pp. 1-13, Springer-Verlag, Jul. 1999.

Extended state updating function f'
For any a ∈ Σ, u ∈ Σ*, S ⊆ {1, …, m}, we define as follows.
– M(a) = { 1 ≤ i ≤ m | P[i] = a }
– f(S, a) = ((S ⊕ 1) ∪ {1}) ∩ M(a)
– f'(S, ε) = S and f'(S, ua) = f(f'(S, u), a)
– M'(u) = f'({1, …, m}, u)   [O(|D|) time and space]
Then, for any u ∈ Σ*, S ⊆ {1, …, m}, the following holds.
– f'(S, u) = ((S ⊕ |u|) ∪ {1, …, |u|}) ∩ M'(u)   [O(1) time]
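The identity for f' can be checked numerically with a small sketch. Assumptions: state sets are stored in Python integers with position i kept in bit i−1 (least significant bit first, the reverse of the left-to-right vectors drawn on the slides), S ⊕ k is taken to mean shifting every position of S by k and discarding positions beyond m, and the function names are hypothetical.

```python
def make_masks(pattern):
    """M(a): bit i-1 is set when P[i] == a (positions are 1-based on the slide)."""
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)
    return masks


def f(state, ch, masks):
    """One-character update f(S, a) = ((S + 1) | {1}) & M(a)."""
    return ((state << 1) | 1) & masks.get(ch, 0)


def f_ext(state, u, masks):
    """Character-by-character definition of f'(S, u)."""
    for ch in u:
        state = f(state, ch, masks)
    return state


def f_jump(state, u_len, m, m_prime):
    """Closed form f'(S, u) = ((S + |u|) | {1..|u|}) & M'(u), one step per phrase."""
    full = (1 << m) - 1
    return (((state << u_len) | ((1 << u_len) - 1)) & full) & m_prime


pattern = "aabac"
m = len(pattern)
masks = make_masks(pattern)
full = (1 << m) - 1

# Compare both ways of computing f'(S, u) for every state S and a few phrases u.
phrases = ["", "a", "aa", "ab", "ba", "ac", "aab", "bac", "aabac", "caabaca"]
agree = all(
    f_jump(s, len(u), m, f_ext(full, u, masks)) == f_ext(s, u, masks)
    for s in range(full + 1)
    for u in phrases
)
print(agree)
print(bin(f_ext(0, "aabac", masks)))   # bit 4 set: position m = 5, i.e. a match
```

In a compressed matcher M'(u) would be stored once per dictionary node, so that `f_jump` costs a constant number of word operations per code when m fits in a machine word.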

function Output (Bit-parallel type)
Definition:
– Output(S, u) = { 1 ≤ j ≤ |u| | m ∈ f'(S, u[1..j]) }
– U(u) = { 1 ≤ i ≤ |u| | i < m and u[1..i] = P[m−i+1..m] }   [O(|D|) time and space]
– A(u) = { 1 ≤ i ≤ |u| | m ≤ i and u[i−m+1..i] = P }   [O(|D|) time and space]
Then
– Output(S, u) = ((m ⊖ S) ∩ U(u)) ∪ A(u)
[Figure: an occurrence of P that straddles the boundary between the text read so far (state S) and u is captured by (m ⊖ S) ∩ U(u); an occurrence of P lying inside u is captured by A(u).]

The result of Kida, et al. [1999]
We applied the bit-parallel technique based on the Shift-And method to the processing of the functions Jump and Output to speed them up.
It uses O(m + |Σ|) time and space for preprocessing. For a given pattern, it scans a given compressed text in O(n + r) time and O(m + |D|) space, and it reports all the occurrences.
Like the Shift-And method, it is easy to extend:
– pattern matching for a generalized pattern
– pattern matching allowing k mismatches
– multiple pattern matching

Achievement of our aim!
[Figure: CPU time (sec., 0 to 1.4) versus pattern length (5 to 30) on Genbank (DNA base sequence, 17.1 Mbyte), measured on an AlphaStation XP1000 (Alpha21264: 667MHz), Tru64 UNIX V4.0F. Search-on-the-fly method: compress (LZW) + KMP and gunzip (LZ77) + KMP. Search-without-decompress method: T. Kida, et al. [1998] and the speed-up by bit-parallelism [1999].]

Take a breath
[Photo: RG Gundam 1/1 (@Higashi-Shizuoka Park), 2010.12.24]

Why do you need compressed PM?
"We have enough storage space now. Why do you compress small data like text documents?"
If … (the time for doing pattern matching on the original text) > (the time for doing compressed pattern matching), these objections no longer hold.
A new goal! (Goal 2)

A new goal! (Goal 2)
[Figure: the same plot as for the first goal: CPU time (sec., 0 to 1.4) versus pattern length (5 to 30) on Genbank (DNA base sequence, 17.1 Mbyte), AlphaStation XP1000 (Alpha21264: 667MHz), Tru64 UNIX V4.0F, with one more curve added: matching by KMP on the original text, which is overwhelmingly faster than the search-on-the-fly methods (compress (LZW) + KMP, gunzip (LZ77) + KMP) and the search-without-decompress methods (T. Kida, et al. [1998], speed-up by bit-parallelism [1999]).]

Byte Pair Encoding (BPE) method
Text: ABABCDEBDEFABDEABC (size 18)
After substitutions:
  G → AB: GGCDEBDEFGDEGC
  H → DE: GGCHBHFGHGC
  I → GC: GIHBHFGHI (size 9)
Dictionary: G → AB, H → DE, I → GC
Size: 256 = 1 byte (each symbol, original or newly introduced, fits in one byte)
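The substitution steps of this BPE example can be reproduced with a short sketch. Assumptions: the caller supplies the unused symbols (here G, H, I) instead of the code searching for free byte values, pairs are counted with overlaps, and ties between equally frequent pairs are broken by whichever pair was seen first; the function names are hypothetical.

```python
from collections import Counter


def bpe_compress(text, fresh_symbols):
    """Repeatedly replace the most frequent adjacent pair with an unused symbol."""
    dictionary = {}
    for symbol in fresh_symbols:
        pairs = Counter(text[i:i + 2] for i in range(len(text) - 1))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:                      # no pair repeats: nothing left to gain
            break
        dictionary[symbol] = pair
        text = text.replace(pair, symbol)  # left-to-right, non-overlapping
    return text, dictionary


def bpe_decompress(text, dictionary):
    """Undo the substitutions in the reverse of the order they were made."""
    for symbol in reversed(list(dictionary)):
        text = text.replace(symbol, dictionary[symbol])
    return text


original = "ABABCDEBDEFABDEABC"
compressed, dictionary = bpe_compress(original, "GHI")
print(compressed, dictionary)
print(bpe_decompress(compressed, dictionary) == original)
```

Because every symbol of the compressed text still occupies exactly one byte, codeword boundaries stay visible, which is one reason this format is convenient for pattern matching.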

Achievement of Goal 2
[Figure: CPU time (sec., 0.0 to 0.8) versus pattern length (5 to 30) on Medline (English text, 60.3 Mbyte), measured on an AlphaStation XP1000 (Alpha21264: 667MHz), Tru64 UNIX V4.0F. Curves: matching by KMP on the original text; Agrep on the original text; compressed PM on BPE (KMP type), search-without-decompress method; compressed PM on BPE (BM type), Shibata, et al. (2000), search-without-decompress method. Annotation: "The fastest among the previous methods".]

Summarize the above…
[Figure: a race to the GOAL between ordinary pattern matching on the original uncompressed text and compressed pattern matching for LZSS, for BPE, and for LZW on the texts compressed by each method, ranked 1 to 4. The compression methods are labeled low, medium, and high compression. BPE gives low compression, …but it's the most suitable for PM!]

Paradigm shift 1
Develop pattern matching algorithms for each compression method.
Choosing a suitable compression method enables us to accelerate pattern matching!
Develop a novel compression method which is suitable for pattern matching!

Data compression methods for PM
Dense coding type
– [ETDC] Nieves R. Brisaboa, Eva Lorenzo Iglesias, Gonzalo Navarro, and Jose R. Parama: An efficient compression code for text databases, In ECIR2003, pp. 468-481, 2003.
– [SCDC] Nieves R. Brisaboa, Antonio Farina, Gonzalo Navarro, and Maria F. Esteller: (s, c)-dense coding: An optimized compression code for natural language text databases, In SPIRE2003, pp. 122-136, 2003.
– [FibC] Shmuel Tomi Klein and Miri Kopel Ben-Nissan: Using fibonacci compression codes as alternatives to dense codes, In DCC2008, pp. 472-481, 2008.
– [SVVC] Nieves R. Brisaboa, Antonio Farina, Juan-Ramon Lopez, Gonzalo Navarro, and Eduardo R. Lopez: A new searchable variable-to-variable compressor, In DCC2010, pp. 199-208, 2010.
VF coding type (including grammar-based compressions)
– [BPEX] Shirou Maruyama, Yohei Tanaka, Hiroshi Sakamoto, and Masayuki Takeda: Context-sensitive grammar transform: Compression and pattern matching, In SPIRE2008, LNCS5280, pp. 27-38, Nov. 2008.
– [DynC] Shmuel T. Klein and Dana Shapira: Improved variable-to-fixed length codes, In SPIRE2008, pp. 39-50, 2009.
– [STVF] Takashi Uemura, Satoshi Yoshida, Takuya Kida, Tatsuya Asai, and Seishi Okamoto: Training parse trees for efficient VF coding, In SPIRE2010, pp. 179-184, 2010.

Paradigm shift 2
We use data compression technology to reduce the cost of storing and transferring the data.
We can speed up pattern matching by compressing the data.
Overcome the difficulties of various kinds of processing by using compression technology!

Doing something by using compression
Speeding up the calculation of similarity between two long strings by a compression technique.
– M. Crochemore, G. M. Landau, and M. Ziv-Ukelson: "A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices", Proceedings of the 13th Symposium on Discrete Algorithms, pp.679-688, 2002.
Processing a very large graph structure in memory at high speed by a compression technique.
– Shinichi Nakano (Gunma University): "Graph compression with query support". Their method can represent a triangulated planar graph in 2m+o(n) bits and can moreover support some queries on it.
Speeding up the query processing for XML data by a compression technique.
– Tetsuya Maita and Hiroshi Sakamoto (Kyushu Institute of Technology)

The 6th summary
Pattern matching algorithms on compressed texts
– Pattern matching on Huffman encoded text → automaton with synchronization
– Pattern matching on LZW compressed text → simulating the move of KMP (AC) on the compressed text
Unified framework: Collage system
– A formal system to represent a text compressed by a dictionary-based compression method
– We have clarified what kind of compression methods are suitable for pattern matching.
Aspect of speeding up pattern matching by compression
– BPE compression: it has a low compression ratio, but it can speed up pattern matching
– Our experimental results showed that we could do pattern matching faster than on the original texts
A big paradigm shift has occurred
– Data compression technology can be used for purposes other than reducing the data size
The next theme (which is the final topic of "Information retrieval and pattern matching")
– Various topics I didn't mention