Efficient Decodable and Searchable Natural Language Adaptive Compression Salvador de Bahía. Aug...

35
Efficient Decodable and Searchable Natural Language Adaptive Compression Salvador de Bahía. Aug 2005 28 th Annual International ACM SIGIR Gonzalo Navarro (DCC – U. Chile) Nieves R. Brisaboa (UDC – Spain) José R. Paramá (UDC – Spain) Antonio Fariña (UDC – Spain)
  • date post

    19-Dec-2015
  • Category

    Documents

  • view

    217
  • download

    1

Transcript of Efficient Decodable and Searchable Natural Language Adaptive Compression Salvador de Bahía. Aug...

Efficient Decodable and Searchable Natural Language Adaptive

Compression

Salvador de Bahía. Aug 2005

28th Annual International ACM SIGIR

Gonzalo Navarro (DCC – U. Chile)

Nieves R. Brisaboa (UDC – Spain)

José R. Paramá (UDC – Spain)

Antonio Fariña (UDC – Spain)

August 2005 ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression 2

— Semistatic CompressorsWord-based Huffman: Plain & Tagged Huffman

End-Tagged Dense Code (ETDC)

— Dynamic CompressorsDynamic End-Tagged Dense Code (DETDC)

OutlineOutlineOutlineOutline

Conclusions

Word-based compression

Introduction

Dynamic Lightweight ETDC (DLETDC)— Basic Ideas

— Searching DLETDC— Empirical Results

August 2005 ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression 3

IntroductionIntroduction

Why compression? My huge disk is cheap. Compression reduces not only space !

— Disk access time (your huge disk is slow)

— Transmission time (networks are even slower)

— Search time (less data to process)

Compression can be integrated into Text Retrieval Systems, improving their performance in all aspects

Word-based semistatic models proved successfulWord-based semistatic models proved successful MG system (Witten, Moffat, Bell) Byte-oriented Huffman (Moura, N., Ziviani, Baeza-Yates) End-Tagged Dense Codes (Brisaboa, Fariña, N.)

August 2005 ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression 4

IntroductionIntroduction

But what about sending results to a remote receiver?— Uncompressed form (wastes network bandwidth)

— Keep in compressed form (plus the large model of the corpus)

— Semistatic recompression (uneffective for small files)

— Dynamic recompressionDynamic recompression (effective, permits a conversation)Gzip (40%), DETDC (32%), arithmetic encoding (25%), …

Semistatic compression is preferred in Text Databases

� Reduces their size: 25%-30% compression ratio

� Permits direct access and local decompression

� Permits direct search on the compressed text� Searches are up to 8 times faster.

� Decompression is only needed for presenting results

August 2005 ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression 5

IntroductionIntroduction

Dynamic compression: Receiver

Decompression (to present results)

Keyword search (to classify documents, alert users, …)

Dictionary-based (Ziv-Lempel):

Fast decompression

Decent direct searching

Not so good compression ratios

Bad for low bandwidth networks

Statistical (arithmetic, PPM):

Good compression ratios

Costly at sender and receiver side

Direct searching impossible

Bad for weak receivers

August 2005 ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression 6

IntroductionIntroduction Our contribution: DLETC

—New dynamic compressor for text databasesSimple to understand and implement

—StatisticalGood compression ratio

Good for low-bandwidth networks

—Easier to handle by the receiverBreaks the usual sender/receiver symmetry

Very fast decompression

Good for weak receivers

—Searchable without decompressingFirst statistical method permitting direct search

August 2005 ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression 7

— Semistatic CompressorsWord-based Huffman: Plain & Tagged Huffman

End-Tagged Dense Code (ETDC)

— Dynamic CompressorsDynamic End-Tagged Dense Code (DETDC)

OutlineOutlineOutlineOutline

Conclusions

Word-based compression

Introduction

Dynamic Lightweight ETDC (DLETDC)— Basic Ideas

— Searching DLETDC— Empirical Results

August 2005 ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression 8

Semistatic compressionSemistatic compression

Statistical semistatic compression

— 2 passes

Gather frequencies of words and encoding

Compression (substitution word codeword)

— Association between source symbol codeword does not change across the text

— Direct search is possible— Most representative method: Huffman.

August 2005 ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression 9

Word-based HuffmanWord-based Huffman

Moffat proposed the use of words instead of characters as the coding alphabet

Distribution of words more biased than that of characters

0

2

4

6

8

10

12

14

16

18

E A O L S N D R U I T C P M Y Q B H G F V J Ñ Z X K W

n

Word freq

distribution

Character freq distribution (Spanish)

Compression ratio about 25% (English texts) This idea joins the requirements of compression

algorithms and of Information Retrieval systems

August 2005 ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression 10

Plain Huffman & Tagged HuffmanPlain Huffman & Tagged Huffman

Moura, Navarro, Ziviani and Baeza: — 2 new techniques: Plain Huffman and Tagged Huffman

Common features— Huffman-based

— Word-based

— Byte- rather than bit-oriented (compression ratio ±30%)

Plain Huffman = Huffman over bytes (256-ary tree) Tagged Huffman flags the beginning of each codeword

First bit is:•“1” for 1st bit of 1st byte•“0” for 1st bit remaining bytes

1xxxxxxx 0xxxxxxx 0xxxxxxx

August 2005 ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression 11

Plain Huffman & Tagged HuffmanPlain Huffman & Tagged Huffman

Differences:

— Plain Huffman. (tree of arity 2b=256)

— Tagged Huffman. (tree of arity 2b-1 =128)

Loss of compression ratio (3.5 perc. points)

Tag marks beginning of codewords

Direct search (improved searches)Boyer-Moore

Random access (random decompression)

August 2005 ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression 12

— Semi-static CompressorsWord-based Huffman: Plain & Tagged Huffman

End-Tagged Dense Code (ETDC)

— Dynamic CompressorsDynamic End-Tagged Dense Code (DETDC)

OutlineOutlineOutlineOutline

Conclusions

Word-based compression

Introduction

Dynamic Lightweight ETDC (DLETDC)— Basic Ideas

— Searching DLETDC— Empirical Results

August 2005 ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression 13

End-Tagged Dense CodeEnd-Tagged Dense Code

Small change: Flag signal the end of a codeword

Prefix code independently of the remaining 7 bits of each byte

Huffman tree is not needed: Dense coding.• Improving Tagged Huffman compression ratio by 2.5 perc. points

Flag bit Same Tagged Huffman searching capabilities

First bit is: “1” --> for 1st bit of last byte “0” --> for 1st bit remaining bytes

1xxxxxxx

0xxxxxxx

Two-byte codeword 1xxxxxxx0xxxxxxx

Three-byte codeword 1xxxxxxx0xxxxxxx0xxxxxxx

August 2005 ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression 14

End-Tagged Dense CodeEnd-Tagged Dense Code Encoding scheme

.....

Words from 128+ 1282+1 to

128 +1282 +1283 use three bytes

(1283 = 221 codewords)

00000000:00000000:10000000

……

01111111:01111111:11111111

Words from 128+1 to 128+1282 are encoded using two bytes

(1282 = 214 codewords)

00000000:10000000

…..

01111111:11111111

First 128 words are encoded using one byte (27 codewords)

1000000010000001…..11111111

Codewords depend on the rank, not on the frequency

August 2005 ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression 15

— Semi-static CompressorsWord-based Huffman: Plain & Tagged Huffman

End-Tagged Dense Code (ETDC)

— Dynamic CompressorsDynamic End-Tagged Dense Code (DETDC)

OutlineOutlineOutlineOutline

Conclusions

Word-based compression

Introduction

Dynamic Lightweight ETDC (DLETDC)— Basic Ideas

— Searching DLETDC— Empirical Results

August 2005 ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression 16

Dynamic codesDynamic codes

Statistical dynamic compression

— Dynamic modelingFrequencies are computed dynamically as text arrives.

— Sender and receiver…

Start with an empty model

Adapt their model during the process.

Both processes are symmetric

August 2005 ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression 17

Dynamic End-Tagged Dense CodeDynamic End-Tagged Dense Code

Symmetric sender and receiver. It uses the on-the-fly encoding and decoding algorithms

— Requirement: maintaining the vocabulary sorted by freq.

speed

Sender’s vocabulary

1 the 8

2 is 4

3 code 2

4 tag 2

5 Huffman 2

6

7 ratio 1

8 1

ETDC

speed C8 = encode (8)

Exampleword freq

speed

ETDC

21

send codeword increase frequencyexchange with top

12 The codeword given to a word si may

change each time si is processed.

DETDC …

August 2005 ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression 18

rose

Dynamic End-Tagged Dense Code: transmissionDynamic End-Tagged Dense Code: transmission

senderreceiver

Example: … a rose roserose is a very nice one

rose

a126

127

128

129

1

C127

rose

1 a

rose

126

127

128

129

1

1

vocabulary

word freq

vocabulary

word freq

August 2005 ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression 19

is

is

is 1

126

127

128

129

vocabulary

word freq

Dynamic End-Tagged Dense Code: transmissionDynamic End-Tagged Dense Code: transmission

1

rose

senderreceiver

is

a

2

1

rose

a

2

is

is 1

Example: … a rose rose isis a very nice one

126

127

128

129

vocabulary

word freq

Cnew

0|128

August 2005 ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression 20

— Semi-static CompressorsWord-based Huffman: Plain & Tagged Huffman

End-Tagged Dense Code (ETDC)

— Dynamic CompressorsDynamic End-Tagged Dense Code (DETDC)

OutlineOutlineOutlineOutline

Conclusions

Word-based compression

Introduction

Dynamic Lightweight ETDC (DLETDC)— Basic Ideas— Searching DLETDC— Empirical Results

August 2005 ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression 21

DLETDC – Basic IdeasDLETDC – Basic Ideas

The idea:— Break the correspondence between position and codeword— Codewords are mantained explicitly by the sender— SENDER: After processing a symbol si …

Exchange si top(fi) to keep the vocabulary sorted,

Keep original codewords unless codeword length should vary

Otherwise swap, and notify the receiver.

— 2 fixed slots are reserved: new-symbol, ascii word

Swap, Ci, CodeInj.

New-symbolswap

0

1

128129

25500

127128129

128129

1270

1651116512

2550 128

016513 0 129

1272113662 127 2542113663 127 127 255

02113664 0 0 128

1272113661 127 253

Rank

1-byte

2-byte

3-byte

— RECEIVER: does not maintain frequencies

Decodes codewords

Adds new symbols (Cnew-symbol)

Only swaps when it decodes a Cswap

August 2005 ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression 22

C127

very 1

126

127

128

129

vocabulary

word

DLETDC: transmissionDLETDC: transmission

senderreceiver

a

1

1 is

a

a

very

Example: … a rose rose is aa very nice one

126

127

128

129

vocabulary

word freq

125 rose125 rose 2 C125

is C126

a C127

C128

1-by

te2-

byte

No codeword swap needed since: “a” C127 and “is” C126

code

2

August 2005 ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression 23

128

1127

2

CswapC128C126

1

126

127

128

129

vocabulary

word

DLETDC: transmissionDLETDC: transmission

senderreceiver

very

is

a

very

very

Example: … a rose rose is a veryvery nice one

126

129

vocabulary

word freq

125 rose125 rose 2 C125

a C127

is C126

very C128

1-by

te2-

byte

A codeword swap is needed

code

2

August 2005 ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression 24

128

130

Cnew

128

2127

2

1

126

127

129

vocabulary

word

DLETDC: transmissionDLETDC: transmission

senderreceiver

nice

very

a

nice

is

Example: … a rose rose is a very nicenice one

126

129

vocabulary

word freq

125 rose125 rose 2 C125

a C127

very C126

is C128

1-by

te2-

byte

A new word case

code

nice

130

nice 1 C129 nice

August 2005 ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression 25

DLETDC – Evolution of swaps.DLETDC – Evolution of swaps.

Swaps worsen compression ratio and processing time How many swaps occur in practice?

ZIFF corpus4,6x107 words

237,622 diff words

Between lengts 1 and 2Between lengts 2 and 3

August 2005 ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression 26

— Semi-static CompressorsWord-based Huffman: Plain & Tagged Huffman

End-Tagged Dense Code (ETDC)

— Dynamic CompressorsDynamic End-Tagged Dense Code (DETDC)

OutlineOutlineOutlineOutline

Conclusions

Word-based compression

Introduction

Dynamic Lightweight ETDC (DLETDC)— The code— Searching DLETDC— Empirical Results

August 2005 ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression 27

DLETDC – Searches – Keyword filteringDLETDC – Searches – Keyword filtering Multipattern Horspool

— A trie is traversed to recognize text backwards

Search process:

— Initial phaseSearch for the ASCII pattern: C_new | (pattern)

— When foundReplace in the trie by its codeword: C_pat

Need also to monitor C_new to know C_pat

— Watch the swaps

Track C_swap C_pat * and C_swap * C_pat It suffices to search for:

• C_new• C_pat

and make some neighborhood checks

August 2005 ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression 28

— Semi-static CompressorsWord-based Huffman: Plain & Tagged Huffman

End-Tagged Dense Code (ETDC)

— Dynamic CompressorsDynamic End-Tagged Dense Code (DETDC)

OutlineOutlineOutlineOutline

Conclusions

Word-based compression

Introduction

Dynamic Lightweight ETDC (DLETDC)— The code— Searching DLETDC— Empirical Results

August 2005 ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression 29

Empirical ResultsEmpirical Results

We used some text collections from TREC-2 & TREC- 4, to perform the experiments

Results in:— Compression ratio

— Compression time

— Decompression time

— Search speed

CORPUS size (bytes) # words Diff. words

CALGARY 2,131,045 528,611 30,995

FT91 14,749,355 3,135,383 75,681

CR 51,085,545 10,230,907 117,713

FT92 175,449,235 36,803,204 284,892

ZIFF 185,220,215 40,866,492 237,622

FT93 197,586,294 42,063,804 291,427

FT94 203,783,923 43,335,126 295,018

AP 250,714,271 53,349,620 269,141

ALL FT 591,568,807 124,971,944 577,352

ALL 1,080,719,883 229,596,845 886,190— Dual Intel Pentium-III 800 Mhz with 768Mb RAM.

Debian GNU/Linux (kernel 2.2.19)

gcc 3.3.3 20040429 and –O9 optimizations

Time represents CPU user-time

August 2005 ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression 30

Empirical Results. Compression RatioEmpirical Results. Compression Ratio

gzip DLETDC DETDC arith bzip2

26

27

28

29

30

31

32

33

34

35co

mp

ress

ion

ra

tio (

%)

technique

35.09 33.69 33.66 27.98 25.98

Comparison with other techniques.

August 2005 ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression 31

Empirical Results. Compression timeEmpirical Results. Compression time

gzip DLETDC DETDC arith bzip2

100

200

300

400

500

600

700C

om

pre

ssio

n ti

me

(se

c)

technique

431.5 154.6 150.1 510.0 1342.0

August 2005 ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression 32

Empirical Results. Decompression timeEmpirical Results. Decompression time

gzip DLETDC DETDC arith bzip2

50

100

150

200

250

300

350

400

De

om

pre

ssio

n ti

me

(se

c)

technique

59.8 49.5 75.8 394.1 432.4

August 2005 ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression 33

Search speedSearch speed

5 10 15 20 25

2

4

6

8

10

12

14

16

18

20

Avg

se

arc

h ti

me

(se

c)

number of Patterns

Multi-pattern searches

Plain Text

DLETDC

3.5

2

3.4

6

4.3

8

4.4

9

5.0

1

10

.14

13

.80

15

.84

17

.61

18

.70

Avg time over 10,000 random searches for words of length >= 6

August 2005 ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression 34

— Semi-static CompressorsWord-based Huffman: Plain & Tagged Huffman

End-Tagged Dense Code (ETDC)

— Dynamic CompressorsDynamic End-Tagged Dense Code (DETDC)

OutlineOutlineOutlineOutline

Conclusions

Word-based compression

Introduction

Dynamic Lightweight ETDC (DLETDC)— Basic Ideas

— Searching DLETDC— Empirical Results

August 2005 ACM SIGIR’05 – Efficiently Decodable and Searchable N.L. Adaptive Compression 35

ConclusionsConclusions

We have presented DLETDC— A new dynamic “dense” compressor.

— Designed for low bandwidth and weak receivers.

Competitive compression ratio (around 32-34%, < gzip) Fast at compression (much faster than other adaptive) Very fast at decompression (even faster than gunzip!) Very fast at searching (>3 times faster than plain searching)

DLETDC breaks the common symmetry between sender and receiver in adaptive techniques.— Lightweight and fast decompressor/receiver.

— A dynamic searchable technique.