
New Compression Codes for Text Databases

A Coruña. April 28th, 2005

Laboratorio de Bases de Datos

Universidade da Coruña

Ph.D. Dissertation

Antonio Fariña Martínez

Nieves R. Brisaboa

Gonzalo Navarro

Advisors

April 2005. Ph.D. dissertation – Antonio Fariña Martínez

Outline

— Introduction

— Two Scenarios

— Semi-static Compressors
  Word-based Huffman: Plain & Tagged Huffman
  End-Tagged Dense Code
  (s,c)-Dense Code
  Empirical Results

— Dynamic Compressors
  Statistical Dynamic Compressors
  Word-based Huffman
  Dynamic End-Tagged Dense Code
  Dynamic (s,c)-Dense Code
  Empirical Results

— Conclusions and Contributions of the Thesis

Introduction

In recent years…
— Increase in the number of text collections and in their size
  The Web, text databases (jurisprudence, linguistics, corporate data)

Compression reduces:
— Storage space
— Disk access time
— Transmission time

Applications
— Efficient retrieval. Searches can be even faster than in plain text
— Transmission of data (and also real-time transmission)
— …

Introduction

Importance of compression in Text Databases
— Reduces their size: about 30% compression ratio
— It is possible to search the compressed text directly. Searches are up to 8 times faster.
— Decompression is only needed for presenting results
— Compression can be integrated into Text Retrieval Systems, improving their performance
— …

Introduction

Two big families of compressors
— Dictionary-based (gzip, compress, …)
— Statistical compressors (Huffman-based, arithmetic, PPM, …)

Statistical compressors
— Compute the frequencies of the source symbols
— More frequent symbols get shorter codewords
— A variable-length codeword is assigned to each symbol → compression is obtained


Scenarios

Two scenarios
— Semi-static compression (2 passes)
  Text storage
  Text Retrieval
— Dynamic compression (1 pass)
  Transmission of data
  Transmission of streams


Semi-static compression

Statistical compression

2 passes
— Gather the frequencies of the source symbols and perform the encoding
— Compression (substitution: source symbol → codeword)

Semi-static compression

1st pass: Text processing → Vocabulary sorting → Code generation

[Figure] Source text: "En un lugar de la Mancha de cuyo nombre no quiero acordarme no …". The vocabulary is sorted by frequency and each word is assigned a codeword C1 … C11 in rank order:

word       freq
de         2
no         2
En         1
un         1
lugar      1
acordarme  1
la         1
Mancha     1
cuyo       1
nombre     1
quiero     1

Semi-static compression

2nd pass: Substitution word → codeword

[Figure] Each word of the source text is replaced by its codeword. The output file consists of a header holding the sorted vocabulary (de*no*En*…) followed by the compressed data (the sequence of codewords).

Semi-static compression

Statistical compression

2 passes
— Gather the frequencies of the words and perform the encoding
— Compression (substitution: word → codeword)

The association between source symbol and codeword does not change across the text
— Direct search is possible

Most representative method: Huffman.

Classic Huffman

Optimal prefix code (i.e., shortest total length)

Character based (originally)
— Characters are the symbols to be encoded

Bit oriented: codewords are sequences of bits

A Huffman tree is built to generate codewords

Building a Huffman tree

Symbols       A     B     C     D     E
Frequency/n   0.25  0.25  0.20  0.15  0.15

Bottom-up construction:
1. Sort the symbols by frequency.
2. Join the two least frequent nodes, D and E (0.15 + 0.15 = 0.30).
3. Join B and C (0.25 + 0.20 = 0.45).
4. Join A with the DE node (0.25 + 0.30 = 0.55).
5. Join the two remaining nodes (0.55 + 0.45 = 1.00); this is the root.
6. Label the branches of each node with 0 and 1; the codeword of a symbol is read off the path from the root down to its leaf.

Code assignment:

Symbols    A   B   C   D    E
Codeword   01  10  11  000  001

Example: encoding ADB → 01 000 10
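The construction above can be sketched in a few lines of Python (a minimal sketch, not the thesis's implementation; ties are broken by insertion order, so the exact bit patterns may differ from the slides while the codeword lengths stay optimal):

```python
import heapq

def huffman_codes(freqs):
    """freqs: dict symbol -> frequency. Returns dict symbol -> bit string."""
    # Heap entries: (weight, tiebreak, tree); a tree is a symbol or a pair of subtrees.
    heap = [(w, i, sym) for i, (sym, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)   # the two lightest subtrees...
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, tiebreak, (t1, t2)))  # ...are joined
        tiebreak += 1
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):       # internal node: label branches 0 and 1
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:                             # leaf: the accumulated labels are the codeword
            codes[tree] = prefix or "0"
    walk(heap[0][2], "")
    return codes

freqs = {"A": 0.25, "B": 0.25, "C": 0.20, "D": 0.15, "E": 0.15}
codes = huffman_codes(freqs)
avg = sum(freqs[s] * len(codes[s]) for s in freqs)  # 2.30 bits/symbol, as optimal
```

Any optimal Huffman code for these frequencies yields codeword lengths 2, 2, 2, 3, 3 and an average length of 2.30 bits per symbol.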

Classic Huffman

Optimal prefix code (i.e., shortest total length)

Character based (originally)
— Characters are the symbols to be encoded

Bit oriented: codewords are sequences of bits

A Huffman tree is built to generate codewords

The Huffman tree has to be stored along with the compressed text

Compression ratio about 60%, so it never became popular.


Word-based Huffman

Moffat proposed the use of words instead of characters as the coding alphabet

The distribution of words is more biased than the distribution of characters

[Figure: word frequency distribution vs. character frequency distribution (Spanish)]

Compression ratio about 25% (English texts)

This idea meets the requirements of both compression algorithms and Information Retrieval systems

Plain Huffman & Tagged Huffman

Moura, Navarro, Ziviani and Baeza-Yates proposed 2 new techniques: Plain Huffman and Tagged Huffman

Common features
— Huffman-based
— Word-based
— Byte- rather than bit-oriented (compression ratio ±30%)

Plain Huffman = Huffman over bytes (256-ary tree)

Tagged Huffman flags the beginning of each codeword. The first bit is:
• "1" for the 1st bit of the 1st byte
• "0" for the 1st bit of the remaining bytes

1xxxxxxx 0xxxxxxx 0xxxxxxx

Plain Huffman & Tagged Huffman

Differences:
— Plain Huffman: tree of arity 2^b = 256
— Tagged Huffman: tree of arity 2^(b-1) = 128
  The tag marks the beginning of codewords
  Loss of compression ratio (3.5 percentage points)
  Direct search (searches improved): Boyer-Moore
  Random access (random decompression)

Plain Huffman & Tagged Huffman: searches

Example (b=2): "to be lucky or not"

Direct searching (compress the pattern and search for it). Boyer-Moore type searching is possible (skipping bytes). Searches are improved.

PLAIN HUFFMAN
to = 11 11, be = 00, lucky = 11 00, or = 01, not = 10
"to be lucky or not" → 1111 00 1100 01 10
Searching for "lucky" (11 00) can produce a false match: the end of "to" followed by "be" also reads 11 00.

TAGGED HUFFMAN
to = 11 01 01 01, be = 10, lucky = 11 01 01 00, or = 11 00, not = 11 01 00
"to be lucky or not" → 11010101 10 11010100 1100 110100
False matches are impossible.
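The false-match effect can be checked directly with the slide's 2-bit codewords (a small Python sketch; the codewords are written here as bit strings):

```python
# Plain Huffman vs. Tagged Huffman codewords from the b = 2 example above.
# In the Plain Huffman stream, a pattern's codeword can match across codeword
# boundaries (a false match); the Tagged Huffman flag bit prevents this.
ph = {"to": "1111", "be": "00", "lucky": "1100", "or": "01", "not": "10"}
th = {"to": "11010101", "be": "10", "lucky": "11010100", "or": "1100", "not": "110100"}

def occurrences(stream, pattern):
    """Count (possibly overlapping) occurrences of pattern inside stream."""
    return sum(stream.startswith(pattern, k) for k in range(len(stream)))

text = "to be lucky or not".split()
ph_stream = "".join(ph[w] for w in text)
th_stream = "".join(th[w] for w in text)

occurrences(ph_stream, ph["lucky"])   # 2: one real match plus one false match
occurrences(th_stream, th["lucky"])   # 1: only the real match
```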


End-Tagged Dense Code

Small change: the flag signals the end of a codeword. The first bit is:
— "1" for the 1st bit of the last byte
— "0" for the 1st bit of the remaining bytes

One-byte codeword:   1xxxxxxx
Two-byte codeword:   0xxxxxxx 1xxxxxxx
Three-byte codeword: 0xxxxxxx 0xxxxxxx 1xxxxxxx

It is a prefix code independently of the contents of the remaining 7 bits of each byte

A Huffman tree is not needed: dense coding

Flag bit → same searching capabilities as Tagged Huffman

End-Tagged Dense Code: Encoding scheme

First 128 words: one byte (2^7 codewords)
  10000000 … 11111111

Words from 128+1 to 128+128^2: two bytes (128^2 = 2^14 codewords)
  00000000:10000000 … 01111111:11111111

Words from 128+128^2+1 to 128+128^2+128^3: three bytes (128^3 = 2^21 codewords)
  00000000:00000000:10000000 … 01111111:01111111:11111111

Codewords depend on the rank, not on the frequency

End-Tagged Dense Code

Sequential encoding procedure
— Sort the words by frequency
— Assign codewords sequentially in rank order: continuer bytes 0xxxxxxx take values < 2^(b-1); the final byte 1xxxxxxx takes values ≥ 2^(b-1)

On-the-fly encoding procedure
— Ci ← encode(i): the codeword is computed directly from the rank i of the word
— i ← decode(Ci): the rank is recovered directly from the codeword
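The encode/decode pair needs no tree or table (a minimal Python sketch for b = 8 with 0-based ranks; it follows the byte layout above, not necessarily the thesis's exact code):

```python
def etdc_encode(i):
    """Rank i (0-based, by frequency) -> ETDC codeword (bytes).
    The last byte carries the flag (values 128..255); other bytes use 0..127."""
    code = [128 + (i % 128)]          # stopper: flagged last byte
    i = i // 128 - 1
    while i >= 0:                     # prepend continuer bytes as needed
        code.insert(0, i % 128)
        i = i // 128 - 1
    return bytes(code)

def etdc_decode(code):
    """ETDC codeword -> rank i; inverse of etdc_encode."""
    i = 0
    for b in code[:-1]:               # continuer bytes
        i = (i + b + 1) * 128
    return i + (code[-1] - 128)       # add the stopper value

etdc_encode(0)      # b'\x80' : first rank, one byte
etdc_encode(128)    # b'\x00\x80' : first two-byte codeword
```

The first 128 ranks fit in one byte, the next 128^2 in two bytes, and so on, exactly as in the encoding scheme above.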

End-Tagged Dense Code

Decompression: two steps
— Load the ranked vocabulary from the header of the compressed file (de*no*En*…)
— For each codeword Ci in the compressed data, compute i ← decode(Ci) and output the i-th word of the vocabulary, recovering the plain text

End-Tagged Dense Code

It is a dense code: all the available codes can be used.
— Compresses better than TH (2 percentage points).
— It is beaten by PH (1 percentage point).

Flag → same search capabilities as Tagged Huffman
— direct search,
— random access.

Efficient encoding and decoding
— Sequential and on-the-fly procedures

Easy to program


(s,c)-Dense Code

End-Tagged Dense Code:
— 2^7 available values [128, 255] for the last byte (stoppers)
— 2^7 available values [0, 127] for the other bytes (continuers)

Why use fixed s, c values? Example, with n = 5,000 words:
— s = c = 128 gives 128 + 128×128 = 16,512 one- and two-byte codewords (far more than needed)
— s = 230, c = 26 gives 230 + 230×26 = 6,210, so many more words receive a one-byte codeword

Adapting the s, c values to the vocabulary minimizes the compressed text size, depending on:
— the number of words
— the distribution of the frequencies of the words

End-Tagged Dense Code is a (128,128)-Dense Code
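The arithmetic behind this example is easy to check (a trivial Python sketch):

```python
def short_codewords(s, b=256):
    """Number of 1- and 2-byte codewords of an (s, b-s)-Dense Code: s + s*c."""
    c = b - s
    return s + s * c

short_codewords(128)   # 16512 : far more than needed for n = 5,000
short_codewords(230)   # 6210  : still enough, with 230 one-byte codewords
```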

(s,c)-Dense Code: Encoding scheme

— Stoppers: last byte, values in [0, s-1]
— Continuers: other bytes, values in [s, s+c-1]

The s most frequent words: one-byte codewords 0 … s-1
Words from s+1 to s+sc (sc words in total): two-byte codewords
Words from s+sc+1 to s+sc+sc^2 (sc^2 words in total): three-byte codewords

(s,c)-Dense Code

Example (b=3)

Word  Freq    P.H.        (6,2)-DC    (5,3)-DC    (4,4)-DC = ETDC
A     0.20    [000]       [000]       [000]       [000]
B     0.20    [001]       [001]       [001]       [001]
C     0.15    [010]       [010]       [010]       [010]
D     0.15    [011]       [011]       [011]       [011]
E     0.14    [100]       [100]       [100]       [100][000]
F     0.09    [101]       [101]       [101][000]  [100][001]
G     0.04    [110]       [110][000]  [101][001]  [100][010]
H     0.02    [111][000]  [110][001]  [101][010]  [100][011]
I     0.005   [111][001]  [110][010]  [101][011]  [101][000]
J     0.005   [111][010]  [110][011]  [101][100]  [101][001]

Average codeword length: P.H. 1.03, (6,2)-DC 1.07, (5,3)-DC 1.16, (4,4)-DC (ETDC) 1.30

End-Tagged Dense Code is a (2^(b-1), 2^(b-1))-DC

(s,c)-Dense Code

Sequential encoding: codewords are assigned to the words in rank order.

On-the-fly encoding is also possible:
— Ci ← encode(s, i)
— i ← decode(s, Ci)

Byte values: the stopper satisfies 0 ≤ vs < s; each continuer satisfies s ≤ vc ≤ 2^b − 1.
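The on-the-fly pair generalizes the ETDC one by replacing the fixed 128/128 split with s stopper values and c = 2^b − s continuer values (a minimal Python sketch with b = 8 and 0-based ranks; it writes stoppers as values 0…s−1, as in the encoding scheme above, and is not necessarily the thesis's exact code):

```python
def scdc_encode(s, i):
    """Rank i (0-based) -> (s,c)-Dense codeword, with c = 256 - s.
    Stopper: last byte, value in [0, s-1]; continuers: values in [s, 255]."""
    c = 256 - s
    code = [i % s]                    # stopper
    i = i // s - 1
    while i >= 0:                     # prepend continuers
        code.insert(0, s + i % c)
        i = i // c - 1
    return bytes(code)

def scdc_decode(s, code):
    """(s,c)-Dense codeword -> rank i; inverse of scdc_encode."""
    c = 256 - s
    i = -1
    for b in code[:-1]:
        i = (i + 1) * c + (b - s)
    return (i + 1) * s + code[-1]

scdc_encode(230, 0)     # b'\x00' : first rank, one byte
scdc_encode(230, 230)   # b'\xe6\x00' : first two-byte codeword (continuer 230, stopper 0)
```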

(s,c)-Dense Code

It is a dense code
— Compresses better than TH (3 percentage points)
— Compresses better than ETDC (0.75 percentage points)
— It is beaten by PH (0.25 percentage points)
— RATIO: PH < SCDC << ETDC <<< TH

Flag? (byte value < s)
— Same search capabilities as End-Tagged Dense Code and Tagged Huffman

Simple encoding and decoding


Empirical Results

We used several text collections from TREC-2 and TREC-4 to perform thorough experiments.

Methods compared:
— Plain Huffman & Tagged Huffman
— End-Tagged Dense Code
— (s,c)-Dense Code

Comparison in:
— Compression ratio
— Encoding speed
— Compression speed
— Decompression speed
— Search speed

Test machine: dual Intel Pentium-III 800 MHz with 768 MB RAM, Debian GNU/Linux (kernel 2.2.19), gcc 3.3.3 20040429 with -O9 optimizations. Times represent CPU user time.

CORPUS size (bytes) #words (N) voc. (n)

CALGARY 2,131,045 528,611 30,995

FT91 14,749,355 3,135,383 75,681

CR 51,085,545 10,230,907 117,713

FT92 175,449,235 36,803,204 284,892

ZIFF 185,220,215 40,866,492 237,622

FT93 197,586,294 42,063,804 291,427

FT94 203,783,923 43,335,126 295,018

AP 250,714,271 53,349,620 269,141

ALL FT 591,568,807 124,971,944 577,352

ALL 1,080,719,883 229,596,845 886,190


Compression Ratio

[Figure: compression ratio (%) by technique]

PH 30.73   (s,c)-DC 30.88   ETDC 31.56   TH 34.16

In general: PH < (s,c)-DC < ETDC < TH, with differences of about 0.2, 0.8 and 2.5 percentage points respectively

Compression time & encoding time

[Figure: compression pipeline]

1st pass: file processing → vocabulary extraction (hash table) → sorted words vector (word, freq)

Encoding:
— Huffman: create the Huffman tree (build tree → set depths) → code generation
— ETDC: sequential code generation
— (s,c)-DC: accumulated list of frequencies → find the best s → sequential code generation

2nd pass: compression phase (word → code)

Encoding time

[Figure: encoding time (msec) by technique]

PH 260   (s,c)-DC 143   ETDC 104   TH 270

In general: ETDC < (s,c)-DC < PH < TH, with differences of about 25%, 45% and 2%

Compression speed

[Figure: compression speed (Mbytes/sec) by technique]

PH 5.92   (s,c)-DC 5.88   ETDC 5.90   TH 5.83

In general: PH ≈ ETDC > (s,c)-DC > TH

Decompression speed

[Figure: decompression speed (Mbytes/sec) by technique]

PH 23.86   (s,c)-DC 23.55   ETDC 24.15   TH 22.51

In general: ETDC > PH ≈ (s,c)-DC > TH, with differences of about 1.5% and 4%

Search time

[Figure: search time (sec) by technique, patterns with 3-byte codes]

PH 2.30   (s,c)-DC 1.70   ETDC 1.80   TH 2.00

In general: (s,c)-DC < ETDC < TH < PH, with differences of about 5%, 5-10% and 10%

Semi-static compression: Summary

Two new semi-static "dense" compressors: ETDC, SCDC

Simpler and faster encoding scheme than the Huffman-based ones
— Sequential encoding
— On-the-fly encoding

They allow direct search and random access

Speed: good compression and decompression speed

Compression ratio close to Plain Huffman

They beat Tagged Huffman in:
— Compression ratio, compression speed, decompression speed and searches.

Semi-static compression: Summary

[Figures: compression ratio (%) vs. encoding time (msec), and compression ratio (%) vs. search time (sec), for Plain Huffman, Tagged Huffman, (s,c)-Dense Code and End-Tagged Dense Code]


Dynamic compressors

One-pass compressors

Do not allow Text Retrieval
— The association word → codeword varies along the text

Well-suited for file transmission and storage
— Transmission of streams of text
— Real-time transmission
— Parallelism of: compression - transmission & reception - decompression

Motivation: transmission scenarios

Transmission using dynamic compression: compression → transmission → reception → decompression

Example: a chat/talk service ("Hi John. Did you know …?")


Statistical dynamic codes

Dynamic model. One-pass techniques.

Frequencies are computed dynamically as the text arrives.

Sender and receiver…
— Start with an empty vocabulary.
— Adapt their vocabulary during the process.
— Both processes are symmetric.

Statistical dynamic codes

Example: compressing "una rosa es una rosa".

[Figures: sender and receiver vocabularies]

— When "rosa" appears again, it is already in both vocabularies: the sender transmits its current codeword (C2) and both sender and receiver increase its frequency.
— When the new word "es" arrives, the sender transmits it so that the receiver can add it too: both vocabularies now hold "es" with frequency 1 and assign it the next codeword (C3).


Word-based Byte-oriented Dynamic Huffman

Character-based dynamic Huffman:
— [Fal73, Gal78], [Knu85] FGK algorithm, [Vit87]

Our implementation:
— Word-based
  A hash table holds the words (and finds them efficiently)
— Byte-oriented
  Worsens compression ratio but improves decompression speed
  256-ary Huffman tree, more complex than the bit-oriented one
— A well-formed Huffman tree is kept during compression/decompression
  Updating the tree is a complex task (but its height is < 4)
— Compression ratio about 30-35%.

Word-based Byte-oriented Dynamic Huffman: Example

Example with b = 2, so the arity of the tree is 2^2 = 4.

[Figures: tree updates on a tree with leaves a:9, b:8, c:7, d:6, …]
— New symbol "e": the zero-node is split and "e" is added with frequency 1.
— "e" arrives again: its weight is increased (swapping nodes when needed to keep a valid Huffman tree).
— New symbol "g": the zero-node is split again and "g" is added with frequency 1.

Word-based Byte-oriented Dynamic Huffman

The cost of compression is O(number of target symbols output)
— Byte-oriented: O(number of bytes output)
— Bit-oriented: O(number of bits output)

The height of the tree is lower in the byte-oriented approach
→ Fewer propagations of changes upwards in the tree


Dynamic End-Tagged Dense Code

It uses the on-the-fly encoding and decoding algorithms
— Requirement: maintaining the vocabulary sorted by frequency

Steps per word: send its codeword → increase its frequency → exchange it with the top word of its frequency group.

Example: the word "speed" holds rank 8 with frequency 1 (vocabulary: 1 the 8, 2 is 4, 3 code 2, 4 tag 2, 5 Huffman 2, …, 7 ratio 1, 8 speed 1). When "speed" arrives, the sender transmits C8 = encode(8), increases its frequency to 2, and exchanges it with the top word of the frequency-1 group (rank 6). When "speed" arrives again, it is now at rank 6: the sender transmits C6 = encode(6), increases its frequency to 3, and exchanges it with the top word of the frequency-2 group (rank 3, "code").
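The per-word update (send codeword, increase frequency, exchange with the top of the frequency group) can be sketched as follows. This is a simplified sketch: the transmission of a brand-new word in clear form after an escape codeword is omitted, and the top of a frequency group is found here by a linear scan, whereas the thesis keeps it in O(1) with an explicit pointer per group:

```python
def etdc_encode(i):
    """On-the-fly ETDC encoder (as in the semi-static part)."""
    code = [128 + (i % 128)]
    i = i // 128 - 1
    while i >= 0:
        code.insert(0, i % 128)
        i = i // 128 - 1
    return bytes(code)

class DynETDC:
    """Vocabulary kept sorted by decreasing frequency via one swap per word."""
    def __init__(self):
        self.words, self.freq, self.rank = [], [], {}

    def process(self, word):
        """Return the codeword the sender would emit for this occurrence."""
        if word not in self.rank:            # new word enters at the bottom
            self.rank[word] = len(self.words)
            self.words.append(word)
            self.freq.append(0)
        i = self.rank[word]
        code = etdc_encode(i)                # codeword of the *current* rank
        f = self.freq[i]
        j = i
        while j > 0 and self.freq[j - 1] == f:
            j -= 1                           # top of the frequency-f group
        # one swap moves the word to the group border, keeping the order valid
        self.words[i], self.words[j] = self.words[j], self.words[i]
        self.rank[self.words[i]], self.rank[self.words[j]] = i, j
        self.freq[j] = f + 1
        return code

m = DynETDC()
out = [m.process(w) for w in "una rosa es una rosa".split()]
# the vocabulary ends up sorted by frequency: una 2, rosa 2, es 1
```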

Dynamic End-Tagged Dense Code: Summary

It is much simpler than Dynamic Huffman
— Maintaining the ranked words vs. maintaining a Huffman tree.
— Performs one swap per codeword: O(1) vs. O(height of the Huffman tree)
— Cost of compression: O(number of codewords output) vs. O(number of bytes output)


Dynamic (s,c)-Dense Code

Generalization of Dynamic ETDC:
— On-the-fly encoding and decoding,
— Maintaining the sorted vocabulary, and
— Maintaining an optimal value of s

Two methods were designed:
— Counting bytes approach
— Ranges approach

Evolution of the value of s

• s = 256 while n < 256: only one-byte codewords
• s = 255 while n < 510: two-byte codewords are being used
• …
• s falls to s = 129 (around n = 16,512), when three-byte codewords start to be used (s = 129 maximizes s + sc, the number of 1- or 2-byte codewords)
• Then the s value grows again to improve the compression of the high-frequency words

[Figures: value of s vs. n for the Calgary corpus (n up to 30,000) and the AP Newswire corpus (n up to 250,000); both curves bottom out at s = 129]

Maintaining the optimal s and c values

After processing a source word…
— s could be increased by one: s ← s + 1
— s could be decreased by one: s ← s - 1

Three variables prev, s0 and next accumulate the size of the compressed text depending on the s value used:
— prev: size of the compressed text when #stoppers = s - 1
— s0: size of the compressed text when #stoppers = s
— next: size of the compressed text when #stoppers = s + 1

If prev < s0 then s ← s - 1
If next < s0 then s ← s + 1
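The counting-bytes rule can be sketched as follows (a hypothetical Python sketch for b = 8: `scdc_len` gives the codeword length under a given s, and a change of s re-initializes the new counter to the new s0, the save-history option discussed on the following slide; the thesis's actual bookkeeping may differ):

```python
def scdc_len(s, i):
    """Codeword length (in bytes) of rank i under an (s, 256-s)-Dense Code."""
    c, length = 256 - s, 1
    i = i // s - 1
    while i >= 0:
        length += 1
        i = i // c - 1
    return length

class SOptimizer:
    """Track compressed sizes for s-1, s and s+1; adjust s after each word."""
    def __init__(self, s=255):
        self.s = s
        self.prev = self.s0 = self.next = 0

    def update(self, rank):
        self.prev += scdc_len(self.s - 1, rank)
        self.s0 += scdc_len(self.s, rank)
        if self.s < 255:
            self.next += scdc_len(self.s + 1, rank)
        if self.prev < self.s0:                      # s - 1 would have done better
            self.s -= 1
            # shift counters; the new prev starts from the new s0 (saves history)
            self.prev, self.s0, self.next = self.prev, self.prev, self.s0
        elif self.s < 255 and self.next < self.s0:   # s + 1 would have done better
            self.s += 1
            self.prev, self.s0, self.next = self.s0, self.next, self.next

opt = SOptimizer()
for r in range(30000):     # growing vocabulary: word ranks sweep upwards
    opt.update(r)
# s falls as the vocabulary grows and bottoms out at its minimum useful value, 129
```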

Maintaining the optimal s and c values

Assume prev < s0, so s ← s - 1: the old prev becomes the new s0 and the old s0 becomes the new next. How should the new prev (now tracking s - 2) be initialized?
— Initialization to 0: loses the history.
— Initialization to the new s0 value: saves the history.

Using a threshold? Fewer changes of s, but some loss of compression.

Dynamic (s,c)-Dense Code: Summary

Generalization of Dynamic End-Tagged Dense Code
— Same idea: using swaps to maintain the vocabulary sorted
— Extra operations: maintaining the optimal value of s


Empirical Results

Techniques compared:
— Semi-static and dynamic versions of: Plain Huffman, (s,c)-Dense Code, End-Tagged Dense Code
— Arithmetic word-based bit-oriented compressor
— Gzip
— Bzip2

Comparison in:
— Compression ratio
— Compression speed
— Decompression speed

Our dynamic techniques (FT_ALL corpus)

                              D-PH    D-SCDC   D-ETDC
Compr. ratio (%)              31.71   31.84    32.54
Compr. speed (Kbytes/sec)     5,724   6,944    7,459
Decompr. speed (Kbytes/sec)   7,791   14,067   14,800

D-PH obtains the best compression, but is much slower than D-ETDC.
D-SCDC is about 5% slower than D-ETDC, but obtains a better compression ratio.

Dynamic Vs semi-static approachesDynamic Vs semi-static approaches

Compression ratio: dynamism causes negligible loss of compressionCompression speed: 1-pass are faster (except DPH in large texts)Decompression speed: 2-pass are much faster (no updates)

[Figure: bar charts comparing the semi-static compressors (PH, SCDC, ETDC, TH) with the dynamic ones (D-PH, D-SCDC, D-ETDC) on compression ratio, compression speed and decompression speed.]

FT_ALL                               Semi-static                  Dynamic
                                  PH      SCDC    ETDC       D-PH    D-SCDC  D-ETDC
Compression ratio (%)             31.696  31.839  32.527     31.710  31.849  32.537
Compression speed (Kbytes/sec)    5,922   5,882   5,888      5,724   6,944   7,459
Decompression speed (Kbytes/sec)  23,856  23,552  24,146     7,791   14,067  14,800

[Figure: per-corpus comparison (CALGARY, FT91, CR, FT92, ZIFF, FT93, FT94, AP, FT_ALL, ALL) of ETDC, SCDC and PH: gain (%) in compression speed of the dynamic methods (roughly 30–80%) and loss (%) in decompression speed of the dynamic methods (roughly −10–50%).]


Comparison with gzip, bzip2 and arithmetic coding

[Figure: bar charts comparing D-PH, D-SCDC and D-ETDC against an arithmetic coder, gzip and bzip2 (fastest and best modes).]

                                  D-PH    D-SCDC  D-ETDC  arith   gzip -f  gzip -b  bzip2 -f  bzip2 -b
Compression ratio (%)             31.710  31.849  32.537  27.852  40.988   34.845   31.152    25.865
Compression speed (Kbytes/sec)    5,724   6,944   7,459   2,157   5,045    2,015    1,059     824
Decompression speed (Kbytes/sec)  7,791   14,067  14,800  2,762   17,249   19,440   3,115     2,513

— Compression ratio: bzip2 -b is the best; our compressors outperform gzip.
— Compression speed: our compressors are faster than all the others.
— Decompression speed: gzip is the fastest; D-ETDC and D-SCDC come next.

Corpus: FT_ALL


Dynamic compression: Summary

Three dynamic techniques:
— Dynamic Huffman (D-PH) compresses more than the others.
— D-ETDC is the fastest (D-SCDC is close to D-ETDC).

Dynamism:
— Little impact on compression ratio.
— One-pass compressors are faster than two-pass ones at compression.
— Two-pass compressors are much faster at decompression.
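The shared one-pass mechanism can be sketched as follows. This is a simplified model of my own (codewords are plain integers, and code 0 acts as an escape for first occurrences; the real D-PH/D-ETDC/D-SCDC compressors emit dense byte codes instead), but it shows why a single pass suffices: encoder and decoder apply the exact same vocabulary update after each word, so no vocabulary needs to be transmitted.

```python
class DynamicModel:
    """Adaptive word vocabulary kept sorted by decreasing frequency."""

    def __init__(self):
        self.words = []   # words ordered by decreasing frequency
        self.rank = {}    # word -> current position in self.words
        self.freq = {}    # word -> occurrences seen so far

    def _promote(self, word):
        """Increment freq and bubble the word toward the front so the
        list stays sorted by frequency."""
        self.freq[word] += 1
        i = self.rank[word]
        while i > 0 and self.freq[self.words[i - 1]] < self.freq[word]:
            other = self.words[i - 1]
            self.words[i - 1], self.words[i] = word, other
            self.rank[other], self.rank[word] = i, i - 1
            i -= 1

    def encode_word(self, word):
        if word not in self.rank:                 # first occurrence: escape + word
            self.words.append(word)
            self.rank[word] = len(self.words) - 1
            self.freq[word] = 0
            symbol = (0, word)
        else:                                     # known word: its current rank
            symbol = (self.rank[word] + 1, None)
        self._promote(word)                       # identical update on both sides
        return symbol

    def decode_symbol(self, symbol):
        code, payload = symbol
        if code == 0:                             # escape: a brand-new word
            word = payload
            self.words.append(word)
            self.rank[word] = len(self.words) - 1
            self.freq[word] = 0
        else:
            word = self.words[code - 1]
        self._promote(word)
        return word

text = "to be or not to be".split()
enc, dec = DynamicModel(), DynamicModel()
stream = [enc.encode_word(w) for w in text]
assert [dec.decode_symbol(s) for s in stream] == text
```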

They are a valuable contribution:
— Less compression than bzip2 and the arithmetic coder, but much faster.
— Better compression ratio and speed than gzip, though slower at decompression.


Dynamic compression: Summary

[Figure: scatter plots of compression ratio (%, 25–45) versus compression time (0–800 sec) and versus decompression time (0–250 sec) for Dynamic PH, Dynamic (s,c)-Dense Code, Dynamic End-Tagged Dense Code, the arithmetic encoder, gzip -f and bzip2 -b.]


Outline

Introduction
Two Scenarios
— Semi-static Compressors
   Word-based Huffman: Plain & Tagged Huffman
   End-Tagged Dense Code
   (s,c)-Dense Code
   Empirical Results
— Dynamic Compressors
   Statistical Dynamic compressors
   Word-oriented Huffman
   Dynamic End-Tagged Dense Code
   Dynamic (s,c)-Dense Code
   Empirical Results
Conclusions and Contributions of the Thesis


Conclusions: Contributions

“Dense” family of word-based byte-oriented compressors:
— Simple and fast encoding & decoding.
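For illustration, a minimal sketch of the End-Tagged Dense Code byte assignment, assuming the usual convention (word ranks are 0-based by decreasing frequency, and the high bit of the last byte is the tag that makes codewords self-delimiting and directly searchable):

```python
def etdc_encode(rank: int) -> bytes:
    """End-Tagged Dense Code for the rank-th most frequent word (0-based).
    Last byte is tagged (values 128-255); earlier bytes are plain digits 0-127."""
    out = [128 + rank % 128]          # tagged last byte
    rank = rank // 128 - 1
    while rank >= 0:
        out.insert(0, rank % 128)     # untagged continuation byte
        rank = rank // 128 - 1
    return bytes(out)

def etdc_decode(code: bytes) -> int:
    """Inverse of etdc_encode: recover the word rank from a codeword."""
    rank = -1
    for b in code[:-1]:
        rank = (rank + 1) * 128 + b
    return (rank + 1) * 128 + (code[-1] - 128)

# 1-byte codes for the 128 most frequent words, 2-byte for the next 128**2, ...
assert len(etdc_encode(127)) == 1 and len(etdc_encode(128)) == 2
assert all(etdc_decode(etdc_encode(i)) == i for i in range(100000))
```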

Semi-static compressors:
— Compression ratio close to that of Plain Huffman.
— Same searching capabilities as Tagged Huffman.
— Outperform Tagged Huffman in all aspects.
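Similarly, a sketch of the (s,c)-Dense Code, which generalizes ETDC's fixed 128/128 split into s stopper values (carried by the last byte) and c = 256 − s continuer values; tuning s to the text is what brings the ratio close to Plain Huffman. The default s = 192 below is only an illustrative choice, not an optimal value from the thesis:

```python
def scdc_encode(rank: int, s: int = 192) -> bytes:
    """(s,c)-Dense Code: stoppers are byte values 0..s-1 (last byte only),
    continuers are s..255. s = 128 mirrors ETDC (with the tag flipped)."""
    c = 256 - s
    out = [rank % s]                  # stopper: ends the codeword
    rank = rank // s - 1
    while rank >= 0:
        out.insert(0, s + rank % c)   # continuer byte
        rank = rank // c - 1
    return bytes(out)

def scdc_decode(code: bytes, s: int = 192) -> int:
    """Inverse of scdc_encode for the same s."""
    c = 256 - s
    rank = -1
    for b in code[:-1]:
        rank = (rank + 1) * c + (b - s)
    return (rank + 1) * s + code[-1]

# s one-byte codewords, then s*c two-byte codewords, and so on.
assert len(scdc_encode(191)) == 1 and len(scdc_encode(192)) == 2
assert scdc_decode(scdc_encode(300)) == 300
```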

Dynamic compressors:
— Good compression ratio (small cost of dynamism).
— Very fast at compression.
— Good decompression speed.


Conclusions: Concrete contributions

Contributions of the thesis:

Semi-static:
— End-Tagged Dense Code
— (s,c)-Dense Code
— New bounds on D-ary Huffman codes

Dynamic:
— Dynamic word-based byte-oriented Huffman
— Dynamic End-Tagged Dense Code
— Dynamic (s,c)-Dense Code

Efficient implementations
Complete experimental validations


Future Work

Searchable dynamic compression:
— Adaptation of D-ETDC.
— Faster at decompression.
— Accepted at SIGIR’05.

Compression of growing collections:
— Permits adding more text to a compressed corpus.
— Avoids re-compression.
— Idea: encode phrases (for words that become very frequent) with a unique codeword, instead of recompressing.
— Submitted to ECDL’05.


Research Results

Accepted papers:
— Book chapter (1): Ing. Software en la Década del 2000
— National conferences (3): JISBD’04, JISBD’03, JBIDI’03
— International conferences (5): SIGIR’05, SPIRE’04, SCCC’04, VODCA’04, SPIRE’03

Submitted papers:
— International conference (1): ECDL’05

Manuscripts to submit:
— International journals (2): Information Retrieval (IR), Information Processing and Management (IPM)

Research stays (2): DCC (Univ. Chile)
— 2003: August 26th – October 6th
— 2004: December 2nd – December 16th


That’s all!!

Thank you very much !!

…?