UNIT II TEXT COMPRESSION



Outline

Compression techniques – Run-length coding – Huffman coding – Adaptive Huffman coding – Arithmetic coding – Shannon-Fano coding – Dictionary techniques – LZW family algorithms.


Introduction

Compression: the process of coding that will effectively reduce the total number of bits needed to represent certain information.

Fig.1: A General Data Compression Scheme.


If the compression and decompression processes induce no information loss, the compression scheme is lossless; otherwise, it is lossy.

Compression ratio = B0 / B1, where B0 is the number of bits before compression and B1 is the number of bits after compression.

In general, we would want any codec (encoder/decoder scheme) to have a compression ratio much larger than 1.0. The higher the compression ratio, the better the lossless compression scheme, as long as it is computationally feasible.

Basics of Information Theory

What is entropy? Entropy is a measure of the number of specific ways in which a system may be arranged, commonly understood as a measure of the disorder of a system. For an information source S whose symbols si occur with probabilities pi, the entropy H(S) = Σi pi log2(1/pi) gives the average number of bits needed to code each symbol.

As an example, if the information source S is a gray-level digital image, each si is a gray-level intensity ranging from 0 to (2^k − 1), where k is the number of bits used to represent each pixel in an uncompressed image.

We need to find the entropy of this image; it gives the minimum average number of bits per pixel needed to represent the image after (lossless) compression.
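As a small illustration (mine, not from the slides), the entropy of an image can be estimated from its gray-level histogram; in Python:

import math
from collections import Counter

def entropy(pixels):
    """H(S) = sum over i of p_i * log2(1/p_i), estimated from a histogram."""
    counts = Counter(pixels)
    n = len(pixels)
    return sum((c / n) * math.log2(n / c) for c in counts.values())

# hypothetical 2-bit image with 8 pixels, intensities 0..3
print(entropy([0, 0, 1, 1, 1, 2, 3, 3]))   # about 1.91 bits/pixel

A uniform histogram would give the maximum, k bits per pixel; a completely flat image would give 0.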


Run-Length Coding

RLC is one of the simplest forms of data compression. The basic idea is that if the information source has the property that symbols tend to form continuous groups, then each such symbol and the length of its group can be coded together.

Consider a screen containing plain black text on a solid white background.

There will be many long runs of white pixels in the blank space, and many short runs of black pixels within the text. Let us take a hypothetical single scan line, with B representing a black pixel and W representing white: WWWWWBWWWWBBBWWWWWWBWWW

If we apply the run-length encoding (RLE) data compression algorithm to the above hypothetical scan line, we get the following: 5W1B4W3B6W1B3W

The run-length code represents the original 23 characters in only 14.
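A minimal Python sketch of this count-plus-symbol scheme (an illustration; the exact output format is our own choice):

from itertools import groupby

def rle_encode(s):
    """Emit <run length><symbol> for each run, e.g. WWWWW -> 5W."""
    return "".join(f"{len(list(g))}{ch}" for ch, g in groupby(s))

print(rle_encode("WWWWWBWWWWBBBWWWWWWBWWW"))   # 5W1B4W3B6W1B3W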


Variable-Length Coding

Variable-length coding (VLC) is one of the best-known entropy coding methods.

Here, we will study the Shannon–Fano algorithm, Huffman coding, and adaptive Huffman coding.


Shannon–Fano Algorithm

To illustrate the algorithm, let us suppose the symbols to be coded are the characters in the word HELLO.

The frequency count of the symbols is

Symbol   H   E   L   O
Count    1   1   2   1

The encoding steps of the Shannon–Fano algorithm can be presented in the following top-down manner:
1. Sort the symbols according to the frequency count of their occurrences.
2. Recursively divide the symbols into two parts, each with approximately the same total count, until all parts contain only one symbol.


A natural way of implementing the above procedure is to build a binary tree.

As a convention, let us assign bit 0 to its left branches and 1 to the right branches.

Initially, the symbols are sorted as LHEO. As Fig. 7.3 shows, the first division yields two parts: L with a count of 2, denoted as L:(2); and H, E and O with a total count of 3, denoted as H,E,O:(3). The second division yields H:(1) and E,O:(2). The last division is E:(1) and O:(1).
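A compact recursive Python sketch of this top-down procedure (my illustration; the split rule follows the "approximately equal counts" criterion, so tie handling may differ between implementations):

def shannon_fano(symbols):
    """symbols: (symbol, count) pairs sorted by descending count.
    Returns a dict mapping each symbol to its code string."""
    codes = {}

    def split(group, prefix):
        if len(group) == 1:
            codes[group[0][0]] = prefix or "0"
            return
        total = sum(c for _, c in group)
        acc, i = 0, 0
        # advance while the left part stays at or below half the total count
        while i < len(group) - 1 and acc + group[i][1] <= total / 2:
            acc += group[i][1]
            i += 1
        i = max(i, 1)               # left part must be non-empty
        split(group[:i], prefix + "0")
        split(group[i:], prefix + "1")

    split(symbols, "")
    return codes

print(shannon_fano([("L", 2), ("H", 1), ("E", 1), ("O", 1)]))
# {'L': '0', 'H': '10', 'E': '110', 'O': '111'} -- matches Fig. 7.3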


Fig. 7.3: Coding Tree for HELLO by Shannon-Fano.


Table 7.1: Result of Performing Shannon-Fano on HELLO

Symbol   Count   log2(1/pi)   Code   # of bits used
L        2       1.32         0      2
H        1       2.32         10     2
E        1       2.32         110    3
O        1       2.32         111    3

TOTAL # of bits: 10


Fig. 7.4: Another coding tree for HELLO by Shannon-Fano.


Another Result of Performing Shannon-Fano on HELLO (see Fig. 7.4)

Symbol   Count   log2(1/pi)   Code   # of bits used
L        2       1.32         00     4
H        1       2.32         01     2
E        1       2.32         10     2
O        1       2.32         11     2

TOTAL # of bits: 10

Shannon–Fano Algorithm: Analysis

The Shannon–Fano algorithm delivers satisfactory coding results for data compression, but it was soon outperformed by the Huffman coding method.

The Huffman algorithm requires prior statistical knowledge about the information source, and such information is often not available.

This is particularly true in multimedia applications, where future data is unknown before its arrival, as for example in live (or streaming) audio and video.

Even when the statistics are available, the transmission of the symbol table could represent a heavy overhead.

The solution is to use adaptive Huffman coding compression algorithms, in which statistics are gathered and updated dynamically as the data stream arrives.


LOSSLESS COMPRESSION

In lossless data compression, the integrity of the data is preserved. The original data and the data after compression and decompression are exactly the same because, in these methods, the compression and decompression algorithms are exact inverses of each other: no part of the data is lost in the process. Redundant data is removed in compression and added during decompression. Lossless compression methods are normally used when we cannot afford to lose any data.


Run-length encoding

Run-length encoding is probably the simplest method of compression. It can be used to compress data made of any combination of symbols. It does not need to know the frequency of occurrence of symbols, and it can be very efficient if the data is represented as 0s and 1s.

The general idea behind this method is to replace consecutive repeating occurrences of a symbol by one occurrence of the symbol followed by the number of occurrences.

The method can be even more efficient if the data uses only two symbols (for example 0 and 1) in its bit pattern and one symbol is more frequent than the other.
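The slide does not spell the format out, but one common scheme (sketched here as an assumption) transmits only the lengths of the runs of the more frequent symbol, say 0, as fixed-width binary counts:

def encode_zero_runs(bits, width=4):
    """Fixed-width counts of the 0-runs between 1s; assumes no run
    exceeds 2**width - 1 (longer runs would have to be split)."""
    counts, run = [], 0
    for b in bits:
        if b == "0":
            run += 1
        else:               # a 1 terminates the current run of 0s
            counts.append(run)
            run = 0
    counts.append(run)      # the trailing run
    return "".join(format(c, f"0{width}b") for c in counts)

data = "0" * 14 + "1" + "0" * 13 + "11"      # 30 bits
print(encode_zero_runs(data))                # 1110110100000000 -- 16 bits

The decoder emits each count's worth of 0s followed by a 1 (with no 1 after the last count), recovering the original 30 bits from 16.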


Run-length encoding example


Run-length encoding for two symbols


Huffman coding

Huffman coding assigns shorter codes to symbols that occur more frequently and longer codes to those that occur less frequently. For example, imagine we have a text file that uses only five characters (A, B, C, D, E). Before we can assign bit patterns to each character, we assign each character a weight based on its frequency of use. In this example, assume that the frequency of the characters is as shown in Table 15.1.
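A bottom-up Python sketch using a min-heap; since Table 15.1 itself is not reproduced in this transcript, the weights below are assumed for illustration:

import heapq

def huffman_codes(freqs):
    """Build a Huffman code from {symbol: weight}; returns {symbol: code}.
    The integer counter breaks ties so heap entries stay comparable."""
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tick = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)   # two lightest trees
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (w1 + w2, tick, merged))
        tick += 1
    return heap[0][2]

# assumed weights for the five characters
print(huffman_codes({"A": 17, "B": 12, "C": 12, "D": 27, "E": 32}))
# {'A': '00', 'B': '010', 'C': '011', 'D': '10', 'E': '11'}

Note how the heavier symbols D and E receive 2-bit codes while the lighter B and C receive 3 bits.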

Huffman coding (figure).


A character’s code is found by starting at the root and following the branches that lead to that character. The code itself is the bit value of each branch on the path, taken in sequence.

Final tree and code


Encoding

Let us see how to encode text using the code for our five characters. Figure 15.6 shows the original and the encoded text.

Huffman encoding


Decoding

The recipient has a very easy job in decoding the data it receives; the figure below shows how decoding takes place.

Huffman decoding
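A sketch of that walk in Python (assuming the code table produced by the sketch above; because Huffman codes are prefix-free, each match is unambiguous):

def huffman_decode(bits, codes):
    """Accumulate bits until they match a code, then emit its symbol."""
    inverse = {code: sym for sym, code in codes.items()}
    out, cur = [], ""
    for b in bits:
        cur += b
        if cur in inverse:
            out.append(inverse[cur])
            cur = ""
    return "".join(out)

codes = {"A": "00", "B": "010", "C": "011", "D": "10", "E": "11"}
print(huffman_decode("0101110", codes))   # BED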


Adaptive Huffman Coding

Extended Huffman coding (covered in the book) groups symbols together before coding.

Adaptive Huffman: statistics are gathered and updated dynamically as the data stream arrives

ENCODER:
Initial_code();
while not EOF {
    get(c);
    encode(c);
    update_tree(c);
}

DECODER:
Initial_code();
while not EOF {
    decode(c);
    output(c);
    update_tree(c);
}


Adaptive Coding

Motivations: the previous algorithms (both Shannon-Fano and Huffman) require statistical knowledge that is often not available (e.g., live audio or video). Even when it is available, it can be a heavy overhead.

Higher-order models incur still more overhead. For example, an order-0 model over bytes requires one 256-entry probability table, while an order-1 model requires 256 such tables (an order-1 model considers the probabilities of occurrences of pairs of symbols).

The solution is to use adaptive algorithms. Adaptive Huffman coding is one such mechanism, and the one we will study; the idea of "adaptiveness" is, however, applicable to other compression algorithms as well.


Adaptive Coding

ENCODER:
Initialize_model();
do {
    c = getc(input);
    encode(c, output);
    update_model(c);
} while (c != eof);

DECODER:
Initialize_model();
while ((c = decode(input)) != eof) {
    putc(c, output);
    update_model(c);
}

The key is that, both encoder and decoder use exactly the same initialize_model and update_model routines.
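To make that symmetry concrete, here is a deliberately naive Python sketch (my illustration, not the tree-update algorithm described next): it simply rebuilds a full Huffman table after every symbol, reusing huffman_codes from the Huffman sketch above. Because both sides start from the same model and apply the same update, their code tables never diverge.

from collections import Counter

ALPHABET = "ABCDE"   # assumed toy alphabet

def make_model():
    # every symbol starts at count 1, so every symbol has a code
    return Counter({s: 1 for s in ALPHABET})

def adaptive_encode(msg):
    model, out = make_model(), []
    for c in msg:
        out.append(huffman_codes(dict(model))[c])   # code under current model
        model[c] += 1                               # update_model(c)
    return "".join(out)

def adaptive_decode(bits):
    model, out, cur = make_model(), [], ""
    inverse = {v: k for k, v in huffman_codes(dict(model)).items()}
    for b in bits:
        cur += b
        if cur in inverse:
            sym = inverse[cur]
            out.append(sym)
            model[sym] += 1                         # the very same update
            inverse = {v: k for k, v in huffman_codes(dict(model)).items()}
            cur = ""
    return "".join(out)

bits = adaptive_encode("AAB")
print(bits, adaptive_decode(bits))   # 11010100 AAB -- A's code shrinks as its count grows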


The Sibling Property

The node numbers are assigned in such a way that:
1. A node with a higher weight will have a higher node number.
2. A parent node will always have a higher node number than its children.

In a nutshell, the sibling property requires that the nodes (internal and leaf) be arranged in order of increasing weights. The update procedure swaps nodes that are in violation of the sibling property. Such nodes are identified using the notion of a block: all nodes that have the same weight are said to belong to one block.


Flowchart of the update procedure

(Flowchart, summarized.) START: if this is the first appearance of the symbol, the NYT node gives birth to a new NYT node and a new external node for the symbol; the weights of the new external node and the old NYT node are incremented, node numbers are adjusted, and the procedure moves to the old NYT node. Otherwise, go to the symbol's external node. Then, repeatedly: if the node's number is not the maximum in its block, switch the node with the highest-numbered node in the block; increment the node's weight; if the node is the root, STOP, otherwise go to the parent node and repeat.

The Huffman tree is initialized with a single node, known as the Not-Yet-Transmitted (NYT) or escape code. This code is sent every time a new character, one not yet in the tree, is encountered, followed by the ASCII encoding of the character. This allows the decompressor to distinguish between a code and a new character. The procedure also creates a new external node for the character and a new NYT node from the old NYT node.

The root node will have the highest node number because it has the highest weight.


Example

Initial Huffman tree (figure): a single NYT node, numbered #0.

Example Huffman tree after some symbols have been processed, in accordance with the sibling property (figure). Counts (number of occurrences): B:2, C:2, D:2, E:10. Nodes in increasing order of number and weight: NYT (#0), B (W=2, #1), C (W=2, #2), D (W=2, #3), internal W=2 (#4), internal W=4 (#5), internal W=6 (#6), E (W=10, #7), root (W=16, #8).


A Huffman tree after the first appearance of symbol A (figure): the old NYT gives birth to a new NYT (#0) and the external node A (W=1, #1) under a new parent (W=1, #2); node numbers are adjusted, giving B (W=2, #3), C (W=2, #4), D (W=2, #5), and the weights on the path to the root become W=2+1 (#6), W=4 (#7), W=6+1 (#8), E (W=10, #9), root W=16+1 (#10). Counts: A:1, B:2, C:2, D:2, E:10.


Increment

An increment in the count for A propagates up to the root (figure): A (W=1+1, #1) and its parent (W=1+1, #2), then W=3+1 (#6), W=7+1 (#8), and root W=17+1 (#10). Counts: A:1+1, B:2, C:2, D:2, E:10.


Swapping

Another increment in the count for A results in a swap (figure): when A's count reaches 2+1, node #1 (A, W=2+1) violates the sibling property, so nodes 1 and 5 are swapped; A moves to #5 and D (W=2) to #1. The increment then propagates through W=4+1 (#7) and W=8+1 (#8) to the root, W=18+1 (#10). Counts: A:3, B:2, C:2, D:2, E:10.


Swapping … contd.

Another increment in the count for A propagates up (figure): A (W=3+1, #5), then W=5+1 (#7), W=9+1 (#8), and root W=19+1 (#10). Counts: A:3+1, B:2, C:2, D:2, E:10.


Swapping … contd.

Another increment in the count for A causes a swap of a sub-tree (figure): A (W=4, #5) ties the internal node W=4 (#6), which has a higher number, so nodes 5 and 6 are swapped and A moves to #6. Counts: A:4+1, B:2, C:2, D:2, E:10.


Swapping … contd.

Further swapping is needed to fix the tree (figure): the weight above A reaches W=10 at node #8, tying E (W=10, #9), so nodes 8 and 9 are swapped. Counts: A:4+1, B:2, C:2, D:2, E:10.


Swapping … contd.

Final tree after the swaps (figure): E (W=10) now sits at #8, the sub-tree containing A (W=5, #6) is rooted at W=10+1 (#9), and the root is W=20+1 (#10). Counts: A:5, B:2, C:2, D:2, E:10.


Arithmetic Coding

Arithmetic coding is based on the concept of interval subdivision.

In arithmetic coding, a source ensemble is represented by an interval between 0 and 1 on the real number line. Each symbol of the ensemble narrows this interval; as the interval becomes smaller, the number of bits needed to specify it grows. Arithmetic coding assumes an explicit probabilistic model of the source. It uses the probabilities of the source messages to successively narrow the interval used to represent the ensemble.

• A high-probability message narrows the interval less than a low-probability message, so high-probability messages contribute fewer bits to the coded ensemble.


Arithmetic Coding: Description

In the following discussion, we will use M as the size of the alphabet of the data source, N[x] as symbol x's probability, and Q[x] as symbol x's cumulative probability (i.e., Q[i] = N[0] + N[1] + ... + N[i]).

Assuming we know the probabilities of each symbol of the data source, we can allocate to each symbol an interval with width proportional to its probability, such that the intervals do not overlap. This can be done by using the cumulative probabilities as the two ends of each interval: the interval for symbol x runs from Q[x-1] to Q[x], and symbol x is said to own the range [Q[x-1], Q[x]).


Arithmetic Coding: Encoder

We begin with the interval [0,1) and subdivide the interval iteratively.

For each symbol entered, the current interval is divided according to the probabilities of the alphabet.

The interval corresponding to the symbol that occurs is picked as the interval to continue with.

The procedure continues until all symbols in the message have been processed.

Since each symbol's interval does not overlap with others, for each possible message there is a unique interval assigned.

We can represent the message with the interval's two ends [L, H). In fact, taking any single value in the interval as the encoded code is enough, and usually the lower end L is selected.


Arithmetic Coding Algorithm

L = 0.0; H = 1.0;
while ((x = getc(input)) != EOF) {
    R = H - L;
    H = L + R * Q[x];
    L = L + R * Q[x-1];
}
output(L);

R is the interval range, and H and L are the two ends of the current code interval. x is the new symbol to be encoded.

L and H are initialized to 0 and 1, respectively.
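A Python sketch of the encoder, using the probability table from the example on the next slide (exact fractions are my choice, to avoid floating-point trouble at interval boundaries; production coders use scaled integer arithmetic instead):

from fractions import Fraction as F

# cumulative bounds [Q[x-1], Q[x]) for the example alphabet
RANGES = {"A": (F(0), F(4, 10)), "B": (F(4, 10), F(7, 10)),
          "C": (F(7, 10), F(9, 10)), "D": (F(9, 10), F(1))}

def arith_encode(msg):
    """Narrow [L, H) once per symbol; return the lower end as the code."""
    low, high = F(0), F(1)
    for x in msg:
        q_lo, q_hi = RANGES[x]
        r = high - low
        low, high = low + r * q_lo, low + r * q_hi
    return low

print(float(arith_encode("BCAB")))   # 0.6196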


Arithmetic Coding: Encoder example

Symbol, x   Probability, N[x]   [Q[x-1], Q[x])
A           0.4                 [0.0, 0.4)
B           0.3                 [0.4, 0.7)
C           0.2                 [0.7, 0.9)
D           0.1                 [0.9, 1.0)

Encoding the string BCAB (figure, summarized): starting from [0, 1), symbol B narrows the interval to [0.4, 0.7); C narrows it to [0.61, 0.67); A narrows it to [0.61, 0.634); the final B narrows it to [0.6196, 0.6268).

String: BCAB
Code sent: 0.6196


Decoding Algorithm

When decoding, the code v is located within the current code interval to find the symbol x such that Q[x-1] <= v < Q[x]. The procedure iterates until all symbols are decoded.

v = input_code();
for (;;) {
    x = find_symbol_straddling_this_range(v);
    putc(x);
    R = Q[x] - Q[x-1];
    v = (v - Q[x-1]) / R;
}

v        Output x   Q[x-1]   Q[x]   R
0.6196   B          0.4      0.7    0.3
0.732    C          0.7      0.9    0.2
0.16     A          0.0      0.4    0.4
0.4      B          0.4      0.7    0.3
0.0      (final value of v; all symbols decoded)
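A matching decoder sketch (reusing RANGES and arith_encode from the encoder sketch above). Note that it must be told the message length n; that is exactly the EOF problem discussed on the next slide:

def arith_decode(code, n):
    """Recover n symbols by repeatedly locating v and renormalizing."""
    v, out = code, []
    for _ in range(n):
        for sym, (q_lo, q_hi) in RANGES.items():
            if q_lo <= v < q_hi:                # the symbol straddling v
                out.append(sym)
                v = (v - q_lo) / (q_hi - q_lo)  # v = (v - Q[x-1]) / R
                break
    return "".join(out)

print(arith_decode(arith_encode("BCAB"), 4))    # BCAB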


Arithmetic Coding: Issues

The zero-frequency problem: each symbol's predicted probability must not be zero, or its interval will have zero width and interval renormalization will fail. Models that adapt online may run into this problem when their counts decay.

The EOF problem: assume we pick the lower end of the interval as the encoded code. Two messages may then yield the same code if one message is identical to the other except for a trailing run of the table's first symbol (the symbol whose interval starts at 0) appended as a suffix.

• For example, BCAB, BCABA, BCABAA and BCABAAA all have the same lower interval end but different upper ends (try it).

The simplest solution is to let the decoder know the length of the encoded message, either because the message size is fixed or because it is transmitted first. However, this is not feasible if the data size is not known beforehand, as with live broadcast data, or when transmitting it first is too costly, as with tapes whose size is unknown at the beginning. Another solution is to introduce a special EOF symbol into the alphabet. The symbol takes a small interval and is used only at the end of the message; when the decoder detects the EOF symbol, it knows the end of the message has been reached.


Dictionary-Based Compression

The compression algorithms we have studied so far use a statistical model to encode single symbols: compression means encoding symbols into bit strings that use fewer bits.

Dictionary-based algorithms do not encode single symbols as variable-length bit strings; they encode variable-length strings of symbols as single tokens. The tokens form an index into a phrase dictionary, and if the tokens are smaller than the phrases they replace, compression occurs.

Dictionary-based compression is easy to understand because it uses a strategy programmers are familiar with: using indexes into databases to retrieve information from large amounts of storage, as with telephone numbers or postal codes.


Dictionary-Based Compression: Example

Consider the Random House Dictionary of the English Language, Second edition, Unabridged. Using this dictionary, the string:

A good example of how dictionary based compression works

can be coded as: 1/1 822/3 674/4 1343/60 928/75 550/32 173/46 421/2

Coding: uses the dictionary as a simple lookup table. Each word is coded as x/y, where x gives the page in the dictionary and y gives the number of the word on that page. The dictionary has 2,200 pages with fewer than 256 entries per page, so x requires 12 bits and y requires 8 bits, i.e., 20 bits (2.5 bytes) per word. Using ASCII coding, the string above requires 56 bytes (48 letters plus 8 spaces), whereas this encoding requires only 20 bytes (8 words at 2.5 bytes each): well over 50% compression.


Adaptive Dictionary-based Compression

Build the dictionary adaptively. This is necessary when the source data is not plain text (say, audio or video data), and the resulting dictionary is better tailored to the specific source.

The original methods are due to Ziv and Lempel, in 1977 (LZ77) and 1978 (LZ78); Terry Welch improved the scheme in 1984 (LZW compression), which is used in UNIX compress and in GIF. LZ77 is a sliding-window technique in which the dictionary consists of a set of fixed-length phrases found in a window into the previously processed text. LZ78, instead of using fixed-length phrases from a window into the text, builds phrases up one symbol at a time, adding a new symbol to an existing phrase when a match occurs.


LZW Algorithm

Preliminaries: a dictionary that is indexed by "codes" is used. The dictionary is assumed to be initialized with 256 entries (indexed with ASCII codes 0 through 255) representing the ASCII table.

The compression algorithm assumes that the output is either a file or a communication channel, and that the input is a file or buffer. Conversely, the decompression algorithm assumes that the input is a file or a communication channel and the output is a file or a buffer.

file/buffer -> Compression -> compressed file / communication channel -> Decompression -> file/buffer


LZW Algorithm

LZW Compression:

set w = NIL
loop
    read a character k
    if wk exists in the dictionary
        w = wk
    else
        output the code for w
        add wk to the dictionary
        w = k
endloop

The program reads one character at a time. If the extended work string wk is in the dictionary, it becomes the new work string and the program waits for the next character (this happens on the first character as well, since w starts empty). If wk is not in the dictionary, as when the second character of a new phrase comes along, the program adds wk to the dictionary and sends over the wire (or writes to the file) the code assigned to w, the work string without the new character; it then sets the work string to the new character.
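The same procedure as runnable Python (a sketch; real implementations also pack the codes into fixed-width tokens, as discussed below):

def lzw_compress(data):
    """Grow the phrase dictionary while emitting codes for known phrases."""
    dictionary = {chr(i): i for i in range(256)}   # preloaded ASCII table
    next_code, w, out = 256, "", []
    for k in data:
        wk = w + k
        if wk in dictionary:
            w = wk                       # keep extending the work string
        else:
            out.append(dictionary[w])    # emit the code for w, without k
            dictionary[wk] = next_code   # add the new phrase
            next_code += 1
            w = k
    if w:
        out.append(dictionary[w])        # flush the final work string
    return out

print(lzw_compress("^WED^WE^WEE^WEB^WET"))
# [94, 87, 69, 68, 256, 69, 260, 261, 257, 66, 260, 84]
# i.e. ^ W E D <256> E <260> <261> <257> B <260> T, as in the trace below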


Input String: ^WED^WE^WEE^WEB^WET

w     k     Output   Index   Symbol
NIL   ^     -        -       -
^     W     ^        256     ^W
W     E     W        257     WE
E     D     E        258     ED
D     ^     D        259     D^
^     W     -        -       -
^W    E     256      260     ^WE
E     ^     E        261     E^
^     W     -        -       -
^W    E     -        -       -
^WE   E     260      262     ^WEE
E     ^     -        -       -
E^    W     261      263     E^W
W     E     -        -       -
WE    B     257      264     WEB
B     ^     B        265     B^
^     W     -        -       -
^W    E     -        -       -
^WE   T     260      266     ^WET
T     EOF   T        -       -



LZW Algorithm

LZW Decompression:

read a fixed-length token k (code or char)
output k
w = k
loop
    read a fixed-length token k
    entry = dictionary entry for k
    output entry
    add w + first char of entry to the dictionary
    w = entry
endloop

The nice thing is that the decompressor builds its own dictionary on its side, matching the compressor's exactly, so that only the codes need to be sent.
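A matching Python sketch. One caveat the slide's pseudocode glosses over: a code can arrive that was only just added on the compressor's side (the so-called KwKwK case), which the elif branch below handles:

def lzw_decompress(codes):
    """Rebuild the compressor's dictionary on the fly; only codes travel."""
    dictionary = {i: chr(i) for i in range(256)}
    next_code = 256
    w = dictionary[codes[0]]
    out = [w]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        elif code == next_code:
            entry = w + w[0]             # KwKwK: code not yet in our table
        else:
            raise ValueError("corrupt LZW stream")
        dictionary[next_code] = w + entry[0]   # mirror the compressor's add
        next_code += 1
        out.append(entry)
        w = entry
    return "".join(out)

print(lzw_decompress([94, 87, 69, 68, 256, 69, 260, 261, 257, 66, 260, 84]))
# ^WED^WE^WEE^WEB^WET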


Example of LZW decoding. Input string (to decode): ^WED<256>E<260><261><257>B<260>T

w     k       Output   Index   Symbol
-     ^       ^        -       -
^     W       W        256     ^W
W     E       E        257     WE
E     D       D        258     ED
D     <256>   ^W       259     D^
^W    E       E        260     ^WE
E     <260>   ^WE      261     E^
^WE   <261>   E^       262     ^WEE
E^    <257>   WE       263     E^W
WE    B       B        264     WEB
B     <260>   ^WE      265     B^
^WE   T       T        266     ^WET



LZW Algorithm - Discussion

Token format (figure, summarized): every token is 9 bits; a token starting with 0 carries an ASCII character (values 0 to 255), and a token starting with 1 carries a code (values 256 to 511).

Where is the compression? Original string to encode: ^WED^WE^WEE^WEB^WET. Encoded string: ^WED<256>E<260><261><257>B<260>T. Plain ASCII coding of the string: 19 * 8 bits = 152 bits. LZW coding of the string: 12 * 9 bits = 108 bits (7 characters and 5 codes, each of 9 bits).

Why 9 bits? An ASCII character has a value ranging from 0 to 255; all tokens have a fixed length; and there has to be a distinction in representation between an ASCII character and a code (assigned to strings of length 2 or more). Codes can therefore only have values 256 and above.


LZW Algorithm – Discussion (continued)

With 9 bits we can only have a maximum of 256 codes for strings of length 2 or more (the first 256 values being reserved for ASCII characters).

The original LZW uses a dictionary with 4K entries, each symbol/code being 12 bits long: values 0 to 255 are the ASCII characters and values 256 to 4095 are the codes (figure, summarized). With 12 bits, we can have a maximum of 2^12 - 256 = 3840 codes.


Practical implementations of the LZW algorithm follow one of two approaches.

Flush the dictionary periodically: no wasted codes.

Grow the length of the codes as the algorithm proceeds: first start with a length of 9 bits for the codes; once we run out of codes, increase the length to 10 bits; when we run out of 10-bit codes, increase the code length to 11 bits; and so on. This is more efficient.

Token layouts as the code width grows (figure, summarized):
9 bits:  0 -> ASCII; 1 -> codes 256-511
10 bits: 00 -> ASCII; 01 -> codes 256-511; 10 -> codes 512-767; 11 -> codes 768-1023
11 bits: 000 -> ASCII; 001 -> codes 256-511; 010 -> codes 512-767; 011 -> codes 768-1023; 100 -> codes 1024-1279; 101 -> codes 1280-1535; 110 -> codes 1536-1791; 111 -> codes 1792-2047