Data Compression Complete


DATA COMPRESSION

The word data is in general used to mean the information in digital form on which computer programs operate, and compression means a process of removing redundancy in the data. By 'compressing data', we actually mean deriving techniques or, more specifically, designing efficient algorithms to:

• represent data in a less redundant fashion
• remove the redundancy in data
• implement compression algorithms, including both compression and decompression.

Data compression means encoding the information in a file in such a way that it takes less space. Compression is used just about everywhere. All the images you get on the web are compressed, typically in the JPEG or GIF formats, most modems use compression, HDTV is compressed using MPEG-2, and several file systems automatically compress files when stored, and the rest of us do it by hand. The task of compression consists of two components: an encoding algorithm that takes a message and generates a "compressed" representation (hopefully with fewer bits), and a decoding algorithm that reconstructs the original message, or some approximation of it, from the compressed representation.

Compression denotes compact representation of data.

Examples of the kinds of data we typically want to compress include:

• text
• source code
• arbitrary files
• images
• video
• audio data
• speech

Why do we need compression?

Compression technology is employed to use storage space efficiently, and to save on transmission capacity and transmission time. Basically, it is all about saving resources and money. Despite the overwhelming advances in storage media and transmission networks, it is actually quite a surprise that compression technology is still required. One important reason is that the resolution and amount of digital data have also increased (e.g. HDTV resolution, ever-increasing sensor sizes in consumer cameras), and that there are still application areas where resources are limited, e.g. wireless networks. Apart from the aim of simply reducing the amount of data, standards like MPEG-4, MPEG-7, and MPEG-21 offer additional functionalities.

Why is it possible to compress data?

Compression-enabling properties are:

• Statistical redundancy: in non-compressed data, all symbols are represented with the same number of bits, independent of their relative frequency (fixed-length representation).
• Correlation: adjacent data samples tend to be equal or similar (e.g. think of images or video data). There are different types of correlation:


• Spatial correlation
• Spectral correlation
• Temporal correlation

In addition, in many data types there is a significant amount of irrelevancy, since the human brain is not able to process and/or perceive the entire amount of data. As a consequence, such data can be omitted without degrading perception. Furthermore, some data contain more abstract properties which are independent of time, location, and resolution and can be described very efficiently (e.g. fractal properties).

Compression techniques are broadly classified into two categories:

Lossless Compression

A compression approach is lossless only if it is possible to exactly reconstruct the original data from the compressed version. There is no loss of any information during the compression process. For example, an input string AABBBA is reconstructed exactly after the execution of the compression algorithm followed by the decompression algorithm.

Lossless compression is called reversible compression, since the original data may be recovered perfectly by decompression.

Lossless compression techniques are used when the original data of a source are so important that we cannot afford to lose any details. Examples of such source data are medical images, text and images preserved for legal reasons, some computer executable files, etc.

In lossless compression (as the name suggests) data are reconstructed after compression without errors, i.e. no information is lost. Typical application domains where you do not want to lose information are the compression of text, files and faxes. In the case of image data, for medical imaging or the compression of maps in the context of a land registry, no information loss can be tolerated. A further reason to stick to lossless coding schemes instead of lossy ones is their lower computational demand. Lossless compression is typically a process with three stages:


• The model: the data to be compressed is analyzed with respect to its structure and the relative frequency of the occurring symbols.
• The encoder: produces a compressed bitstream / file using the information provided by the model.
• The adaptor: uses information extracted from the data (usually during encoding) in order to adapt the model (more or less) continuously to the data.

The most fundamental idea in lossless compression is to employ codewords which are shorter (in terms of their binary representation) than their corresponding symbols when the symbols occur frequently. On the other hand, codewords are longer than the corresponding symbols when the latter do not occur frequently.

Lossy Data Compression

A compression method is lossy if it is not possible to reconstruct the original exactly from the compressed version. There are some insignificant details that may get lost during the process of compression. The word insignificant here implies certain requirements on the quality of the reconstructed data. A typical example is a long decimal number that becomes a shorter approximation after the compression-decompression process.

Lossy compression is called irreversible compression, since it is impossible to recover the original data exactly by decompression. Approximate reconstruction may be desirable, since it may lead to more effective compression. However, it often requires a good balance between the visual quality and the computational complexity.

Data such as multimedia images, video and audio are more easily compressed by lossy compression techniques, because of the way human visual and hearing systems work.


Lossless data compression algorithms will always fail to compress some files; indeed, any compression algorithm will necessarily fail to compress any data containing no discernible patterns. Attempts to compress data that has already been compressed will therefore usually result in an expansion, as will attempts to compress encrypted data.

Measures of Performance

• Codeword: a binary string representing either the whole coded data or one coded data symbol.
• Coded bitstream: the binary string representing the whole coded data.
• Lossless compression: 100% accurate reconstruction of the original data.
• Lossy compression: the reconstruction involves errors which may or may not be tolerable.
• Bit rate: the average number of bits per original data element after compression.

Variable-Length Codes

Variable-length codes are desirable for data compression because overall savings may be achieved by assigning short codewords to frequently occurring symbols and long codewords to rarely occurring ones.

For example, consider a variable-length code (0, 100, 101, 110, 111) with codeword lengths (1, 3, 3, 3, 3) for the alphabet (A, B, C, D, E), and a source string AAAAAABC with frequencies (6, 1, 1, 0, 0) for the symbols. The average number of bits required is

(6×1 + 1×3 + 1×3) / 8 = 12/8 = 1.5 bits/symbol

This is almost a saving of half the number of bits compared to 3 bits/symbol using a 3-bit fixed-length code.

The shorter the codewords, the shorter the total length of a source file. Hence the code would be a better one from the compression point of view.
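This arithmetic is easy to check mechanically. A minimal Python sketch, using the codeword lengths and source string from the example:

lengths = {"A": 1, "B": 3, "C": 3, "D": 3, "E": 3}   # code (0, 100, 101, 110, 111)
source = "AAAAAABC"

total_bits = sum(lengths[s] for s in source)         # 6*1 + 1*3 + 1*3 = 12
print(total_bits / len(source))                      # 1.5 bits/symbol, vs 3.0 fixed-length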


Unique Decodability

Variable-length codes are useful for data compression. However, a variable-length code would be useless if the codewords could not be identified in a unique way from the encoded message.

Example: Consider the variable-length code (0, 10, 010, 101) for the alphabet (A, B, C, D). A segment of encoded message such as "0100101010" can be decoded in more than one way. For example, "0100101010" can be interpreted in at least two ways: as "0 10 010 101 0", i.e. ABCDA, or as "010 0 101 010", i.e. CADC.

A code is uniquely decodable if there is only one possible way to decode encoded messages. The code (0, 10, 010, 101) in the example above is not uniquely decodable and therefore cannot be used for data compression.

Prefix Codes and Binary Trees

Codes with the self-punctuating property do exist. A type of so-called prefix code can be identified by checking its so-called prefix-free property, or prefix property for short.

A prefix is the first few consecutive bits of a codeword. When two codewords are of different lengths, it is possible that the shorter codeword is identical to the first few bits of the longer codeword. In this case, the shorter codeword is said to be a prefix of the longer one.

Example: Consider two binary codewords of different lengths: C1 = 010 (3 bits) and C2 = 01011 (5 bits). The shorter codeword C1 is a prefix of the longer codeword C2, as C2 = 010 11. Codeword C2 can be obtained by appending two more bits, 11, to C1.

The prefix property of a binary code is the fact that no codeword is a prefix of another.

Prefix Codes and Unique Decodability

Prefix codes are a subset of the uniquely decodable codes: all prefix codes are uniquely decodable. If a code is a prefix code, the code is then uniquely decodable.


However, if a code is not a prefix code, we cannot conclude that the code is not uniquely decodable, because other types of code may also be uniquely decodable.

Example: Consider the code (0, 01, 011, 0111) for (A, B, C, D). This is not a prefix code, as the first codeword 0 is a prefix of the others. However, given an encoded message such as 01011010111, there is no ambiguity and only one way to decode it: 01 011 01 0111, i.e. BCBD. Each 0 offers a means of self-punctuation in this example. We only need to watch out for the 0 that begins each codeword, and for the 1 bit before any 0, which is the last bit of a codeword.

Some codes are uniquely decodable but require looking ahead during the decoding process. This makes them not as efficient as prefix codes.
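Since the prefix property is a purely syntactic test, it is easy to check in code. A small sketch:

from itertools import combinations

def is_prefix_code(codewords):
    # A code has the prefix property if no codeword is a prefix of another.
    return not any(a.startswith(b) or b.startswith(a)
                   for a, b in combinations(codewords, 2))

print(is_prefix_code(["0", "10", "010", "101"]))          # False, and not uniquely decodable
print(is_prefix_code(["0", "01", "011", "0111"]))         # False, yet uniquely decodable
print(is_prefix_code(["0", "100", "101", "110", "111"]))  # True, a prefix code

Note that the second code fails the test but is still uniquely decodable, which is exactly the point made above: the prefix property is sufficient, not necessary, for unique decodability.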

Static Huffman Coding

Huffman coding is a successful compression method used originally for text compression. In any text, some characters occur far more frequently than others. For example, in English text, the letters E, A, O, T are normally used much more frequently than J, Q, X.

Huffman's idea is, instead of using a fixed-length code such as 8-bit extended ASCII or EBCDIC for each symbol, to represent a frequently occurring character in a source with a shorter codeword, and to represent a less frequently occurring one with a longer codeword. Hence the total number of bits of this representation is significantly reduced for a source of symbols with different frequencies. The number of bits required per symbol is reduced on average.

Static Huffman coding assigns variable-length codes to symbols based on their frequency of occurrence in the given message. Low-frequency symbols are encoded using many bits, and high-frequency symbols are encoded using fewer bits.

The message to be transmitted is first analyzed to find the relative frequencies of its constituent characters.

The coding process generates a binary tree, the Huffman code tree, with branches labeled with bits (0 and 1).

The Huffman tree (or the character-codeword pairs) must be sent with the compressed information to enable the receiver to decode the message.

Static Huffman Coding Algorithm

Find the frequency of each character in the file to be compressed;


For each distinct character, create a one-node binary tree containing the character and its frequency as its priority;

Insert the one-node binary trees into a priority queue, in increasing order of frequency;

while (there is more than one tree in the priority queue) {
    dequeue two trees t1 and t2;
    create a tree t that contains t1 as its left subtree and t2 as its right subtree;   // 1
    priority(t) = priority(t1) + priority(t2);
    insert t in its proper location in the priority queue;                              // 2
}

Assign 0 and 1 weights to the edges of the resulting tree, such that the left and right edges of each node do not have the same weight;   // 3

Note: The Huffman code tree for a particular set of characters is not unique. (Steps 1, 2, and 3 may be done differently.)
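The algorithm above maps directly onto a heap-based priority queue. A minimal Python sketch (the tie-breaker counter is an added assumption to make comparisons deterministic; the algorithm itself does not prescribe how ties are broken):

import heapq
from collections import Counter

def huffman_codes(text):
    # One heap entry per distinct character: (frequency, tie_breaker, tree).
    # A tree is either a character (leaf) or a (left, right) pair.
    heap = [(f, i, ch) for i, (ch, f) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    n = len(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)          # dequeue the two smallest trees
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, n, (t1, t2)))
        n += 1
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")          # left edge labelled 0
            walk(tree[1], prefix + "1")          # right edge labelled 1
        else:
            codes[tree] = prefix or "0"          # lone-symbol edge case
    walk(heap[0][2], "")
    return codes

print(huffman_codes("AABBBA"))   # {'A': '0', 'B': '1'} here; trees are not unique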

"#ample: : 3nformation to be transmitted over the internet contains the following

characters with their associated frequencies:

Dse .u2man technique to answer the following questions:

uild the .u2man code tree for the message!

Dse the .u2man tree to nd the codeword for each character!

3f the data consists of only these characters, what is the total number

of bits to be transmittedE 1hat is the compression ratioE

"erify that your computed .u2man codewords satisfy the 7rex property!

Solution:

Sort the list of characters in increasing order of frequency!

t s o n l e a $haracter

%

3

22 1& '% 13 % '% )requenc

y


Now create the Huffman tree by repeatedly merging the two lowest-frequency trees.


Final Huffman Tree

Now assign codes to the edges of the tree: label each left edge 0 and each right edge 1.

The sequence of zeros and ones along the arcs on the path from the root to each leaf node gives the desired codes:

Character:          t    s     o    n     l    e    a
Huffman codeword:  11  101  1000  000  1001   01  001


If we assume the message consists of only the characters a, e, l, n, o, s, t, then the number of bits for the compressed message will be 696:

45×3 + 65×2 + 13×4 + 45×3 + 18×4 + 22×3 + 53×2 = 696

If the message is sent uncompressed with an 8-bit ASCII representation for the characters, we have 261×8 = 2088 bits.

Assuming that the number of character-codeword pairs and the pairs themselves are included at the beginning of the binary file containing the compressed message, in the following format:

Number of bits for the transmitted file = bits(number of pairs) + bits(characters) + bits(codewords) + bits(compressed message)


= 3 + (7×8) + 21 + 696 = 776

(3 bits for the pair count 7, 8 bits per character, 21 bits for the seven codewords, and 696 bits for the compressed message itself.)

Compression ratio = bits for ASCII representation / number of bits transmitted = 2088 / 776 ≈ 2.69

Thus, the size of the transmitted file is 100 / 2.69 ≈ 37% of the original ASCII file.
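These figures can be re-checked mechanically. A small sketch, using the frequencies and the codewords derived above:

freq = {"a": 45, "e": 65, "l": 13, "n": 45, "o": 18, "s": 22, "t": 53}
code = {"t": "11", "s": "101", "o": "1000", "n": "000",
        "l": "1001", "e": "01", "a": "001"}

compressed = sum(freq[c] * len(code[c]) for c in freq)       # 696 bits
ascii_bits = 8 * sum(freq.values())                          # 261 * 8 = 2088 bits
header = 3 + 7 * 8 + sum(len(w) for w in code.values())      # 3 + 56 + 21 = 80
print(compressed, ascii_bits, ascii_bits / (header + compressed))   # ~2.69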

The Prefix Property

Data encoded using Huffman coding is uniquely decodable. This is because Huffman codes satisfy an important property called the prefix property: in a given set of Huffman codewords, no codeword is a prefix of another Huffman codeword.

For example, in a given set of Huffman codewords, 10 and 101 cannot simultaneously be valid Huffman codewords, because the first is a prefix of the second.

We can see by inspection that the codewords we generated in the previous example are valid Huffman codewords.

To see why the prefix property is essential, consider the codewords given below, in which 'e' is encoded with 110, which is a prefix of 'f':

character   a    b    c    d    e    f
codeword    0   101  100  111  110  1100

The decoding of 11000100110 is ambiguous:

1100 0 100 110  => face
110 0 0 100 110 => eaace


Optimal Huffman Codes

Huffman codes are optimal when the probabilities of the source symbols are all negative powers of two (examples of negative powers of two are 1/2, 1/4, etc.).

The conclusion can be drawn from the following justification.

Suppose that the lengths of the Huffman codewords are L = (l1, l2, l3, ..., ln) for a source P = (p1, p2, ..., pn), where n is the size of the alphabet.

Using a variable-length code, with lj bits for the symbol of probability pj, the average length of the codewords is (in bits):

l(avg) = p1×l1 + p2×l2 + ... + pn×ln

A code is optimal if the average length of the codewords equals the entropy of the source. Let

H = −(p1×log2(p1) + p2×log2(p2) + ... + pn×log2(pn))

and notice what is required for the equality

p1×l1 + p2×l2 + ... + pn×ln = H


This equality holds if and only if lj = −log2(pj) for all j = 1, 2, ..., n. Since the length lj has to be an integer (in bits) for Huffman codes, −log2(pj) has to be an integer, too. Of course, −log2(pj) cannot be an integer unless pj is a negative power of 2, for all j = 1, 2, ..., n.

In other words, this can only happen in Huffman codes when all probabilities are negative powers of 2, for lj has to be an integer (in bits).

For example, for a source P = (1/2, 1/4, 1/8, 1/8), the Huffman code for the source can be optimal.
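A two-line check of this claim for the example source, assuming the Huffman code lengths (1, 2, 3, 3), e.g. the codewords 0, 10, 110, 111:

from math import log2

p = [1/2, 1/4, 1/8, 1/8]
l = [1, 2, 3, 3]                                   # Huffman code lengths for p

avg = sum(pi * li for pi, li in zip(p, l))         # 1.75 bits/symbol
entropy = -sum(pi * log2(pi) for pi in p)          # 1.75 bits/symbol
print(avg, entropy)                                # equal: the code is optimal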

Optimality of Huffman Coding

We show that the prefix code generated by the Huffman coding algorithm is optimal in the sense of minimizing the expected code length among all binary prefix codes for the input alphabet.

The leaf merge operation

A key ingredient in the proof involves constructing a new tree from an existing binary coding tree T by eliminating two sibling leaves a1 and a2, replacing them by their parent node, labeled with the sum of the probabilities of a1 and a2. We will denote the new tree obtained in this way merge(T, a1, a2). Likewise, if A is the alphabet of symbols for T, that is, the set of leaves of T, then we define a new alphabet A', denoted merge(A, a1, a2), as the alphabet consisting of all symbols of A other than a1 and a2, together with a new symbol a that represents a1 and a2 combined.

Two important observations related to these leaf merging operations follow.

1. Let T be the Huffman tree for an alphabet A, and let a1 and a2 be the two symbols of lowest probability in A. Let T' be the Huffman tree for the reduced alphabet A' = merge(A, a1, a2). Then T' = merge(T, a1, a2). In other words, the Huffman tree for the merged alphabet is the merge of the Huffman tree for the original alphabet. This is true simply by the definition of the Huffman procedure.

2. The expected code length for T exceeds that of merge(T, a1, a2) by precisely the sum p of the probabilities of the leaves a1 and a2. This is because, in the sum that defines the expected code length for the merged tree, the term d×p, where d is the depth of the parent a of these leaves, is replaced by two terms (corresponding to the leaves) which sum to (d + 1)×p.

Proof of optimality of Huffman coding

With the above comments in mind, we now give the formal optimality proof. We show by induction on the size, n, of the alphabet, A, that the Huffman coding algorithm returns a binary prefix code of lowest expected code length among all prefix codes for the input alphabet.

algorithm returns a binary prex code of lowest expected code length among all

prex codes for the input alphabet!

H (asis 3f n 5 4, the .u2man algorithm nds the optimal prex code, which

assigns $ to one symbol of the alphabet and % to the other!

H (3nduction .ypothesis #or some n I 4, .u2man coding returns an optimal prex

code for any input alphabet containing n symbols!

H (3nductive Step 'ssume that the 3nduction .ypothesis (3. holds for some value

of n! >et ' be an input alphabet containing n B % symbols! 1e show that .u2man

coding returns an optimal prex code for '!

>et a% and a4 be the two symbols of smallest probability in '! Consider the merged

alphabet 'G 5merge(', a%, a4 as dened above! y the 3., the .u2man tree G for

this merged alphabet 'G is optimal! 1e also 0now that G is in fact the same as the

tree merge(, a%, a4 that results from the .u2man tree for the original alphabet

' by replacing a% and a4 with their parent node! #urthermore, the expected codelength > for exceeds the expected code length >G for G by exactly the sum p of

the probabilities of a% and a4!

1e claim that no binary coding tree for ' has an expected code length less than > 5

>G Bp!

>et 4 be any tree of lowest expected code length for '! 1ithout loss of generality,

a% and a4 will be leaves at the deepest level of 4, since otherwise one could swap

them with shallower leaves and reduce the code length even further! #urthermore,

we may assume that a% and a4 are siblings in 4! herefore, we obtain from 4 a

coding tree 4G for the merged alphabet 'G through the merge procedure describedabove, replacing a% and a4 by their parent labeled withGthe sum of their

probabilities:

 4G 5merge(4, a%, a4! y the observation above, the expected code lengths >4

and >4G of 4 and 4G respectively satisfy >4 5 >4G B p! ut by the 3., G is optimal

for the alphabet 'G! herefore,

>4G I >G! 3t follows that >4 5 >4G B p I >G B p 5 >,

1hich shows that is optimal as claimed!

Minimum Variance Huffman Codes

The Huffman coding algorithm has some flexibility when two equal frequencies are found. The choice made in such situations will change the final code, including possibly the code length of each symbol. Since all Huffman codes are optimal, however, it cannot change the average length.


For example, consider the following message probabilities and codes:

symbol   probability   code 1   code 2
a        0.2           01       10
b        0.4           1        00
c        0.2           000      11
d        0.1           0010     010
e        0.1           0011     011

Both codings produce an average of 2.2 bits per symbol, even though the lengths are quite different in the two codes. Given this choice, is there any reason to pick one code over the other?

For some applications it can be helpful to reduce the variance in the code length. The variance is defined as

variance = p1×(l1 − l(avg))² + ... + pn×(ln − l(avg))²

where lj is the length of the codeword for symbol j and l(avg) is the average codeword length. With lower variance it can be easier to maintain a constant character transmission rate, or to reduce the size of buffers. In the above example, code 1 clearly has a much higher variance than code 2. It turns out that a simple modification to the Huffman algorithm can be used to generate a code that has minimum variance: when choosing the two nodes to merge and there is a choice based on weight, always pick the node that was created earliest in the algorithm. Leaf nodes are assumed to be created before all internal nodes. In the example above, after d and e are joined, the pair has the same probability as c and a (0.2), but it was created afterwards, so we join c and a. Similarly, we select b instead of ac to join with de, since it was created earlier. This gives code 2 above and the corresponding Huffman tree.
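A quick computation of the average length and variance of the two codes above:

p = {"a": 0.2, "b": 0.4, "c": 0.2, "d": 0.1, "e": 0.1}
code1 = {"a": "01", "b": "1", "c": "000", "d": "0010", "e": "0011"}
code2 = {"a": "10", "b": "00", "c": "11", "d": "010", "e": "011"}

def variance(p, code):
    avg = sum(p[s] * len(code[s]) for s in p)               # 2.2 for both codes
    return sum(p[s] * (len(code[s]) - avg) ** 2 for s in p)

print(variance(p, code1), variance(p, code2))               # 1.36 versus 0.16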


Extended Huffman Coding

One problem with Huffman codes is that they meet the entropy bound only when all probabilities are powers of 2. What would happen if the alphabet were binary, e.g. S = (a, b)? The only optimal case is when P = (Pa, Pb) with Pa = 1/2 and Pb = 1/2. Hence, Huffman codes can be bad.

For example: consider a situation where Pa = 0.8 and Pb = 0.2.

Solution: Since Huffman coding needs to use at least 1 bit per symbol to encode the input, the Huffman codewords are 1 bit per symbol on average.

However, the entropy of the distribution is

H = −(0.8×log2(0.8) + 0.2×log2(0.2)) ≈ 0.72 bits/symbol

The efficiency of the code is 0.72/1 = 72%.

This gives a gap of 1 − 0.72 = 0.28 bit. The performance of the Huffman encoding algorithm is, therefore, (0.28/1) = 28% worse than optimal in this case.

The idea of extended Huffman coding is to encode a sequence of source symbols instead of individual symbols. The alphabet size of the source is artificially increased in order to improve the code efficiency. For example, instead of assigning a codeword to every individual symbol of a source alphabet, we derive a codeword for every two symbols.

The following example shows how to achieve this:


Example: Create a new alphabet S' = (aa, ab, ba, bb) extended from S = (a, b). Let aa = A, ab = B, ba = C and bb = D. We now have an extended alphabet S' = (A, B, C, D). Each symbol in the alphabet S' is a combination of two symbols from the original alphabet S. The size of the alphabet S' increases to 2² = 4.

Suppose each occurrence of symbol 'a' or 'b' is independent. The probability distribution for S', the extended alphabet, can be calculated as below:

PA = Pa × Pa = 0.64
PB = Pa × Pb = 0.16
PC = Pb × Pa = 0.16
PD = Pb × Pb = 0.04

We then follow the normal static Huffman encoding algorithm to derive the Huffman code for S'.

The canonical minimum-variance code for S' is (0, 11, 100, 101) for A, B, C, D respectively. The average length is 1.56 bits for two symbols.

The output thus becomes 1.56/2 = 0.78 bit per symbol. The efficiency of the code has been increased to 0.72/0.78 ≈ 92%. This is only (0.78 − 0.72)/0.78 ≈ 8% worse than optimal.
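The gain from pairing symbols can be recomputed directly. A short sketch using the probabilities and code above:

from math import log2

pa, pb = 0.8, 0.2
pairs = {"aa": pa * pa, "ab": pa * pb, "ba": pb * pa, "bb": pb * pb}
code = {"aa": "0", "ab": "11", "ba": "100", "bb": "101"}

avg_pair = sum(pairs[s] * len(code[s]) for s in pairs)   # 1.56 bits per pair
avg_symbol = avg_pair / 2                                # 0.78 bits per symbol
entropy = -(pa * log2(pa) + pb * log2(pb))               # ~0.722 bits per symbol
print(avg_symbol, entropy / avg_symbol)                  # 0.78, efficiency ~0.92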

Dynmic0Adpti"e +$%%mn Codin'

(or )uffman coding, we need to !now the probabilities for the individual symbols

0and we also need to !now the "oint probabilities for bloc!s of symbols in e2tended

)uffman coding1. If the probability distributions are not !nown for a file of

characters to be transmitted, they have to be estimated first and the code itself has

to be included in the transmission of the coded file. In dynamic )uffman coding a

 particular probability 0frequency of symbols1 distribution is assumed at the

transmitter and receiver and hence a )uffman code is available to start with. #ssource symbols come in to be coded the relative frequency of the different symbols

is updated at both the transmitter and the receiver, and corresponding to this the

code itself is updated. In this manner the code continuously adapts to the nature of

the source distribution, which may change as time progresses and different files are

 being transferred.


Dynamic Huffman coding is the basis of data compression algorithms used in V-series modems for transmission over the PSTN. In particular, the MNP (Microcom Networking Protocol) Class 5 protocol, commonly found in modems such as the V.32bis modem, uses both CRC error detection and dynamic Huffman coding for data compression.

Adpti"e Approch

In the adaptive )uffman coding, an alphabet and frequencies of its symbols are

collected and maintained dynamically according to the source file on each

iteration. The )uffman tree is also updated based on the alphabet and frequencies

dynamically. >hen the encoder and decoder are at different locations, both

maintain an identical )uffman tree for each step independently. Therefore, there is

no need transferring the )uffman tree.

During the compression process, the )uffman tree is updated each time after a

symbol is read. The codeword0s1 for the symbol is output immediately. (or

convenience of discussion, the frequency of each symbol is called the weight of the

symbol to reflect the change of the frequency count at each stage.

The output of the adaptive )uffman encoding consists of )uffman codewords as

well as fi2ed length codewords. (or each input symbol, the output can be a

)uffman codeword based on the )uffman tree in the previous step or a codewordof a fi2ed length code such as #8II. Nsing a fi2ed length codeword as the output

is necessary when a new symbol is read for the first time. In this case, the )uffman

tree does not include the symbol yet. It is therefore reasonable to output the

uncompressed version of the symbol. If the source file consists of #8II, then the

fi2ed length codeword would simply be the uncompressed version of the symbol.

In the encoding process, for e2ample, the model outputs a codeword of a fi2ed

length code such as #8II code, if the input symbol has been seen for the first

time. 3therwise, it outputs a )uffman codeword.

)owever, a mi2ture of the fi2ed length and variable length codewords can cause

 problems in the decoding process. The decoder needs to !now whether the

codeword should be decoded according to a )uffman tree or by a fi2ed length


codeword, before taking the right approach. A special symbol, used as a flag, therefore signals a switch from one type of codeword to the other.

Let the current alphabet be the subset S = (s1, s2, ..., sn) of some alphabet Σ, and let g(si) be any fixed-length codeword for si (e.g. its ASCII code), i = 1, 2, .... To indicate whether the output codeword is a fixed-length or a variable-length codeword, one special flag symbol (which does not belong to Σ) is defined as a flag or shift key, and is placed before each fixed-length codeword for communication between the compressor and the decompressor.

The compression algorithm maintains the subset S of symbols of the alphabet Σ that the system has seen so far. A Huffman code (i.e. the Huffman tree) for all the symbols in S, including the flag, is also maintained. Let the weight of the flag always be 0, and the weight of any other symbol in S be its frequency so far. For convenience, we represent the weight of each symbol by a number in round brackets; for example, A(1) means that symbol A has a weight of 1.

Initially, S is empty and the Huffman tree has the single node of the flag symbol (step (0)). During the encoding process, the alphabet S grows by one symbol each time a new symbol is read. The weight of a new symbol is always 1, and the weight of an existing symbol in S is increased by 1 when the symbol is read. The Huffman tree is used to assign codewords to the symbols in S and is updated after each output.

The following example shows the idea of adaptive Huffman coding.

Example: Suppose that the source file is the string AABBBA.
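The original notes walk through the tree states step by step in a figure. The process can be sketched naively in Python (an assumption made here for brevity: instead of the incremental FGK/Vitter tree update used by real adaptive Huffman coders, the tree is simply rebuilt from the current weights after every symbol, with ties broken deterministically so that encoder and decoder stay in sync):

import heapq

FLAG = ""   # stand-in for the special shift symbol; its weight stays 0

def build_codes(weights):
    # Rebuild the Huffman tree for the current weights (sorted for determinism).
    heap = [(w, i, s) for i, (s, w) in enumerate(sorted(weights.items()))]
    heapq.heapify(heap)
    n = len(heap)
    if n == 1:
        return {heap[0][2]: "0"}
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, n, (t1, t2)))
        n += 1
    codes = {}
    def walk(t, prefix):
        if isinstance(t, tuple):
            walk(t[0], prefix + "0")
            walk(t[1], prefix + "1")
        else:
            codes[t] = prefix
    walk(heap[0][2], "")
    return codes

def adaptive_encode(text):
    weights, out = {FLAG: 0}, []
    for ch in text:
        codes = build_codes(weights)
        if ch in weights:
            out.append(codes[ch])                 # symbol already in the tree
            weights[ch] += 1
        else:                                     # first occurrence:
            out.append(codes[FLAG] + format(ord(ch), "08b"))  # flag + raw 8 bits
            weights[ch] = 1
    return " ".join(out)

print(adaptive_encode("AABBBA"))

For AABBBA this emits a flagged raw 8-bit code at each first occurrence of a symbol, and progressively shorter Huffman codewords as the weights grow.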


Disd"nt'es o% +$%%mn l'orithms

#daptive )uffman coding has the advantage of requiring no preprocessing and the

low overhead of using the uncompressed version of the symbols only at their first

occurrence.

The algorithms can be applied to other types of files in addition to te2t files.

The symbols can be ob"ects or bytes in e2ecutable files.

)uffman coding, either static or adaptive, has two disadvantages that remain

unsolved:

Disd"nt'e .: It is not optimal unless all probabilities are negative powers of -.

This means that there is a gap between the average number of bits and the entropy

in most cases.

Aecall the particularly bad situation for binary alphabets. #lthough by grouping

symbols and e2tending the alphabet, one may come closer to the optimal, the

 bloc!ing method requires a larger alphabet to be handled. 8ometimes, e2tended

)uffman coding is not that effective at all.

Disd"nt'e 1: Despite the availability of some clever methods for counting the

frequency of each symbol reasonably quic!ly, it can be very slow when rebuilding

the entire tree for each symbol. This is normally the case when the alphabet is big

and the probability distributions change rapidly with each symbol.

RICE CODES

Rice encoding (a special case of Golomb coding) can be applied to reduce the number of bits required to represent small numbers. Rice's algorithm is easy to implement.

Named after Robert Rice, Rice coding is a specialised form of Golomb coding. It is used to encode strings of numbers with a variable bit length for each number. If most of the numbers are small, fairly good compression can be achieved. Rice coding is commonly used for entropy coding in audio/video codecs.

Rice coding depends on a parameter k and works the same as Golomb coding with a parameter m, where m = 2^k. To encode a number x:


1. Let q = x / m (round fractions down), and write out q binary ones.
2. Write out a binary zero. (Some people prefer to do it the other way around: zeros followed by a one.)
3. Write out the last k bits of x.

Decoding works the same way, just backwards.

Al'orithm O"er"iew

iven a constant M, any symbol S can be represented as a quotient 0Q1 and remainder 0R1, where:

S = Q × M + R.

If S is small 0relative to M1 then Q will also be small. Aice encoding is designed to reduce the

number of bits required to represent symbols where Q is small.

Aather than representing both Q and R as binary values, Aice encoding represents Q as a unary

value and R as a binary value.

(or those not familiar with unary notation, a value N may be represented by N 7s followed by a <.

E*mple/ = H 777< and R H 77777<.

Note/ The following is true for binary values, if log2(M) = K where K is an integer:

7. Q = S >> K 0S left shifted Y bits1

-. R = S & (M - 1) 0S bitwise #CDed with (M - 1)1

=. R can be represented using K bits.

Encoding

Rice coding is fairly straightforward. Given a bit length K, compute the modulus M using the equation M = 2^K. Then do the following for each symbol S:

1. Write out S & (M − 1) in binary.
2. Write out S >> K in unary.

That's it. I told you it was straightforward.


Example:

Encode the 8-bit value 18 (0b00010010) when K = 4 (M = 16):

1. S & (M − 1) = 18 & (16 − 1) = 00010010 & 1111 = 0010
2. S >> K = 18 >> 4 = 0b00010010 >> 4 = 0b0001 (10 in unary)

So the encoded value is 100010, saving 2 bits.

Decoding

Decoding isn't any harder than encoding. As with encoding, given a bit length K, compute the modulus M using the equation M = 2^K. Then do the following for each encoded symbol S:

1. Determine Q by counting the number of 1s before the first 0.
2. Determine R by reading the next K bits as a binary value.
3. Write out S as Q × M + R.

Example:

Decode the encoded value 100010 when K = 4 (M = 16):

1. Q = 1
2. R = 0b0010 = 2
3. S = Q × M + R = 1 × 16 + 2 = 18

Rice coding only works well when symbols are encoded with small values of Q. Since Q is unary, encoded symbols can become quite large for even slightly large Qs: it takes 8 bits just to represent the value 7. One way to improve the compression obtained by Rice coding on generic data is to apply a reversible transformation that reduces the average value of a symbol. The Burrows-Wheeler Transform (BWT) with Move-To-Front (MTF) encoding is such a transform.
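Both directions fit in a few lines of Python. A sketch for non-negative integers (bits are handled as strings for clarity, not efficiency):

def rice_encode(x, k):
    # Unary quotient (q ones, then a zero), followed by the low k bits of x.
    q = x >> k
    return "1" * q + "0" + format(x & ((1 << k) - 1), "0{}b".format(k))

def rice_decode(bits, k):
    q = bits.index("0")                  # number of leading ones
    r = int(bits[q + 1:q + 1 + k], 2)    # next k bits, read as binary
    return (q << k) + r                  # S = Q * M + R with M = 2**k

print(rice_encode(18, 4))                # 100010
print(rice_decode("100010", 4))          # 18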

De"elopement/


Jacob Ziv and Abraham Lempel introduced a simple and efficient compression method, published in their article "A Universal Algorithm for Sequential Data Compression". The algorithm is referred to as LZ77, in honour of the authors and the publishing date, 1977.

Fundamentals:

LZ77 is a dictionary-based algorithm that addresses byte sequences in the former contents instead of the original data. In general only one coding scheme exists; all data are coded in the same form:

• Address of already coded contents
• Sequence length
• First deviating symbol

If no identical byte sequence is available in the former contents, the address 0, the sequence length 0 and the new symbol will be coded.

Example "abracadabra":

Processed    Remaining      Addr.   Length   Deviating symbol
             abracadabra    0       0        a
a            bracadabra     0       0        b
ab           racadabra      0       0        r
abr          acadabra       3       1        c
abrac        adabra         2       1        d
abracad      abra           7       4        -

Because each byte sequence is extended by the first symbol deviating from the former contents, the set of already used symbols will continuously grow. No additional coding scheme is necessary. This allows an easy implementation with minimum requirements on the encoder and decoder.
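A small Python sketch of this triple scheme, which reproduces the table above. Two conventions are inferred from the example and are assumptions of this sketch: the address is the distance back from the current position (with the nearest match winning ties), and "-" marks a missing deviating symbol at the end of the data:

def lz77_encode(data):
    # Emits (address, match length, first deviating symbol) triples.
    out, pos = [], 0
    while pos < len(data):
        best_len, best_addr = 0, 0
        for start in range(pos - 1, -1, -1):        # nearest match wins on ties
            length = 0
            while (pos + length < len(data)
                   and data[start + length] == data[pos + length]):
                length += 1                          # overlap into the lookahead is allowed
            if length > best_len:
                best_len, best_addr = length, pos - start
        next_sym = data[pos + best_len] if pos + best_len < len(data) else "-"
        out.append((best_addr, best_len, next_sym))
        pos += best_len + 1
    return out

print(lz77_encode("abracadabra"))
# [(0, 0, 'a'), (0, 0, 'b'), (0, 0, 'r'), (3, 1, 'c'), (2, 1, 'd'), (7, 4, '-')]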


Restrictions:

To keep runtime and buffering capacity in an acceptable range, the addressing must be limited to a certain maximum. Contents exceeding this range will not be regarded for coding and will not be covered by the size of the addressing pointer.

Compression Efficiency:

The achievable compression rate depends only on repeating sequences. Other types of redundancy, like an unequal probability distribution of the set of symbols, cannot be reduced. For that reason the compression of a pure LZ77 implementation is relatively low.

A significantly better compression rate can be obtained by combining LZ77 with an additional entropy coding algorithm, for example Huffman or Shannon-Fano coding. The widespread Deflate compression method (used e.g. for GZIP or ZIP) uses Huffman codes, for instance.

Of these, LZ77 is probably the most straightforward. It tries to replace recurring patterns in the data with a short code. The code tells the decompressor how many symbols to copy and from where in the output to copy them. To compress the data, LZ77 maintains a history buffer which contains the data that has been processed, and tries to match the next part of the message to it. If there is no match, the next symbol is output as-is; otherwise an (offset, length) pair is output.

[The original pages show a step-by-step encoder trace here: after each step they list the output so far, the history buffer and the lookahead buffer. The first seven symbols, S 0 0 3 0 0 0, are emitted as literals; the repeated runs that follow are emitted as (offset, length) pairs with match offsets 6, 2 and 11, each found by locating the longest match in the history buffer.]

At each stage the string in the lookahead buffer is searched for in the history buffer. The longest match is used, and the distance between the match and the current position is output, together with the match length. The processed data is then moved to the history buffer. Note that the history buffer contains data that has already been output; on the decompression side it corresponds to data that has already been decompressed. The message becomes the seven literals S 0 0 3 0 0 0 followed by the three (offset, length) pairs.

The following describes what the decompressor does with this data.

[The original pages show the matching decoder trace: the history buffer grows symbol by symbol for the seven literals, and each (offset, length) pair is then expanded by copying previously decoded symbols.]

In the decompressor, the history buffer contains the data that has already been decompressed. If we get a literal symbol code, it is added as-is. If we get an (offset, length) pair, the offset tells us from where to copy and the length tells us how many symbols to copy to the current output position. For example, (9,3) tells us to go back 9 locations and copy 3 symbols to the current output position. The great thing is that we don't need to transfer or maintain any other data structure than the data itself.
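A minimal decoder sketch for such a token stream (assuming a mix of literal symbols and (offset, length) tuples, with the offset counting back from the current end of the output):

def lz77_decode(tokens):
    out = []
    for tok in tokens:
        if isinstance(tok, tuple):
            offset, length = tok
            for _ in range(length):          # symbol-by-symbol copy also
                out.append(out[-offset])     # handles overlapping matches
        else:
            out.append(tok)                  # literal symbol, added as-is
    return "".join(out)

print(lz77_decode(["a", "b", "c", (3, 6)]))   # "abcabcabc"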

Lempel-Ziv 1977

In 1977, Ziv and Lempel proposed a lossless compression method which replaces phrases in the data stream by a reference to a previous occurrence of the phrase. As long as it takes fewer bits to represent the reference and the phrase length than the phrase itself, we get compression. It is a bit like the way BASIC substitutes tokens for keywords.

LZ77-type compressors use a history buffer, which contains a fixed amount of the symbols output/seen so far. The compressor reads symbols from the input into a lookahead buffer and tries to find as long a match as possible in the history buffer. The length of the string match and its location in the buffer (offset from the current position) are written to the output. If there is no suitable match, the next input symbol is sent as a literal symbol.


Of course, there must be a way to identify literal bytes and compressed data in the output. There are a lot of different ways to accomplish this, but a single bit to select between a literal and compressed data is the easiest.

The basic scheme is a variable-to-block code: a variable-length piece of the message is represented by a constant amount of bits, the match length and the match offset. Because the data in the history buffer is known to both the compressor and the decompressor, it can be used in the compression. The decompressor simply copies part of the already decompressed data, or a literal byte, to the current output position.

Variants of LZ77 apply additional compression to the output of the compressor. These include a simple variable-length code (LZB), dynamic Huffman coding (LZH), and Shannon-Fano coding (ZIP 1.x), all of which result in a certain degree of improvement over the basic scheme. This is because the output values from the first stage are not evenly distributed, i.e. their probabilities are not equal, and statistical compression can do its part.

Lempel-Ziv 78 (LZ78)

One year after publishing LZ77, Jacob Ziv and Abraham Lempel introduced another compression method ("Compression of Individual Sequences via Variable-Rate Coding"). Accordingly, this procedure is called LZ78.

Fundamental algorithm:

LZ78 is based on a dictionary that is created dynamically at runtime. Both the encoding and the decoding process use the same rules to ensure that an identical dictionary is available. This dictionary contains every sequence already used to build the former contents. The compressed data have the general form:

• Index addressing an entry of the dictionary
• First deviating symbol


In contrast to LZ77, no combination of address and sequence length is used. Instead, only the index into the dictionary is stored. The mechanism of adding the first deviating symbol carries over from LZ77.

Example "abracadabra":

Processed    Remaining      Index   Deviating symbol   New dictionary entry
             abracadabra    0       a                  1: "a"
a            bracadabra     0       b                  2: "b"
ab           racadabra      0       r                  3: "r"
abr          acadabra       1       c                  4: "ac"
abrac        adabra         1       d                  5: "ad"
abracad      abra           1       b                  6: "ab"
abracadab    ra             3       a                  7: "ra"

An LZ78 dictionary grows slowly. For relevant compression, a larger amount of data must be processed. Additionally, the compression mainly depends on the size of the dictionary. But a larger dictionary requires higher effort for addressing and administration at runtime.

In practice, the dictionary would be implemented as a tree to minimize the effort for searching. Starting with the current symbol, the algorithm evaluates for every succeeding symbol whether it is available in the tree. If a leaf node is found, the corresponding index is written to the compressed data. The decoder can be realized with a simple table, because the decoder does not need the search function.

The size of the dictionary grows during the coding process, so the size needed for addressing the table increases continuously. In parallel, the requirements for storing and searching also grow permanently. A limitation of the dictionary and corresponding update mechanisms are therefore required.

LZ78 is the basis for other compression methods, like the widespread LZW used e.g. for GIF graphics.


Lempel-Ziv 1978

One large problem with the LZ77 method is that it does not use the coding space efficiently, i.e. there are length and offset values that never get used. If the history buffer contains multiple copies of a string, only the latest occurrence is needed, but they all take space in the offset value space. Each duplicate string wastes one offset value.

To get higher efficiency, we have to create a real dictionary. Strings are added to the codebook only once. There are no duplicates that waste bits just because they exist. Also, each entry in the codebook has a specific length, thus only an index into the codebook is needed to specify a string (phrase). In LZ77 the length and offset values were handled more or less as disconnected variables, although there is correlation. Because they are now handled as one entity, we can expect to do a little better in that regard also.

LZ78-type compressors use this kind of a dictionary. The next part of the message (the lookahead buffer contents) is searched for in the dictionary and the maximum-length match is returned. The output code is an index into the dictionary. If there is no suitable entry in the dictionary, the next input symbol is sent as a literal symbol. The dictionary is updated after each symbol is encoded, so that it is possible to build an identical dictionary in the decompression code without sending additional data.

Essentially, strings that we have seen in the data are added to the dictionary. To be able to constantly adapt to the message statistics, the dictionary must be trimmed down by discarding the oldest entries. This also prevents the dictionary from becoming full, which would decrease the compression ratio. This is handled automatically in LZ77 by its use of a history buffer (a sliding window); for LZ78 it must be implemented separately. Because the decompression code updates its dictionary in synchronization with the compressor, the code remains uniquely decodable.

LZ78 example: "bed spreaders spread spreads on beds"


Encoding

At the beginning of encoding the dictionary is empty. In order to explain the principle of encoding, let's consider a point within the encoding process when the dictionary already contains some strings.

We start analyzing a new prefix in the charstream, beginning with an empty prefix. If its corresponding string (the prefix P plus the character C after it) is present in the dictionary, the prefix is extended with the character C. This extending is repeated until we get a string which is not present in the dictionary. At that point we output two things to the codestream: the code word that represents the prefix P, and then the character C. Then we add the whole string P + C to the dictionary and start processing the next prefix in the charstream.

A special case occurs if the dictionary doesn't contain even the starting one-character string (for example, this always happens in the first encoding step). In that case we output a special code word that represents an empty string, followed by this character, and add this character to the dictionary.

The output from this algorithm is a sequence of code word-character pairs (W, C). Each time a pair is output to the codestream, the string from the dictionary corresponding to W is extended with the character C and the resulting string is added to the dictionary. This means that when a new string is being added, the dictionary already contains all the substrings formed by removing characters from the end of the new string.


1. At the start, the dictionary and P are empty;
2. C := next character in the charstream;
3. Is the string P+C present in the dictionary?
   a. if it is, P := P+C (extend P with C);
   b. if not:
      i. output these two objects to the codestream:
         - the code word corresponding to P (if P is empty, output a zero);
         - C, in the same form as input from the charstream;
      ii. add the string P+C to the dictionary;
      iii. P := empty;
   c. are there more characters in the charstream?
      - if yes, return to step 2;
      - if not:
         i. if P is not empty, output the code word corresponding to P;
         ii. END.

Decoding

At the start of decoding the dictionary is empty. It gets reconstructed in the process of decoding. In each step, a code word-character pair (W, C) is read from the codestream. The code word always refers to a string already present in the dictionary. The string string.W and the character C are output to the charstream, and the string string.W + C is added to the dictionary.

After the decoding, the dictionary will look exactly the same as after the encoding.

The decoding algorithm

1. At the start the dictionary is empty;
2. W := next code word in the codestream;
3. C := the character following it;


4. output the string string.W to the charstream (this can be an empty string), and then output C;
5. add the string string.W + C to the dictionary;
6. are there more code words in the codestream?
   - if yes, go back to step 2;
   - if not, END.

An Example

The encoding process is presented in Table 1.

• The column Step indicates the number of the encoding step. Each encoding step is completed when step 3.b. of the encoding algorithm is executed.
• The column Pos indicates the current position in the input data.
• The column Dictionary shows what string has been added to the dictionary. The index of the string is equal to the step number.
• The column Output presents the output in the form (W, C).
• The output of each step decodes to the string that has been added to the dictionary.

Charstream to be encoded:

Pos:  1 2 3 4 5 6 7 8 9
Char: A B B C B C A B A

Table 1: The encoding process

Step   Pos   Dictionary   Output
1.     1     "A"          (0, A)
2.     2     "B"          (0, B)
3.     3     "BC"         (2, C)
4.     5     "BCA"        (3, A)
5.     8     "BA"         (2, A)
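A compact Python sketch of this encoder; it reproduces the Output column of Table 1 for the charstream above:

def lz78_encode(data):
    # dictionary maps phrase -> index; the empty string has implicit index 0.
    dictionary, out, p = {}, [], ""
    for c in data:
        if p + c in dictionary:
            p += c                                  # extend the prefix P with C
        else:
            out.append((dictionary.get(p, 0), c))   # (code word for P, C)
            dictionary[p + c] = len(dictionary) + 1
            p = ""
    if p:
        out.append((dictionary[p], ""))             # trailing prefix, no character
    return out

print(lz78_encode("ABBCBCABA"))
# [(0, 'A'), (0, 'B'), (2, 'C'), (3, 'A'), (2, 'A')]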


Lempel-Ziv-Welch (LZW)

The LZW compression method is derived from LZ78, as introduced by Jacob Ziv and Abraham Lempel. It was invented by Terry A. Welch in 1984, who published his considerations in the article "A Technique for High-Performance Data Compression".

At that time Terry A. Welch was employed in a leading position at the Sperry Research Center.

The LZW method is covered by patents valid in a number of countries, e.g. in the USA, Europe and Japan. Meanwhile Unisys holds the rights, but there are probably more patents from other companies regarding LZW. Some of these patents expire in 2003 (USA) and 2004 (Europe, Japan).

LZW is an important part of a variety of data formats. Graphics formats like GIF, TIFF (optional) and PostScript (optional) use LZW for entropy coding.

Fundamental algorithm:

LZW develops a dictionary that contains every byte sequence already coded. The compressed data consist exclusively of indices into this dictionary. Before starting, the dictionary is preset with entries for the 256 single-byte symbols; every entry added later represents sequences longer than one byte.

The algorithm presented by Terry Welch defines mechanisms to create the dictionary and to ensure that it will be identical for both the encoding and the decoding process.

Arithmetic Coding

Arithmetic coding is the most efficient method to code symbols according to the probability of their occurrence. The average code length corresponds exactly to the possible minimum given by information theory. Deviations caused by the bit-resolution of binary code trees do not exist.


In contrast to a binary Huffman code tree, arithmetic coding offers a clearly better compression rate. Its implementation is more complex, on the other hand.

Unfortunately, the usage is restricted by patents; as far as is known, it is not allowed to use arithmetic coding without acquiring licences.

Arithmetic coding is part of the JPEG data format. As an alternative to Huffman coding, it can be used for the final entropy coding. In spite of its lower efficiency, Huffman coding remains the standard, due to the legal restrictions mentioned above.

LZW Compression for a String


LZW Encoding Algorithm

1  initialize table with single-character strings
2  P = first input character
3  WHILE not end of input stream
4      C = next input character
5      IF P + C is in the string table
6          P = P + C
7      ELSE
8          output the code for P
9          add P + C to the string table
10         P = C
11 END WHILE
12 output code for P
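The same steps as a runnable Python sketch, with the table preset to the 256 single-character strings:

def lzw_encode(data):
    table = {chr(i): i for i in range(256)}   # single-character strings
    out, p = [], ""
    for c in data:
        if p + c in table:
            p += c                            # P = P + C
        else:
            out.append(table[p])              # output the code for P
            table[p + c] = len(table)         # add P + C to the string table
            p = c                             # P = C
    if p:
        out.append(table[p])                  # output code for the final P
    return out

print(lzw_encode("abababab"))   # [97, 98, 256, 258, 98]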

Arithmetic Coding

In arithmetic coding, a message is encoded as a real number in an interval between zero and one. Arithmetic coding typically has a better compression ratio than Huffman coding, as it produces a single fractional number for the whole message rather than a series of separate codewords. Arithmetic coding is a lossless coding technique. There are a few disadvantages of arithmetic coding. One is that the whole codeword must be received before decoding of the symbols can start, and if there is a corrupt bit in the codeword, the entire message could become corrupt. Another is that there is a limit to the precision of the number which can be encoded, thus limiting the number of symbols that can be encoded within a codeword. There also exist many patents on arithmetic coding, so the use of some of the algorithms may require royalty fees.

Here is the arithmetic coding algorithm, with an example to aid understanding.


1. Start with an interval [0, 1), divided into subintervals for all possible symbols that can appear within a message. Make the size of each subinterval proportional to the frequency with which the symbol appears in the message. E.g.:

Symbol   Probability   Interval
a        0.2           [0.0, 0.2)
b        0.3           [0.2, 0.5)
c        0.1           [0.5, 0.6)
d        0.4           [0.6, 1.0)

2. When encoding a symbol, "zoom" into the current interval, and divide it into subintervals as in step one, using the new range. Example: suppose we want to encode "abd". We "zoom" into the interval corresponding to "a", and divide up that interval into smaller subintervals as before. We now use this new interval as the basis of the next symbol encoding step.

Symbol   New "a" Interval
a        [0.00, 0.04)
b        [0.04, 0.10)
c        [0.10, 0.12)
d        [0.12, 0.20)

3. Repeat the process until the maximum precision of the machine is reached, or all symbols are encoded. To encode the next character "b", we take the "a" interval created before, zoom into its subinterval "b" ([0.04, 0.10)), and divide that up for the next step. This produces:

Symbol   New "b" Interval
a        [0.040, 0.052)
b        [0.052, 0.070)
c        [0.070, 0.076)
d        [0.076, 0.100)

And lastly, zooming into the "d" subinterval [0.076, 0.100), the final result is:

Symbol   New "d" Interval
a        [0.0760, 0.0808)
b        [0.0808, 0.0880)
c        [0.0880, 0.0904)
d        [0.0904, 0.1000)

4. Transmit some number within the latest interval to send the codeword. The number of symbols encoded will be stated in the protocol of the image format, so any number within [0.076, 0.100), e.g. 0.08, will be acceptable; the sketch below reproduces this computation.
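The interval arithmetic in these steps is mechanical, so a few lines of Python reproduce it (a toy sketch written for this example; it ignores the machine-precision limit mentioned in step 3):

# "Zoom" encoding of "abd" with the ranges from the table in step 1.
ranges = {'a': (0.0, 0.2), 'b': (0.2, 0.5), 'c': (0.5, 0.6), 'd': (0.6, 1.0)}

low, high = 0.0, 1.0
for symbol in "abd":
    width = high - low
    sym_low, sym_high = ranges[symbol]
    low, high = low + width * sym_low, low + width * sym_high
    print(symbol, round(low, 6), round(high, 6))
# a 0.0 0.2
# b 0.04 0.1
# d 0.076 0.1   -> transmit any number in [0.076, 0.100), e.g. 0.08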

To decode the message, a similar algorithm is followed, except that the final number is given, and the symbols are decoded sequentially from that.

Algorithm Overview

Arithmetic coding is similar to Huffman coding; they both achieve their compression by reducing the average number of bits required to represent a symbol.

;i"en/

#n alphabet with symbols 8<, 87, ... 8n, where each symbol has a probability of occurrence of p<,

 p7, ... pn such that \pi H 7.

(rom the fundamental theorem of information theory, it can be shown that the optimal coding for 

8i requires 0pi]log-0pi11 bits.

+ore often than not, the optimal number of bits is fractional. Nnli!e )uffman coding, arithmetic

coding provides the ability to represent symbols with fractional bits.
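A two-line check makes the point concrete (Python; the probabilities are arbitrary illustrative values, output rounded):

import math

# Optimal code lengths are -log2(p) bits and are usually fractional.
for p in (0.5, 0.3, 0.15):
    print(p, round(-math.log2(p), 2))
# 0.5  -> 1.0  bits
# 0.3  -> 1.74 bits
# 0.15 -> 2.74 bits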

Since Σpi = 1, we can represent each probability, pi, as a unique non-overlapping range of values between 0 and 1. There's no magic in this, we're just creating ranges on a probability line.

For example, suppose we have an alphabet 'a', 'b', 'c', 'd', and 'e' with probabilities of occurrence of 30%, 15%, 25%, 10%, and 20%. We can choose the following range assignments to each symbol based on its probability:

TABLE 1. Sample Symbol Ranges

Symbol   Probability   Range
a        30%           [0.00, 0.30)
b        15%           [0.30, 0.45)
c        25%           [0.45, 0.70)
d        10%           [0.70, 0.80)
e        20%           [0.80, 1.00)

Here square brackets '[' and ']' mean the adjacent number is included, and parentheses '(' and ')' mean the adjacent number is excluded.

Range assignments like the ones in this table can then be used for encoding and decoding strings of symbols in the alphabet. Algorithms using ranges for coding are often referred to as range coders.
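Building such a table is mechanical; the sketch below (Python, written for this text) turns a probability list into cumulative, non-overlapping ranges:

# Build symbol ranges from the TABLE 1 probabilities.
probabilities = {'a': 0.30, 'b': 0.15, 'c': 0.25, 'd': 0.10, 'e': 0.20}

ranges = {}
low = 0.0
for symbol, p in probabilities.items():
    ranges[symbol] = (low, low + p)
    low += p

for symbol, (lo, hi) in ranges.items():
    print(symbol, round(lo, 6), round(hi, 6))
# a 0.0 0.3 / b 0.3 0.45 / c 0.45 0.7 / d 0.7 0.8 / e 0.8 1.0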

"ncodin Strins

By assigning each symbol its own unique probability range, it's possible to encode a single symbol by its range. Using this approach, we could encode a string as a series of probability ranges, but that doesn't compress anything. Instead, additional symbols may be encoded by restricting the current probability range by the range of each new symbol being encoded. The pseudocode below illustrates how additional symbols may be added to an encoded string by restricting the string's range bounds.

lower bound = 0
upper bound = 1

while there are still symbols to encode
    current range = upper bound - lower bound
    upper bound = lower bound + (current range × upper bound of new symbol)
    lower bound = lower bound + (current range × lower bound of new symbol)
end while

Any value between the computed lower and upper probability bounds now encodes the input string.

Example:

Encode the string "ace" using the probability ranges from TABLE 1.

Start with lower and upper probability bounds of 0 and 1.

Encode 'a'
current range = 1 - 0 = 1
upper bound = 0 + (1 × 0.30) = 0.30
lower bound = 0 + (1 × 0.00) = 0.00

Encode 'c'
current range = 0.30 - 0.00 = 0.30
upper bound = 0.00 + (0.30 × 0.70) = 0.210
lower bound = 0.00 + (0.30 × 0.45) = 0.135

Encode 'e'
current range = 0.210 - 0.135 = 0.075
upper bound = 0.135 + (0.075 × 1.00) = 0.210
lower bound = 0.135 + (0.075 × 0.80) = 0.195

The string "ace" may be encoded by any value within the probability range [0.195, 0.210).
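The pseudocode from the Encoding Strings section transcribes almost line for line into Python; this sketch hard-codes the TABLE 1 ranges and reproduces the bounds computed above (output rounded, since floating-point arithmetic introduces tiny errors):

# Range-restriction encoder using the TABLE 1 symbol ranges.
RANGES = {'a': (0.00, 0.30), 'b': (0.30, 0.45), 'c': (0.45, 0.70),
          'd': (0.70, 0.80), 'e': (0.80, 1.00)}

def encode(message):
    lower, upper = 0.0, 1.0
    for symbol in message:
        current_range = upper - lower
        sym_lower, sym_upper = RANGES[symbol]
        # Compute the new upper bound before overwriting the lower bound.
        upper = lower + current_range * sym_upper
        lower = lower + current_range * sym_lower
    return lower, upper

low, high = encode("ace")
print(round(low, 6), round(high, 6))   # 0.195 0.21, matching the worked example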

It should become apparent from the example that precision requirements increase as additional symbols are encoded. Strings of unlimited length would require infinite-precision probability range bounds; practical implementations sidestep this by renormalizing the bounds with fixed-precision integer arithmetic.

<ecodin Strins

The decoding process must start with a an encoded value representing a string. By definition, the

encoded value lies within the lower and upper probability range bounds of the string it

represents. 8ince the encoding process !eeps restricting ranges 0without shifting1, the initial

value also falls within the range of the first encoded symbol. 8uccessive encoded symbols may

 be identified by removing the scaling applied by the !nown symbol. To do this, subtract out the

lower probability range bound of the !nown symbol, and multiply by the si4e of the symbols'

range.

Based on the discussion above, decoding a value may be performed by following the steps in the pseudocode below:

encoded value = encoded input

while string is not fully decoded
    identify the symbol containing the encoded value within its range
    remove the effect of the symbol from the encoded value:
    current range = upper bound of new symbol - lower bound of new symbol
    encoded value = (encoded value - lower bound of new symbol) ÷ current range
end while

Example:

Using the probability ranges from TABLE 1, decode the three-character string encoded as 0.20.

Decode the first symbol

0.20 is within [0.00, 0.30)


0.20 encodes 'a'

Remove the effects of 'a' from the encoded value
current range = 0.30 - 0.00 = 0.30
encoded value = (0.20 - 0.00) ÷ 0.30 = 0.67 (rounded)

Decode the second symbol

0.67 is within [0.45, 0.70)
0.67 encodes 'c'

Remove the effects of 'c' from the encoded value
current range = 0.70 - 0.45 = 0.25
encoded value = (0.67 - 0.45) ÷ 0.25 = 0.88

Decode the third symbol

0.88 is within [0.80, 1.00)
0.88 encodes 'e'

The encoded string is "ace".

In case you were sleeping, this is the string that was encoded in the encoding e2ample.
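For completeness, the decoding pseudocode can be transcribed the same way. This sketch reuses the TABLE 1 ranges; since the number of symbols is assumed to be known from the protocol, the toy version simply takes it as a count:

# Range-restriction decoder using the TABLE 1 symbol ranges.
RANGES = {'a': (0.00, 0.30), 'b': (0.30, 0.45), 'c': (0.45, 0.70),
          'd': (0.70, 0.80), 'e': (0.80, 1.00)}

def decode(value, count):
    message = ""
    for _ in range(count):
        # Identify the symbol whose range contains the encoded value.
        for symbol, (sym_lower, sym_upper) in RANGES.items():
            if sym_lower <= value < sym_upper:
                break
        message += symbol
        # Remove the effect of the symbol from the encoded value.
        value = (value - sym_lower) / (sym_upper - sym_lower)
    return message

print(decode(0.20, 3))   # prints: ace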