
Page 1:

Compression Algorithms

CSCI 2720

Spring 2005

Eileen Kraemer

Page 2:

When we last met …

We looked at string encoding, and noted that if we use the same number of bits per character of our alphabet, then the number of bits required to encode a character of the alphabet is

• ceil(log2(sizeof(alphabet)))

And we don't need to transmit or store the mapping from encodings to characters
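As a quick check of that formula, a minimal sketch (the function name bits_per_char is mine, not from the slides):

    import math

    def bits_per_char(alphabet_size):
        # Fixed-length encoding: ceil(log2(alphabet size)) bits per character.
        return math.ceil(math.log2(alphabet_size))

    print(bits_per_char(26))   # 5 bits for a 26-letter alphabet
    print(bits_per_char(256))  # 8 bits for full 8-bit characters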

Page 3:

What if …

the string we encode doesn't use all the letters in the alphabet? Then we need only ceil(log2(sizeof(set_of_characters_used))) bits per character

But then we also need to store / transmit the mapping from encodings to characters… and the set of characters used is typically close to the size of the alphabet

Page 4:

And we also looked at: …

Huffman Encoding
Assumes encoding on a per-character basis
Observation: assigning shorter codes to frequently used characters (which requires assigning longer codes to rarely used characters) can result in overall shorter encodings of strings
Problem:
• when decoding, we need to know how many bits to read off for each character
Solution:
• choose an encoding that ensures that no character encoding is the prefix of any other character encoding. An encoding tree has this property.

Page 5:

A Huffman Encoding Tree

[Figure: Huffman encoding tree. The root has weight 21; the leaf E (weight 9) hangs directly off the root, and the remaining subtree (weight 12) splits into nodes of weight 7 and 5 whose leaves are A, T, R, and N. Left branches are labeled 0, right branches 1.]

Page 6:

[Figure: the same Huffman encoding tree, with the codewords read off the root-to-leaf paths:]

A 000

T 001

R 010

N 011

E 1

Page 7:

Weighted path length

A 000

T 001

R 010

N 011

E 1

Weighted path length
= Len(code(A)) * f(A) + Len(code(T)) * f(T) + Len(code(R)) * f(R) + Len(code(N)) * f(N) + Len(code(E)) * f(E)
= (3 * 3) + (2 * 3) + (3 * 3) + (4 * 3) + (9 * 1)
= 9 + 6 + 9 + 12 + 9 = 45

Claim (proof in text): no other encoding can result in a shorter weighted path length
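A small sketch that recomputes the weighted path length above, assuming the frequencies implied by the arithmetic on this slide (A:3, T:2, R:3, N:4, E:9):

    codes = {"A": "000", "T": "001", "R": "010", "N": "011", "E": "1"}
    freqs = {"A": 3, "T": 2, "R": 3, "N": 4, "E": 9}   # assumed from the slide's sums
    wpl = sum(len(codes[c]) * freqs[c] for c in codes)
    print(wpl)  # 45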

Page 8:

Taking a step back …

Why do we need compression? The rate of creation of image and video data:

image data from a digital camera: today 1k by 1.5k pixels is common = 1.5 Mbytes; need 2k by 3k to equal a 35mm slide = 6 Mbytes

video: even at a low resolution of 512 by 512 pixels, 3 bytes per pixel, 30 frames/second

Page 9:

Compression basics

video data rate: 23.6 Mbytes/second, so 2 hours of video = 169 Gbytes

MPEG-1 compresses 23.6 Mbytes/second down to 187 kbytes per second, and 169 Gbytes down to 1.3 Gbytes

compression is essential for both storage and transmission of data
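The slide's numbers can be reproduced with a little arithmetic (a sketch; the 187 kbytes/second MPEG-1 figure is taken from the slide):

    # 512 x 512 pixels, 3 bytes/pixel, 30 frames/second, 2 hours of video
    bytes_per_second = 512 * 512 * 3 * 30        # ~23.6 Mbytes/second
    two_hours = bytes_per_second * 2 * 3600      # ~169 Gbytes uncompressed
    mpeg1 = 187 * 1000 * 2 * 3600                # ~1.3 Gbytes at 187 kbytes/second
    print(bytes_per_second / 1e6, two_hours / 1e9, mpeg1 / 1e9)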

Page 10:

Compression basics

compression is very widely used: JPEG and GIF for single images; MPEG-1, -2, -4 for video sequences; zip for computer data; MP3 for sound

based on two fundamental principles: spatial coherence and temporal coherence

spatial coherence = similarity with spatial neighbours; temporal coherence = similarity with temporal neighbours

Page 11:

Basics of compression

character = basic data unit in the input stream (may represent a byte, a bit, etc.)
strings = sequences of characters
encoding = compression
decoding = decompression
codeword = data element used to represent input characters or character strings
codetable = list of codewords

Page 12:

Codeword

the encoder/compressor takes characters/strings as input and uses the codetable to decide which codewords to produce

the decoder/decompressor takes codewords as input and uses the same codetable to decide which characters/strings to produce

Page 13:

Codetable

clearly the encoder must pass the encoded data to the decoder as a series of codewords

it must also pass the codetable; the codetable can be passed explicitly or implicitly, that is, we either:

pass it across
agree on it beforehand (hard-wired)
recreate it from the codewords (clever!)

Page 14:

Basic definitions

compression ratio = size of original data / size of compressed data; basically, the higher the compression ratio the better

lossless compression: output data is exactly the same as the input data; essential for encoding computer-processed data

lossy compression: output data is not the same as the input data; acceptable for data that is only viewed or heard
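As a quick illustration of the compression ratio definition above, using the video numbers from a few slides back (a sketch):

    def compression_ratio(original_size, compressed_size):
        # Higher is better.
        return original_size / compressed_size

    print(compression_ratio(169e9, 1.3e9))  # ~130:1 for the MPEG-1 example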

Page 15:

Lossless versus lossy

human visual system less sensitive to high frequency losses and to losses in color

lossy compression acceptable for visual data

degree of loss is usually a parameter of the compression algorithm

tradeoff: loss versus compression; higher compression => more loss, lower compression => less loss

Page 16:

Symmetric versus asymmetric

symmetric: encoding time == decoding time; essential for real-time applications (e.g. video or audio on demand)

asymmetric: encoding time >> decoding time; OK for write-once, read-many situations

Page 17:

Entropy encoding

compression that does not take into account what is being compressed

normally it is also lossless; the most common types of entropy encoding are:

run length encoding, Huffman encoding, modified Huffman (fax…), Lempel-Ziv

Page 18:

Source encoding

takes into account the type of data (e.g. visual); normally lossy, but can also be lossless; the most common types in use:

JPEG, GIF = single images; MPEG = sequence of images (video); MP3 = sound sequence

often uses entropy encoding as a sub-routine

Page 19:

Run length encoding

one of the simplest and earliest types of compression

takes advantage of repeating data (called runs); runs are represented by a count along with the original data, e.g. AAAABB => 4A2B

do you run length encode a single character? no; use a special prefix character to mark the start of runs

Page 20:

Run length encoding

runs are represented as <prefix char><repeat count><run char>

the prefix char itself becomes <prefix char>1<prefix char>

want a prefix char that is not too common; an example of an early use is the MacPaint file format; run length encoding is lossless and has fixed-length codewords
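A minimal sketch of run length encoding with a prefix character, as described above (the '#' prefix, the minimum run length of 3, and the way longer prefix runs are escaped are illustrative choices, not details of MacPaint):

    PREFIX = "#"  # an illustrative, hopefully uncommon, prefix character

    def rle_encode(text, min_run=3):
        out, i = [], 0
        while i < len(text):
            ch, run = text[i], 1
            while i + run < len(text) and text[i + run] == ch:
                run += 1
            if ch == PREFIX:
                out.append(f"{PREFIX}{run}{PREFIX}")   # escape the prefix itself
            elif run >= min_run:
                out.append(f"{PREFIX}{run}{ch}")       # <prefix><count><char>
            else:
                out.append(ch * run)                   # short runs left alone
            i += run
        return "".join(out)                            # counts > 9 would need more care

    print(rle_encode("AAAABB"))  # #4ABB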

Page 21:

MacPaint File Format

Page 22:

Run length encoding

works best for images with solid background

good example of such an image is a cartoon

does not work as well for natural images; does not work well for English text; however, it is almost always a part of a larger compression system

Page 23:

Huffman encoding

assume we know the frequency of each character in the input stream

then encode each character as a variable length bit string, with the length inversely proportional to the character frequency

variable length codewords are used; early example is Morse code

Huffman produced an algorithm for assigning codewords optimally

Page 24:

Huffman encoding

input = probabilities of occurrence of each input character (frequencies of occurrence)

output is a binary tree:
each leaf node is an input character
each branch is a zero or one bit
the codeword for a leaf is the concatenation of the bits on the path from the root to the leaf
a codeword is a variable length bit string

a very good compression ratio (optimal)?

Page 25:

Huffman encoding

Basic algorithm:
Mark all characters as free tree nodes

While there is more than one free node

Take two nodes with lowest freq. of occurrence

Create a new tree node with these nodes as children and with freq. equal to the sum of their freqs.

Remove the two children from the free node list.

Add the new parent to the free node list

Last remaining free node is the root of the binary tree used for encoding/decoding
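The free-node loop above maps naturally onto a priority queue. A minimal sketch, assuming the input is a dict of character frequencies (the function name build_huffman and the tuple layout are mine, not from the slides):

    import heapq
    from itertools import count

    def build_huffman(freqs):
        # Each heap entry is (frequency, tiebreaker, tree); a tree is either a
        # character (leaf) or a (left, right) pair (internal node).
        tie = count()
        heap = [(f, next(tie), ch) for ch, f in freqs.items()]
        heapq.heapify(heap)
        while len(heap) > 1:                       # more than one free node
            f1, _, t1 = heapq.heappop(heap)        # two lowest-frequency nodes
            f2, _, t2 = heapq.heappop(heap)
            heapq.heappush(heap, (f1 + f2, next(tie), (t1, t2)))
        codes = {}
        def walk(node, prefix):
            if isinstance(node, tuple):            # internal node: 0 = left, 1 = right
                walk(node[0], prefix + "0")
                walk(node[1], prefix + "1")
            else:
                codes[node] = prefix or "0"        # degenerate one-symbol alphabet
        walk(heap[0][2], "")
        return codes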

Page 26:

Huffman example

a series of colours in an 8 by 8 screen; the colours are red, green, cyan, blue, magenta, yellow, and black; the sequence is

rkkkkkkk gggmcbrr kkkrrkkk bbbmybbr kkrrrrgg gggggggr kkbcccrr grrrrgrr
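Counting the colour frequencies is the first step; a small sketch using the sequence above (feeding these counts into the build_huffman sketch on the previous slide would then give the codewords):

    from collections import Counter

    screen = ("rkkkkkkk gggmcbrr kkkrrkkk bbbmybbr "
              "kkrrrrgg gggggggr kkbcccrr grrrrgrr").replace(" ", "")
    print(Counter(screen))
    # should give r=19, k=17, g=14, b=7, c=4, m=2, y=1 (64 pixels in total)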

Page 27:

Huffman example

Page 28:

Huffman example

Page 29:

Huffman example

Page 30:

Huffman example

Page 31:

Fixed versus variable length codewords

run length codewords are fixed length; Huffman codewords are variable length, with length inversely proportional to frequency

all variable length compression schemes have the prefix property: one code cannot be the prefix of another; the binary tree structure guarantees that this is the case (a leaf node is a leaf node!)

Page 32:

Huffman encoding

advantages:
maximum compression ratio, assuming correct probabilities of occurrence
easy to implement and fast

disadvantages:
need two passes for both encoder and decoder: one to create the frequency distribution, one to encode/decode the data
can avoid this by sending the tree (takes time) or by having unchanging frequencies

Page 33:

Modified Huffman encoding

if we know frequency of occurrences, then Huffman works very well

consider case of a fax; mostly long white spaces with short bursts of black

do the following:
run length encode each string of bits on a line
Huffman encode these run length codewords, using a predefined frequency distribution

a combination: run length first, then Huffman
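A toy sketch of that pipeline: run length encode a line of pixels, then replace each run length with a predefined prefix code (the code table below is made up for illustration; the real fax tables are much larger and distinguish white and black runs):

    # Hypothetical prefix codes for a few run lengths (illustration only).
    RUN_CODES = {1: "111", 2: "110", 3: "101", 4: "100",
                 5: "011", 6: "010", 7: "0011", 8: "0010"}

    def fax_encode(line):
        out, i = [], 0
        while i < len(line):
            run = 1
            while i + run < len(line) and line[i + run] == line[i]:
                run += 1
            out.append(RUN_CODES[run])   # toy table: only runs up to 8 are handled
            i += run
        return "".join(out)

    print(fax_encode("00000001100000"))  # runs of 7, 2, 5 -> 0011 110 011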

Page 34:

Lempel-Ziv-Welch (LZW)

previous methods worked only on single characters; LZW works by encoding strings
some strings are replaced by a single codeword
for now, assume the codeword size is fixed (12 bits)
for 8-bit characters, the first 256 (or fewer) entries in the table are reserved for the characters; the rest of the table (257-4096) represents strings

Page 35:

LZW compression

the trick is that the string-to-codeword mapping is created dynamically by the encoder

it is also recreated dynamically by the decoder, so the code table need not be passed between the two

LZW is a lossless compression algorithm; the degree of compression is hard to predict: it depends on the data, but gets better as the codeword table contains more strings

Page 36:

LZW encoder

Page 37:

Demonstrations

A nice animated version of Lempel-Ziv

Page 38:

LZW encoder example

compress the string BABAABAAA
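A minimal LZW encoder sketch (the function name and table layout are mine); on "BABAABAAA" it emits 66, 65, 256, 257, 65, 260, which is exactly the codeword sequence decoded a few slides below:

    def lzw_encode(text):
        table = {chr(i): i for i in range(256)}   # codes 0-255: single characters
        next_code = 256                           # codes 256 and up: strings
        w, out = "", []
        for ch in text:
            if w + ch in table:
                w += ch                           # keep growing the current string
            else:
                out.append(table[w])              # emit code for longest known string
                table[w + ch] = next_code         # remember the new string
                next_code += 1
                w = ch
        if w:
            out.append(table[w])                  # flush whatever is left
        return out

    print(lzw_encode("BABAABAAA"))  # [66, 65, 256, 257, 65, 260]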

Page 39:

LZW decoder

Page 40:

Lempel-Ziv compression

a lossless compression algorithm
all encodings have the same length, but may represent more than one character

uses a "dictionary" approach: keeps track of characters and character strings already encountered

Page 41:

LZW decoder example

decompress the string <66><65><256><257><65><260>
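A matching decoder sketch; it rebuilds the same table as it goes, including the special case where a codeword refers to the entry still being built (as code 260 does here):

    def lzw_decode(codes):
        table = {i: chr(i) for i in range(256)}   # same initial table as the encoder
        next_code = 256
        prev = table[codes[0]]
        out = [prev]
        for code in codes[1:]:
            if code in table:
                entry = table[code]
            else:
                entry = prev + prev[0]            # codeword not yet in the table
            out.append(entry)
            table[next_code] = prev + entry[0]    # recreate the encoder's new entry
            next_code += 1
            prev = entry
        return "".join(out)

    print(lzw_decode([66, 65, 256, 257, 65, 260]))  # BABAABAAA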

Page 42:

LZW Issues

compression gets better as the code table grows; what happens when all 4096 locations in the string table are used?

there are a number of options, but encoder and decoder must agree to do the same thing:
do not add any more entries to the table (leave it as is)
clear the codeword table and start again
clear the codeword table and start again with a larger table / longer codewords (GIF format)

Page 43:

LZW advantages/disadvantages

advantages:
simple, fast, and good compression
can do compression in one pass
dynamic codeword table built for each file
decompression recreates the codeword table, so it does not need to be passed

disadvantages:
not the optimum compression ratio
actual compression hard to predict

Page 44:

Entropy methods

all previous methods are lossless and entropy based

lossless methods are essential for computer data (zip, gzip, etc.)

the combination of run length encoding and Huffman coding is a standard tool

entropy methods are often used as a subroutine by other, lossy methods (JPEG, MPEG)

Page 45:

Lempel-Ziv compression

a lossless compression algorithm
all encodings have the same length, but may represent more than one character

uses a "dictionary" approach: keeps track of characters and character strings already encountered