The LZ family LZ77 LZR LZSS LZB LZH – used by zip and unzip LZ78 LZW – Unix compress LZC – Unix compress LZT LZMW LZJ LZFG

The LZ family

LZ77 and its variants: LZR, LZSS, LZB, LZH (used by zip and unzip)

LZ78 and its variants: LZW (Unix compress), LZC (Unix compress), LZT, LZMW, LZJ, LZFG

Overview of LZ family

To demonstrate, take a simple alphabet containing only two letters, a and b, and create a sample stream of text.

LZ family overview

Rule: separate the stream of characters into pieces of text so that each piece is the shortest string of characters that we have not seen so far.
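The rule above can be sketched in a few lines of Python. This is only an illustration: the function name `lz_parse` and the sample stream are assumptions, since the slide's own sample text did not survive in the transcript.

```python
def lz_parse(stream):
    """Split a stream into pieces, each the shortest string not seen so far."""
    seen = set()      # pieces produced so far
    pieces = []
    current = ""      # piece being grown character by character
    for ch in stream:
        current += ch
        if current not in seen:
            # First time we meet this string: it becomes a new piece.
            seen.add(current)
            pieces.append(current)
            current = ""
    if current:
        # The input ended in the middle of a piece that was already seen.
        pieces.append(current)
    return pieces


# Assumed sample stream over the two-letter alphabet {a, b}.
print(lz_parse("aaababbbaaabaaaaaaabaabb"))
# ['a', 'aa', 'b', 'ab', 'bb', 'aaa', 'ba', 'aaaa', 'aab', 'aabb']
```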

Sender : The Compressor

Before compression, the pieces of text from the breaking-down process are indexed from 1 to n:

LZ

Indices are used to number the pieces of data. The empty string (start of text) has index 0. The piece indexed by 1 is a. Thus a, together with the initial empty string, must be numbered 0a. String 2, aa, will be numbered 1a, because it contains a, whose index is 1, and the new character a.
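The numbering scheme can be sketched as follows. The function name `lz78_encode` and the sample stream are illustrative assumptions, not taken from the slides; each piece is emitted as a pair (index of its already-seen prefix, new character), with index 0 standing for the empty string.

```python
def lz78_encode(stream):
    """Return (prefix_index, new_char) pairs; index 0 is the empty string."""
    index_of = {"": 0}   # piece -> index, in order of first appearance
    pairs = []
    current = ""
    for ch in stream:
        if current + ch in index_of:
            current += ch            # keep growing a piece we already know
            continue
        pairs.append((index_of[current], ch))
        index_of[current + ch] = len(index_of)   # next free index
        current = ""
    if current:
        # Input ended inside a known piece: emit it as prefix + last char.
        pairs.append((index_of[current[:-1]], current[-1]))
    return pairs


pairs = lz78_encode("aaababbbaaabaaaaaaabaabb")   # assumed sample stream
print(" ".join(f"{i}{c}" for i, c in pairs))
# 0a 1a 0b 1b 3b 2a 3a 6a 2b 9b
```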

LZ

The process of renaming pieces of text starts to pay off: small integers replace what were once long strings of characters. We can now throw away our old stream of text and send the encoded information to the receiver.

Bit Representation of Coded Information

Now we want to calculate the number of bits needed.

Each chunk is an integer and a letter.

The number of bits for the integer depends on the size of the table permitted in the dictionary.

Every character will occupy 8 bits because it will be represented in US ASCII format.

Compression good?

In a long string of text, the number of bits needed to transmit the coded information is small compared to the actual length of the text.

Example: 12 bits (a 4-bit index plus an 8-bit character) to transmit the code 2b instead of the 24 bits (8 + 8 + 8) needed for the actual text aab.
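The arithmetic can be checked with a small sketch. It assumes a table of 10 pieces plus the empty string (indices 0 to 10) and 8-bit characters; the helper name `chunk_bits` and the 24-character stream length are illustrative assumptions.

```python
def chunk_bits(num_pieces, char_bits=8):
    """Bits per coded chunk: a table index followed by one character."""
    # Indices run from 0 (empty string) to num_pieces, so the largest
    # value to represent is num_pieces itself.
    index_bits = max(1, num_pieces.bit_length())
    return index_bits + char_bits


per_chunk = chunk_bits(10)     # 4 index bits + 8 character bits
print(per_chunk)               # 12
print(10 * per_chunk)          # 120 bits for 10 coded chunks
print(24 * 8)                  # 192 bits for 24 raw 8-bit characters
```

On a stream this short the saving is modest; the gain grows as pieces get longer while the index stays a few bits wide.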

Receiver: The Decompressor (Implementation)

The receiver knows exactly where the boundaries are, so there is no problem in reconstructing the stream of text.

It is preferable to decompress the file in one pass; otherwise, we will encounter a problem with temporary storage.
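A one-pass receiver might look like the sketch below. It assumes the coded information arrives as (index, character) pairs with index 0 meaning the empty string; the function name `lz78_decode` and the pair list are illustrative assumptions.

```python
def lz78_decode(pairs):
    """Rebuild the text from (prefix_index, new_char) pairs in one pass."""
    table = [""]     # table[0] is the empty string
    out = []
    for index, ch in pairs:
        piece = table[index] + ch   # known prefix plus the new character
        table.append(piece)         # the piece gets the next free index
        out.append(piece)
    return "".join(out)


# Assumed coded stream: 0a 1a 0b 1b 3b 2a 3a 6a 2b 9b
coded = [(0, "a"), (1, "a"), (0, "b"), (1, "b"), (3, "b"),
         (2, "a"), (3, "a"), (6, "a"), (2, "b"), (9, "b")]
print(lz78_decode(coded))   # aaababbbaaabaaaaaaabaabb
```

The receiver rebuilds the same table as the sender while it reads, so nothing has to be stored and revisited in a second pass.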

Lempel-Ziv applet

See http://www.cs.mcgill.ca/~cs251/OldCourses/1997/topic23/#JavaApplet

Lempel-Ziv-Welch (LZW)

Previous methods worked only on characters.

LZW works by encoding strings: some strings are replaced by a single codeword.

For now, assume the codeword length is fixed (12 bits).

For 8-bit characters, the first 256 (or fewer) entries in the table are reserved for the characters.

The rest of the table (entries 256 to 4095) represents strings.

LZW compression

The trick is that the string-to-codeword mapping is created dynamically by the encoder.

It is also recreated dynamically by the decoder.

The code table need not be passed between the two.

LZW is a lossless compression algorithm.

The degree of compression is hard to predict.

It depends on the data, but gets better as the codeword table contains more strings.

LZW encoder

Initialize table with single character strings
STRING = first input character
WHILE not end of input stream
    CHARACTER = next input character
    IF STRING + CHARACTER is in the string table
        STRING = STRING + CHARACTER
    ELSE
        Output the code for STRING
        Add STRING + CHARACTER to the string table
        STRING = CHARACTER
END WHILE
Output code for STRING
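The pseudocode translates almost line for line into Python. The sketch below makes two simplifying assumptions that are not in the slides: input characters have code points below 256, and the table is allowed to grow without the 4096-entry limit. The name `lzw_encode` is illustrative.

```python
def lzw_encode(text):
    """LZW-encode a string into a list of integer codes."""
    # Initialize table with single character strings (codes 0..255).
    table = {chr(i): i for i in range(256)}
    if not text:
        return []
    string = text[0]
    codes = []
    for character in text[1:]:
        if string + character in table:
            string += character
        else:
            codes.append(table[string])               # output code for STRING
            table[string + character] = len(table)    # next free code
            string = character
    codes.append(table[string])                       # flush the last STRING
    return codes


print(lzw_encode("BABAABAAA"))   # [66, 65, 256, 257, 65, 260]
```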

Demonstrations

Another animated LZ algorithm … http://www.data-compression.com/lempelziv.html

LZW encoder example

compress the string BABAABAAA

LZW decoder

Lempel-Ziv compression

A lossless compression algorithm.

All codewords have the same length, but a codeword may represent more than one character.

Uses a “dictionary” approach: keeps track of characters and character strings already encountered.

LZW decoder example

decompress the string <66><65><256><257><65><260>
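A decoder that rebuilds the table on the fly can be sketched as below. The slide for the decoder carries no surviving text, so this is a common textbook formulation rather than the author's own listing; the name `lzw_decode` is an assumption. The one subtle case is a code that the decoder has not yet added to its table, which can only be the previous string plus its own first character.

```python
def lzw_decode(codes):
    """Decode a list of LZW codes back into a string."""
    table = {i: chr(i) for i in range(256)}   # codes 0..255: single characters
    if not codes:
        return ""
    previous = table[codes[0]]
    out = [previous]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == len(table):
            # Code was created by the encoder in the step we are replaying.
            entry = previous + previous[0]
        else:
            raise ValueError(f"invalid LZW code: {code}")
        out.append(entry)
        # Recreate the entry the encoder added one step earlier.
        table[len(table)] = previous + entry[0]
        previous = entry
    return "".join(out)


print(lzw_decode([66, 65, 256, 257, 65, 260]))   # BABAABAAA
```

In this example the final code 260 exercises the special case: the decoder's table only reaches 259 when the code arrives.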

LZW Issues

Compression gets better as the code table grows.

What happens when all 4096 locations in the string table are used?

There are a number of options, but the encoder and decoder must agree to do the same thing:

do not add any more entries to the table (use it as is)

clear the codeword table and start again

clear the codeword table and start again with a larger table/longer codewords (GIF format)
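The first option (stop adding entries once the table is full) is the simplest to sketch. The function name `lzw_encode_frozen` and the `max_codes` parameter are illustrative assumptions, and input characters are assumed to have code points below 256. A decoder would have to stop adding entries at the same table size for the two sides to stay in step.

```python
def lzw_encode_frozen(text, max_codes=4096):
    """LZW encoder that freezes its table once it holds max_codes entries."""
    table = {chr(i): i for i in range(256)}
    codes = []
    string = ""
    for character in text:
        if string + character in table:
            string += character
            continue
        codes.append(table[string])
        if len(table) < max_codes:
            # Room left: learn the new string as usual.
            table[string + character] = len(table)
        # Otherwise the table is full and is used as is from now on.
        string = character
    if string:
        codes.append(table[string])
    return codes


# A tiny limit of 258 codes makes the effect visible on a short input.
print(lzw_encode_frozen("BABAABAAA", max_codes=258))
# [66, 65, 256, 257, 65, 65, 65]
```

With the table frozen after two new entries, the tail of the input falls back to single-character codes, which is why compression stops improving under this option.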

LZW advantages/disadvantages

Advantages:

simple, fast, and good compression

can do compression in one pass

dynamic codeword table built for each file

decompression recreates the codeword table, so it does not need to be passed

Disadvantages:

not the optimum compression ratio

actual compression hard to predict

Entropy methods

All previous methods are lossless and entropy based.

Lossless methods are essential for computer data (zip, gzip, etc.).

A combination of run-length encoding and Huffman coding is a standard tool.

Lossless methods are often used as a subroutine by lossy methods (JPEG, MPEG).
