Data Compressor - Huffman Encoding and Decoding

Huffman Encoding

• Compression
  • Typically, in files and messages, each character requires 1 byte (8 bits)
  • Already wasting 1 bit for most purposes!

• Question
  • What's the smallest number of bits that can be used to store an arbitrary piece of text?

• Idea
  • Find the frequency of occurrence of each character (see the sketch after this list)
  • Encode frequent characters with short bit strings
  • Encode rarer characters with longer bit strings
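As a concrete illustration of the frequency-counting step, here is a minimal Python sketch; the sample string and variable names are my own, not from the slides:

    # Count how often each character occurs; frequent characters will later
    # receive the shortest bit strings.
    from collections import Counter

    text = "this is an example of huffman encoding"
    freqs = Counter(text)
    print(freqs.most_common(3))   # the three most frequent characters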

Huffman's Algorithm

• 1952
• Repeatedly merges trees - maintains a forest
• Tree weight - the sum of its leaves' frequencies
• For C characters to code, start with C single-node trees

• Select two trees, T1 and T2, of smallest weights and merge them

• C - 1 merge operations
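A minimal sketch of this merge loop in Python, using a binary heap to hold the forest (the names Node and build_huffman_tree are illustrative, and the input is assumed to contain at least one character):

    import heapq
    from collections import Counter

    class Node:
        """A tree in the forest; 'char' is set only for leaf nodes."""
        def __init__(self, weight, char=None, left=None, right=None):
            self.weight, self.char, self.left, self.right = weight, char, left, right

    def build_huffman_tree(text):
        freqs = Counter(text)
        # Start with C single-node trees, one per character.
        heap = [(w, i, Node(w, c)) for i, (c, w) in enumerate(freqs.items())]
        heapq.heapify(heap)
        next_id = len(heap)        # tie-breaker so Node objects are never compared
        # C - 1 merges: repeatedly take the two smallest-weight trees and join them.
        while len(heap) > 1:
            w1, _, t1 = heapq.heappop(heap)
            w2, _, t2 = heapq.heappop(heap)
            heapq.heappush(heap, (w1 + w2, next_id, Node(w1 + w2, None, t1, t2)))
            next_id += 1
        return heap[0][2]          # root of the Huffman tree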

Huffman Encoding - Encoding

• Use a tree
• Encode by following the tree to a leaf (see the sketch after this list)
• e.g.
  • E is 00
  • S is 011
• Frequent characters (E, T) get 2-bit encodings
• Others (A, S, N, O) get 3-bit encodings
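Reading the codes off the tree can be sketched as a traversal that records 0 for each left branch and 1 for each right branch; this reuses the Node tree from the sketch above, and make_code_table is an illustrative name:

    def make_code_table(root, prefix=""):
        if root.char is not None:                # leaf: the path taken is its code
            return {root.char: prefix or "0"}    # lone-character edge case
        table = {}
        table.update(make_code_table(root.left, prefix + "0"))
        table.update(make_code_table(root.right, prefix + "1"))
        return table

    # codes = make_code_table(build_huffman_tree(text))
    # Frequent characters come out with the shortest codes.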

Huffman Encoding - Encoding

• Use a tree
  • Inefficient in practice
• Use a direct-addressed lookup table (sketched after the table below)
• Finding the optimal encoding
  • Smallest number of bits to represent arbitrary text

[Figure: direct-addressed lookup table mapping each character to its code, e.g. A = 010, E = 00, with further entries for B ... N, S, T]
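Encoding then becomes one table lookup per character; in a Python sketch a dict stands in for the direct-addressed table (encode is an illustrative name, and codes is the table built in the sketch above):

    def encode(text, codes):
        # One lookup per character instead of a tree walk per character.
        return "".join(codes[ch] for ch in text)

    # bits = encode(text, codes)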

• A divide-and-conquer approach might have us asking which characters should appear in the left and right subtrees and trying to build the tree from the top down.

• A greedy approach places our n characters in n sub-trees and starts by combining the two least-weight nodes into a tree whose root is assigned the sum of the two leaf weights as its weight.

Huffman Encoding

• Divide and conquer
  • Decide on a root - n choices
  • Decide on roots for sub-trees - n choices
  • Repeat n times
  • O(n!)

• Greedy approach
  • Sort characters by frequency
  • Form the two lowest-weight nodes into a sub-tree
  • Sub-tree weight = sum of the weights of its nodes
  • Move the new tree to its correct place (sketched below)
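One greedy step on a frequency-sorted forest might be sketched as follows; forest is assumed to be a list of (weight, tree) pairs in ascending weight order, and merge_step is an illustrative name (a heap, as in the earlier sketch, avoids the list-insertion cost):

    import bisect

    def merge_step(forest):
        """Combine the two lowest-weight trees and move the result to its place."""
        (w1, t1), (w2, t2) = forest[0], forest[1]
        combined = (w1 + w2, ("node", t1, t2))   # sub-tree weight = sum of weights
        rest = forest[2:]
        # Binary search locates the correct place for the new tree.
        pos = bisect.bisect_left([w for w, _ in rest], combined[0])
        rest.insert(pos, combined)
        return rest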

Standard Coding Scheme

Binary Tree Representation

• For a character set of C characters, the standard fixed-length coding needs ⌈log₂ C⌉ bits (see the worked example after this list)

• Fixed-length code can be represented by a binary tree where characters are stored only in leaf nodes - binary trie

• Each character path - start at the root, follow the branches, record 0 for the left branch and 1 for the right branch

• Optimal code is always a full tree - all nodes are either leaves or have two children
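A quick numeric check of that bound, as a Python sketch (the character-set sizes used here are just examples):

    import math

    for C in (2, 6, 26, 128):
        # ceil(log2 C) bits suffice for a fixed-length code over C characters
        print(C, math.ceil(math.log2(C)))   # -> 1, 3, 5, 7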

Representation by a Binary Trie

Improved Binary Trie

Prefix Code

• A fixed-length character code with characters placed only at the leaves guarantees that any bit sequence can be decoded unambiguously

• Prefix code - character codes may have varying lengths as long as no code is a prefix of another code

• That means that characters can appear only at the leaves
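The prefix property is easy to verify for a code table such as the one produced by make_code_table above; here is a small sketch (is_prefix_free is an illustrative name):

    def is_prefix_free(codes):
        words = sorted(codes.values())
        # After sorting, any prefix pair that exists must appear as neighbours.
        return all(not b.startswith(a) for a, b in zip(words, words[1:]))

    # is_prefix_free({"E": "00", "T": "10", "A": "010"})   # -> True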

Optimal Prefix Code Tree

Optimal Prefix Code Cost
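For reference, the cost referred to here is conventionally defined as B(T) = Σ f(c) · depth_T(c), summed over all characters c, i.e. the total number of bits needed to encode the text; Huffman's algorithm builds a prefix code tree that minimises this cost.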

Huffman's Algorithm Example - Steps I to VII (worked example shown in figures)

Huffman Encoding - Operation

Initial sequence, sorted by frequency

Combine the lowest two into a sub-tree

Move it to the correct place

After shifting the sub-tree to its correct place ...

Huffman Encoding - Operation

Combine the next lowest pair

Move the sub-tree to its correct place

Move the new tree to the correct place ...

Huffman Encoding - Operation

Now the lowest two are the “14” sub-tree and D

Combine and move to correct place

Move the new tree to the correct place ...

Huffman Encoding - Operation

Now the lowest two are the “25” and “30” trees

Combine and move to correct place

Huffman Encoding - Operation

Combine last two trees

• How do we decode a Huffman-encoded bit string? With these variable-length codes, it is not obvious how to break the encoded string of bits back into characters!

• The decoding procedure is deceptively simple. Starting with the first bit in the stream, one then uses successive bits from the stream to determine whether to go left or right in the decoding tree. When we reach a leaf of the tree, we've decoded a character, so we place that character onto the (uncompressed) output stream. The next bit in the input stream is the first bit of the next character.
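That procedure might be sketched as follows, reusing the Node tree and the encode/make_code_table helpers from the earlier sketches (decode is an illustrative name):

    def decode(bits, root):
        out, node = [], root
        for bit in bits:
            # Use each successive bit to go left (0) or right (1) in the tree.
            node = node.left if bit == "0" else node.right
            if node.char is not None:      # reached a leaf: one character decoded
                out.append(node.char)
                node = root                # the next bit starts the next character
        return "".join(out)

    # Round trip with the earlier sketches:
    # tree = build_huffman_tree(text); codes = make_code_table(tree)
    # assert decode(encode(text, codes), tree) == text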

Huffman Encoding - Decoding

Huffman Encoding - Time Complexity

• Sort keys: O(n log n)

• Repeat n times
  • Form new sub-tree: O(1)
  • Move sub-tree: O(log n) (binary search)
  • Total: O(n log n)

• Overall: O(n log n)