Greedy Algorithms (Huffman Coding) Slide 1. Huffman Coding A technique to compress data effectively...

32
Greedy Algorithms (Huffman Coding) Slide 1

Transcript of Greedy Algorithms (Huffman Coding) Slide 1. Huffman Coding A technique to compress data effectively...

Page 1: Greedy Algorithms (Huffman Coding) Slide 1. Huffman Coding A technique to compress data effectively –Usually between 20%-90% compression Lossless compression.

Greedy Algorithms(Huffman Coding)

Slide 1

Page 2: Greedy Algorithms (Huffman Coding) Slide 1. Huffman Coding A technique to compress data effectively –Usually between 20%-90% compression Lossless compression.

Huffman Coding

• A technique to compress data effectively – Usually between 20%-90% compression

• Lossless compression– No information is lost– When decompress, you get the original file

Slide 2

Original file

Compressed file

Huffman coding

Page 3: Greedy Algorithms (Huffman Coding) Slide 1. Huffman Coding A technique to compress data effectively –Usually between 20%-90% compression Lossless compression.

Huffman Coding: Applications

• Saving space– Store compressed files instead of original files

• Transmitting files or data– Send compressed data to save transmission time and power

• Encryption and decryption– Cannot read the compressed file without knowing the “key”

Slide 3

Original file

Compressed file

Huffman coding

Page 4: Greedy Algorithms (Huffman Coding) Slide 1. Huffman Coding A technique to compress data effectively –Usually between 20%-90% compression Lossless compression.

Main Idea: Frequency-Based Encoding

• Assume in this file only 6 characters appear

– E, A, C, T, K, N

• The frequencies are: Character Frequency

E 10,000

A 4,000

C 300

T 200

K 100

N 100• Option I (No Compression)

– Each character = 1 Byte (8 bits)– Total file size = 14,700 * 8 = 117,600 bits

• Option 2 (Fixed size compression)– We have 6 characters, so we need

3 bits to encode them– Total file size = 14,700 * 3 = 44,100 bits

Character Fixed Encoding

E 000

A 001

C 010

T 100

K 110

N 111

Page 5: Greedy Algorithms (Huffman Coding) Slide 1. Huffman Coding A technique to compress data effectively –Usually between 20%-90% compression Lossless compression.

Main Idea: Frequency-Based Encoding(Cont’d)

• Assume in this file only 6 characters appear

– E, A, C, T, K, N

• The frequencies are: Character Frequency

E 10,000

A 4,000

C 300

T 200

K 100

N 100• Option 3 (Huffman compression)– Variable-length compression– Assign shorter codes to more frequent characters and

longer codes to less frequent characters

– Total file size:

Char.

HuffmanEncoding

E 0

A 10

C 110

T 1110

K 11110

N 11111

(10,000 x 1) + (4,000 x 2) + (300 x 3) + (200 x 4) + (100 x 5) + (100 x 5) = 20,700 bits

Page 6: Greedy Algorithms (Huffman Coding) Slide 1. Huffman Coding A technique to compress data effectively –Usually between 20%-90% compression Lossless compression.

Huffman Coding

• A variable-length coding for characters– More frequent characters shorter codes

– Less frequent characters longer codes

• It is not like ASCII coding where all characters have the same coding length (8 bits)

• Two main questions – How to assign codes (Encoding process)?

– How to decode (from the compressed file, generate the original file) (Decoding process)?

Slide 6

Page 7: Greedy Algorithms (Huffman Coding) Slide 1. Huffman Coding A technique to compress data effectively –Usually between 20%-90% compression Lossless compression.

Decoding for fixed-length codes is much easier

Slide 7

Character Fixed-length Encoding

E 000

A 001

C 010

T 100

K 110

N 111

010001100110111000

010 001 100 110 111 000

Divide into 3’s

C A T K N E

Decode

Page 8: Greedy Algorithms (Huffman Coding) Slide 1. Huffman Coding A technique to compress data effectively –Usually between 20%-90% compression Lossless compression.

Decoding for variable-length codes is not that easy…

Slide 8

Character Variable-length Encoding

E 0

A 00

C 001

… …

… …

… …

000001

It means what???

EEEC EAC AEC

Huffman encoding guarantees to avoid this uncertainty …Always have a single decoding

Page 9: Greedy Algorithms (Huffman Coding) Slide 1. Huffman Coding A technique to compress data effectively –Usually between 20%-90% compression Lossless compression.

Huffman Algorithm

• Step 1: Get Frequencies– Scan the file to be compressed and count the occurrence of

each character

– Sort the characters based on their frequency

• Step 2: Build Tree & Assign Codes– Build a Huffman-code tree (binary tree)

– Traverse the tree to assign codes

• Step 3: Encode (Compress)– Scan the file again and replace each character by its code

• Step 4: Decode (Decompress)

– Huffman tree is the key to decompress the file

Slide 9

Page 10: Greedy Algorithms (Huffman Coding) Slide 1. Huffman Coding A technique to compress data effectively –Usually between 20%-90% compression Lossless compression.

Step 1: Get Frequencies

Slide 10

Eerie eyes seen near lake.Input File:

Page 11: Greedy Algorithms (Huffman Coding) Slide 1. Huffman Coding A technique to compress data effectively –Usually between 20%-90% compression Lossless compression.

Step 2: Build Huffman Tree & Assign Codes

• It is a binary tree in which each character is a leaf node– Initially each node is a separate root

• At each step– Select two roots with smallest frequency and connect them to

a new parent (Break ties arbitrary) [The greedy choice]

– The parent will get the sum of frequencies of the two child nodes

• Repeat until you have one root

Slide 11

Page 12: Greedy Algorithms (Huffman Coding) Slide 1. Huffman Coding A technique to compress data effectively –Usually between 20%-90% compression Lossless compression.

Example

Slide 12

Each char. has a leaf node with its frequency

Page 13: Greedy Algorithms (Huffman Coding) Slide 1. Huffman Coding A technique to compress data effectively –Usually between 20%-90% compression Lossless compression.

Find the smallest two frequencies…Replace them with their parent

Slide 13

Page 14: Greedy Algorithms (Huffman Coding) Slide 1. Huffman Coding A technique to compress data effectively –Usually between 20%-90% compression Lossless compression.

Slide 14

Find the smallest two frequencies…Replace them with their parent

Page 15: Greedy Algorithms (Huffman Coding) Slide 1. Huffman Coding A technique to compress data effectively –Usually between 20%-90% compression Lossless compression.

Slide 15

Find the smallest two frequencies…Replace them with their parent

Page 16: Greedy Algorithms (Huffman Coding) Slide 1. Huffman Coding A technique to compress data effectively –Usually between 20%-90% compression Lossless compression.

Slide 16

Find the smallest two frequencies…Replace them with their parent

Page 17: Greedy Algorithms (Huffman Coding) Slide 1. Huffman Coding A technique to compress data effectively –Usually between 20%-90% compression Lossless compression.

Slide 17

Find the smallest two frequencies…Replace them with their parent

Page 18: Greedy Algorithms (Huffman Coding) Slide 1. Huffman Coding A technique to compress data effectively –Usually between 20%-90% compression Lossless compression.

Slide 18

Find the smallest two frequencies…Replace them with their parent

Page 19: Greedy Algorithms (Huffman Coding) Slide 1. Huffman Coding A technique to compress data effectively –Usually between 20%-90% compression Lossless compression.

Slide 19

Find the smallest two frequencies…Replace them with their parent

Page 20: Greedy Algorithms (Huffman Coding) Slide 1. Huffman Coding A technique to compress data effectively –Usually between 20%-90% compression Lossless compression.

Slide 20

Find the smallest two frequencies…Replace them with their parent

Page 21: Greedy Algorithms (Huffman Coding) Slide 1. Huffman Coding A technique to compress data effectively –Usually between 20%-90% compression Lossless compression.

Slide 21

Find the smallest two frequencies…Replace them with their parent

Page 22: Greedy Algorithms (Huffman Coding) Slide 1. Huffman Coding A technique to compress data effectively –Usually between 20%-90% compression Lossless compression.

Slide 22

Find the smallest two frequencies…Replace them with their parent

Page 23: Greedy Algorithms (Huffman Coding) Slide 1. Huffman Coding A technique to compress data effectively –Usually between 20%-90% compression Lossless compression.

Slide 23

Find the smallest two frequencies…Replace them with their parent

Page 24: Greedy Algorithms (Huffman Coding) Slide 1. Huffman Coding A technique to compress data effectively –Usually between 20%-90% compression Lossless compression.

Slide 24

Now we have a single root…This is the Huffman Tree

Page 25: Greedy Algorithms (Huffman Coding) Slide 1. Huffman Coding A technique to compress data effectively –Usually between 20%-90% compression Lossless compression.

Lets Analyze Huffman Tree

Slide 25

• All characters are at the leaf nodes

• The number at the root = # of characters in the file

• High-frequency chars (E.g., “e”) are near the root

• Low-frequency chars are far from the root

Page 26: Greedy Algorithms (Huffman Coding) Slide 1. Huffman Coding A technique to compress data effectively –Usually between 20%-90% compression Lossless compression.

Lets Assign Codes

• Traverse the tree

– Any left edge add label 0

– As right edge add label 1

• The code for each character is its root-to-leaf label sequence

Slide 26

Page 27: Greedy Algorithms (Huffman Coding) Slide 1. Huffman Coding A technique to compress data effectively –Usually between 20%-90% compression Lossless compression.

Slide 27

01

0

0

0

0

0

0 0

1

1

11

1

1

1

10

001 1

• Traverse the tree

– Any left edge add label 0

– As right edge add label 1

• The code for each character is its root-to-leaf label sequence

Lets Assign Codes

Page 28: Greedy Algorithms (Huffman Coding) Slide 1. Huffman Coding A technique to compress data effectively –Usually between 20%-90% compression Lossless compression.

Slide 28

• Traverse the tree

– Any left edge add label 0

– As right edge add label 1

• The code for each character is its root-to-leaf label sequence

Lets Assign Codes

Coding Table

Page 29: Greedy Algorithms (Huffman Coding) Slide 1. Huffman Coding A technique to compress data effectively –Usually between 20%-90% compression Lossless compression.

Huffman Algorithm

• Step 1: Get Frequencies– Scan the file to be compressed and count the occurrence of

each character

– Sort the characters based on their frequency

• Step 2: Build Tree & Assign Codes– Build a Huffman-code tree (binary tree)

– Traverse the tree to assign codes

• Step 3: Encode (Compress)– Scan the file again and replace each character by its code

• Step 4: Decode (Decompress)

– Huffman tree is the key to decompess the file

Slide 29

Page 30: Greedy Algorithms (Huffman Coding) Slide 1. Huffman Coding A technique to compress data effectively –Usually between 20%-90% compression Lossless compression.

Generate the encoded file

Step 3: Encode (Compress) The File

Slide 30

Eerie eyes seen near lake.

Input File:

Coding Table

+

000010 1100000110 ….

Notice that no code is prefix to any other code Ensures the decoding will be unique (Unlike Slide 8)

Page 31: Greedy Algorithms (Huffman Coding) Slide 1. Huffman Coding A technique to compress data effectively –Usually between 20%-90% compression Lossless compression.

Step 4: Decode (Decompress)

• Must have the encoded file + the coding tree

• Scan the encoded file– For each 0 move left in the tree– For each 1 move right– Until reach a leaf node Emit that character and go back to the root

Slide 31

000010 1100000110 ….

+

Eerie …

Generate the original file

Page 32: Greedy Algorithms (Huffman Coding) Slide 1. Huffman Coding A technique to compress data effectively –Usually between 20%-90% compression Lossless compression.

Huffman Algorithm

• Step 1: Get Frequencies– Scan the file to be compressed and count the occurrence of

each character

– Sort the characters based on their frequency

• Step 2: Build Tree & Assign Codes– Build a Huffman-code tree (binary tree)

– Traverse the tree to assign codes

• Step 3: Encode (Compress)– Scan the file again and replace each character by its code

• Step 4: Decode (Decompress)

– Huffman tree is the key to decompess the file

Slide 32