Data Compression1 File Compression Huffman Tries ABRACADABRA 01011011010000101001011011010.

Data Compression 1

Data Compression

• File Compression
• Huffman Tries

[Figure: encoding trie with codes A = 010, B = 11, C = 00, D = 10, R = 011]

ABRACADABRA 01011011010000101001011011010

Data Compression 2

File Compression

• Text files are usually stored by representing each character with an 8-bit code: ASCII defines 7-bit codes, and each one is normally stored in an 8-bit byte (type man ascii in a Unix shell to see the ASCII encoding)

• The ASCII encoding is an example of fixed-length encoding, where each character is represented with the same number of bits

• In order to reduce the space required to store a text file, we can exploit the fact that some characters are more likely to occur than others

• Variable-length encoding uses binary codes of different lengths for different characters; thus, we can assign fewer bits to frequently used characters and more bits to rarely used characters.
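A minimal sketch of the fixed-length baseline, using the deck's running example ABRACADABRA as the assumed input. It counts the bits needed at 8 bits per character and at the smallest fixed width that still distinguishes every character, then prints the character frequencies that a variable-length code would exploit.

```python
import math
from collections import Counter

text = "ABRACADABRA"  # example string assumed for illustration

# Fixed-length encoding: every character costs the same number of bits.
ascii_bits = 8 * len(text)

# Smallest fixed width that can still tell k distinct characters apart.
k = len(set(text))
bits_per_char = math.ceil(math.log2(k))
compact_fixed_bits = bits_per_char * len(text)

# Skewed frequencies are what a variable-length code takes advantage of.
freq = Counter(text)

print(ascii_bits, compact_fixed_bits)  # 88 33
print(freq.most_common())
```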

Data Compression 3

File Compression: Example

• An encoding example
  – text: java
  – encoding: a = “0”, j = “11”, v = “10”
  – encoded text: 110100 (6 bits)

• How to decode (the problem of ambiguity)?
  – encoding: a = “0”, j = “01”, v = “00”
  – encoded text: 010000 (6 bits)
  – could be "java", or "jvv", or "jaaaa", among other readings
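The ambiguity can be checked by brute force. The sketch below (the function name all_decodings is hypothetical) enumerates every way to split a bit string into codewords, and is run on both encodings from this slide.

```python
def all_decodings(bits, code):
    """Return every way to split `bits` into codewords of `code`."""
    if not bits:
        return [""]
    results = []
    for char, word in code.items():
        if bits.startswith(word):
            # Consume this codeword, then decode the remainder recursively.
            for rest in all_decodings(bits[len(word):], code):
                results.append(char + rest)
    return results

ambiguous = {"a": "0", "j": "01", "v": "00"}
unambiguous = {"a": "0", "j": "11", "v": "10"}

print(sorted(all_decodings("010000", ambiguous)))
# ['jaaaa', 'jaav', 'java', 'jvaa', 'jvv']
print(all_decodings("110100", unambiguous))
# ['java']
```

The second encoding yields exactly one reading, which is what the prefix rule guarantees.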

Data Compression 4

Encoding Trie

• To prevent ambiguities in decoding, we require that the encoding satisfy the prefix rule: no code is a prefix of another.
  – a = “0”, j = “11”, v = “10” satisfies the prefix rule
  – a = “0”, j = “01”, v = “00” does not satisfy the prefix rule (the code of 'a' is a prefix of the codes of 'j' and 'v')

• We use an encoding trie to satisfy this prefix rule.
  – the characters are stored at the external nodes
  – a left child (edge) means 0
  – a right child (edge) means 1

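A = 010, B = 11, C = 00, D = 10, R = 011

[Figure: the encoding trie for these codes. From the root, 00 leads to C, 010 to A, 011 to R, 10 to D and 11 to B.]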
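The prefix rule is easy to test directly. This is a small sketch, with a hypothetical helper name, that compares every pair of codewords; it is quadratic in the number of codewords, which is fine for small alphabets.

```python
def satisfies_prefix_rule(code):
    """True if no codeword is a prefix of a different codeword."""
    words = list(code.values())
    return not any(
        i != j and words[j].startswith(words[i])
        for i in range(len(words))
        for j in range(len(words))
    )

print(satisfies_prefix_rule({"a": "0", "j": "11", "v": "10"}))  # True
print(satisfies_prefix_rule({"a": "0", "j": "01", "v": "00"}))  # False
print(satisfies_prefix_rule(
    {"A": "010", "B": "11", "C": "00", "D": "10", "R": "011"}))  # True
```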

Data Compression 5

Example of Decoding

• trie: A = 010, B = 11, C = 00, D = 10, R = 011

[Figure: the encoding trie for these codes]

• encoded text: 01011011010000101001011011010 (read as 010 11 011 010 00 010 10 010 11 011 010)

• text: ABRACADABRA
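Decoding walks the trie from the root, following one edge per bit and emitting a character whenever an external node is reached. The sketch below represents the trie as nested dictionaries, which is an assumption of this example rather than anything the slides prescribe, and it presumes the code satisfies the prefix rule.

```python
def build_trie(code):
    """Build a nested-dict trie; external nodes are the characters."""
    root = {}
    for char, word in code.items():
        node = root
        for bit in word[:-1]:
            node = node.setdefault(bit, {})
        node[word[-1]] = char
    return root

def decode(bits, trie):
    """Follow edges bit by bit; restart at the root after each character."""
    out = []
    node = trie
    for bit in bits:
        node = node[bit]
        if isinstance(node, str):  # reached an external node
            out.append(node)
            node = trie
    return "".join(out)

code = {"A": "010", "B": "11", "C": "00", "D": "10", "R": "011"}
trie = build_trie(code)
print(decode("01011011010000101001011011010", trie))  # ABRACADABRA
```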

Data Compression 6

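Trie this!

10000111110010011000111011110001010100110100

[Figure: encoding trie with external nodes E, N, K, C, S, B, T, W, R, O and edges labeled 0 and 1; the tree shape is not recoverable from the transcript]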

Data Compression 7

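Optimal Compression

• An issue with encoding tries is to ensure that the encoded text is as short as possible:

[Figure: first trie, with codes A = 010, B = 11, C = 00, D = 10, R = 011]

ABRACADABRA
01011011010000101001011011010 (29 bits)

[Figure: second trie; the bit string below corresponds to A = 00, B = 10, R = 11, C = 010, D = 011]

ABRACADABRA
001011000100001100101100 (24 bits)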
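The two bit counts can be reproduced with a short sketch. The first code table is the one given on the earlier slides; the second is read off by decoding the 24-bit string, so treat it as an inferred assumption.

```python
def encode(text, code):
    """Concatenate the codeword of each character."""
    return "".join(code[c] for c in text)

first = {"A": "010", "B": "11", "C": "00", "D": "10", "R": "011"}
# Inferred from the 24-bit string, not stated explicitly in the slides.
second = {"A": "00", "B": "10", "R": "11", "C": "010", "D": "011"}

for code in (first, second):
    bits = encode("ABRACADABRA", code)
    print(bits, len(bits))
# 01011011010000101001011011010 29
# 001011000100001100101100 24
```

The second trie does better because the most frequent character, A, sits closer to the root.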

Data Compression 8

Huffman Encoding Trie

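ABRACADABRA

character: A B R C D
frequency: 5 2 2 1 1

[Figure: construction steps. C (1) and D (1) are merged into a subtree of weight 2. B (2) and R (2) are merged into a subtree of weight 4. The subtrees of weight 4 and 2 are merged into a subtree of weight 6, leaving A (5) and the subtree of weight 6.]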

Data Compression 9

Huffman Encoding Trie (contd.)

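[Figure: A (5) and the subtree of weight 6 (containing B, R, C, D) are merged into the root of weight 11. With edges labeled 0 and 1, the codes are A = 0, B = 100, R = 101, C = 110, D = 111.]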

Data Compression 10

Final Huffman Encoding Trie

A B R A C A D A B R A
0 100 101 0 110 0 111 0 100 101 0

23 bits

[Figure: final Huffman trie with root weight 11; A = 0, B = 100, R = 101, C = 110, D = 111]

Data Compression 11

Another Huffman Encoding Trie

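ABRACADABRA

character: A B R C D
frequency: 5 2 2 1 1

[Figure: construction steps. C (1) and D (1) are merged into a subtree of weight 2. That subtree and R (2) are merged into a subtree of weight 4, leaving A (5), the subtree of weight 4 and B (2).]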

Data Compression 12

Another Huffman Encoding Trie

[Figure: the subtree of weight 4 (R, C, D) and B (2) are merged into a subtree of weight 6, leaving A (5) and the subtree of weight 6.]

Data Compression 13

Another Huffman Encoding Trie

[Figure: A (5) and the subtree of weight 6 are merged into the root of weight 11.]

Data Compression 14

Another Huffman Encoding Trie

A B R A C A D A B R A
0 10 110 0 1110 0 1111 0 10 110 0

23 bits

[Figure: final trie with edges labeled 0 and 1; A is at depth 1, B at depth 2, R at depth 3, C and D at depth 4]
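Both Huffman tries give the same total even though their shapes differ, because ties between equal frequencies can be broken either way. The sketch below checks this by summing frequency times codeword length. The first table is the code from the earlier Huffman trie; the codewords in the second table are an assumption chosen to match the depths in this trie (A at 1, B at 2, R at 3, C and D at 4).

```python
from collections import Counter

freq = Counter("ABRACADABRA")  # A: 5, B: 2, R: 2, C: 1, D: 1

# Code from the first Huffman trie.
first = {"A": "0", "B": "100", "R": "101", "C": "110", "D": "111"}
# Assumed codewords matching the depths of the second trie.
second = {"A": "0", "B": "10", "R": "110", "C": "1110", "D": "1111"}

def cost(code):
    """Total encoded length: sum of frequency times codeword length."""
    return sum(freq[c] * len(code[c]) for c in freq)

print(cost(first), cost(second))  # 23 23
```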

Construction Algorithm

Algorithm Huffman(X):
  Input: String X of length n
  Output: Encoding trie for X

  Compute the frequency f(c) of each character c of X.
  Initialize a priority queue Q.
  for each distinct character c in X do
    Create a single-node tree T storing c
    Q.insertItem(f(c), T)
  while Q.size() > 1 do
    f1 ← Q.minKey()
    T1 ← Q.removeMinElement()
    f2 ← Q.minKey()
    T2 ← Q.removeMinElement()
    Create a new tree T with left subtree T1 and right subtree T2.
    Q.insertItem(f1 + f2, T)
  return tree Q.removeMinElement()
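The pseudocode can be sketched in Python with the standard heapq module as the priority queue. The tuple-based trie, the tie-breaking counter and the function names are choices made for this sketch, and it assumes a non-empty input string. Because ties may be broken differently, the exact codewords can differ from the slides, but the total encoded length should not.

```python
import heapq
from collections import Counter
from itertools import count

def huffman_trie(text):
    """Return the trie as nested (left, right) tuples; leaves are characters."""
    freq = Counter(text)
    tiebreak = count()  # keeps heap entries comparable when frequencies tie
    heap = [(f, next(tiebreak), c) for c, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)  # two lowest-frequency trees
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (t1, t2)))
    return heap[0][2]

def codes(trie, prefix=""):
    """Read codewords off the trie: left edge is 0, right edge is 1."""
    if isinstance(trie, str):
        return {trie: prefix or "0"}  # single-character text still gets a bit
    left, right = trie
    table = codes(left, prefix + "0")
    table.update(codes(right, prefix + "1"))
    return table

table = codes(huffman_trie("ABRACADABRA"))
total = sum(len(table[c]) for c in "ABRACADABRA")
print(table, total)
```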

Data Compression 16

Construction Algorithm (contd)

• Running time for a text of length n with k distinct characters: O(n + k log k), that is, O(n) to count the frequencies plus O(k log k) for the priority queue operations

• Typically, k is O(1) (e.g., ASCII characters) and the algorithm runs in O(n) time.

• With a Huffman encoding trie, the encoded text has minimal length among all encodings that assign one prefix-free code to each character

Data Compression 17

Image Compression

• We can also use Huffman encoding for binary files (bitmaps, executables, etc.)

• Common groups of bits are stored at the leaves

• Example of an encoding suitable for b/w bitmaps:

[Figure: encoding trie whose external nodes store the 3-bit groups 000, 010, 101, 111, 001, 100, 011 and 110; edges are labeled 0 and 1]
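For binary data, the "characters" are fixed-size groups of bits. A minimal sketch, assuming 3-bit groups as in the figure and a made-up row of black-and-white pixel data, of how the group frequencies that would feed the Huffman construction can be collected:

```python
from collections import Counter

def bit_groups(bits, size=3):
    """Split a bit string into consecutive groups of `size` bits."""
    usable = len(bits) - len(bits) % size  # drop any incomplete tail group
    return [bits[i:i + size] for i in range(0, usable, size)]

bitmap_row = "000000111000000000111000"  # hypothetical b/w pixel data
freq = Counter(bit_groups(bitmap_row))
print(freq.most_common())  # [('000', 6), ('111', 2)]
```

Groups that occur often, such as long runs of white pixels, would then receive the shortest codes.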