HUFFMAN CODING 2

101
cs2420 | Introduction to Algorithms and Data Structures | Spring 2015 HUFFMAN CODING 2 1

Transcript of HUFFMAN CODING 2

Page 1: HUFFMAN CODING 2

cs2420 | Introduction to Algorithms and Data Structures | Spring 2015HUFFMAN CODING 2

1

Page 2: HUFFMAN CODING 2

administrivia…

2

Page 3: HUFFMAN CODING 2

3

-assignment 12 is due next TUESDAY

-final project

-upcoming lectures

Page 4: HUFFMAN CODING 2

last time…

4

Page 5: HUFFMAN CODING 2

number encodings

5

Page 6: HUFFMAN CODING 2

binary-each bit represents power of 2

-sum up all the bits that are on-128 + 16 + 2 + 1 = 147

-how can we convert the other way?

6

1 0 0 1 0 0 1 1

27 26 25 24 23 22 21 20

128 64 32 16 8 4 2 1on off off on off off on on

Page 7: HUFFMAN CODING 2

ASCII-each character corresponds to one byte

-remember, a byte is just an 8-bit number! (0-255)

-for example:00100000 = 32 = ‘ ‘ (blank space) 00111011 = 59 = ; 01000001 = 65 = A 01000010 = 66 = B

7

Page 8: HUFFMAN CODING 2

hexadecimal-hexadecimal is the base-16 number system

-we only have 10 digits (0-9), so to use a number base greater than 10 we need more symbols

-in hex, we use the letters A through F-A represents the value ten -F represents the value fifteen

8

Page 9: HUFFMAN CODING 2

hex to binary-each hex digit is a specific 4-bit sequence

0 = 0000 1 = 0001 … E = 1110 F = 1111

-converting from hex to binary is as simple as representing each digit with its bit-sequence

12 EF = 0001 0010 1110 1111

-a single byte is two hex digits-the bytes in the above are 12 and EF

9

Page 10: HUFFMAN CODING 2

10

what is the hex value of these 8 bits? 1110 1000

A) 18 B) EF C) E8 D) A4

Page 11: HUFFMAN CODING 2

11

how many different values can 3 bits hold?

A) 3 B) 4C) 7 D) 8 E) 15 F) 16

Page 12: HUFFMAN CODING 2

file compression-suppose we the following string stored in a text file:

ddddddddddddabc

-how many bytes of disk space does it take to store these 15 characters using ASCII?

-is there any way to represent this file in fewer bytes?

12

Page 13: HUFFMAN CODING 2

13

d

c

ba

a: 000 b: 001 c: 01 d: 1

“ddddddddddddabc” takes 15 bytes (120 bits) in ASCII

11111111111100000101 is less than 3 bytes (20 bits)

why is the d near the top of the tree?

Page 14: HUFFMAN CODING 2

14

what string do these bits encode? 0 1 1 0 0 0 1 0 1 1 1 0

A) word B) wow C) wool D) were

o

wr

l

H‘ ‘ ed

Page 15: HUFFMAN CODING 2

Huffman’s algorithm

15

Page 16: HUFFMAN CODING 2

16

1.count occurrences of each character in a string

2. for each character, create a leaf node to store the character and count (ie. weight)

3.place each leaf node into a priority queue

4.construct the binary trie

5.write header with binary trie information

6.compress string using character codes

Page 17: HUFFMAN CODING 2

today

17

Page 18: HUFFMAN CODING 2

18

-go through Huffman’s algorithm for compression-using a priority queue -tie-breakers -writing the header -compressing the string

-decompression

Page 19: HUFFMAN CODING 2

19

1.count occurrences of each character in a string

2. for each character, create a leaf node to store the character and count (ie. weight)

3.place each leaf node into a priority queue

4.construct the binary trie

5.write header with binary trie information

6.compress string using character codes

Page 20: HUFFMAN CODING 2

20

I heart data.

Page 21: HUFFMAN CODING 2

21

character frequency

I 1

‘’ 2

h 1

e 1

a 3

r 1

t 2

d 1

. 1

I heart data.

Page 22: HUFFMAN CODING 2

22

1.count occurrences of each character in a string

2. for each character, create a leaf node to store the character and count (ie. weight)

3.place each leaf node into a priority queue

4.construct the binary trie

5.write header with binary trie information

6.compress string using character codes

Page 23: HUFFMAN CODING 2

23

‘’:2 h:1 e:1

a:3 r:1 t:2 d:1

.:1

I:1

I heart data.

character frequency

I 1

‘’ 2

h 1

e 1

a 3

r 1

t 2

d 1

. 1

Page 24: HUFFMAN CODING 2

24

-but, let’s think about what happens when we write our compressed codes

-files must be a whole number of bytes-but, the length of our compressed string will not necessarily be a factor of 8 -what do we do?

Page 25: HUFFMAN CODING 2

25

-to solve this, we add a special EOF leaf node

-then, if we encounter this during decompression, we know we are done

-and we ignore the remaining bits

EOF:1

Page 26: HUFFMAN CODING 2

26

‘’:2 h:1 e:1

a:3 r:1 t:2 d:1

.:1

I:1

I heart data.

character frequency

I 1

‘’ 2

h 1

e 1

a 3

r 1

t 2

d 1

. 1

EOF 1

EOF:1

Page 27: HUFFMAN CODING 2

27

1.count occurrences of each character in a string

2. for each character, create a leaf node to store the character and count (ie. weight)

3.place each leaf node into a priority queue

4.construct the binary trie

5.write header with binary trie information

6.compress string using character codes

Page 28: HUFFMAN CODING 2

28

-how do we determine the highest priority?

-what happens if two nodes have the same priority?-need a tie-breaker! -use ASCII value for character to break tie

-lowest values have highest priority

Page 29: HUFFMAN CODING 2

29

I heart data.

PQ:

‘’:2 h:1 e:1 a:3 r:1 t:2 d:1 .:1I:1 EOF:1

Page 30: HUFFMAN CODING 2

30

I heart data.

PQ:

‘’:2

h:1 e:1

a:3

r:1

t:2

d:1 .:1I:1 EOF:1

Page 31: HUFFMAN CODING 2

31

character frequency ASCII

I 1 73

‘’ 2 32

h 1 72

e 1 101

a 3 97

r 1 114

t 2 116

d 1 100

. 1 46

EOF 1 0

I heart data.

Page 32: HUFFMAN CODING 2

32

I heart data.

PQ:

‘’:2

h:1 e:1

a:3

r:1

t:2

d:1 .:1I:1 EOF:1

73 104 101 114 100 46 0

Page 33: HUFFMAN CODING 2

33

I heart data.

PQ:

‘’:2

h:1 e:1

a:3

r:1

t:2

d:1 .:1I:1 EOF:1

73 104 101 114 100 46 0

Page 34: HUFFMAN CODING 2

34

I heart data.

PQ:

‘’:2

h:1 e:1

a:3

r:1

t:2

d:1 .:1I:1 EOF:1

73 104 101 114 100 46 0

Page 35: HUFFMAN CODING 2

35

I heart data.

PQ:

‘’:2

h:1 e:1

a:3

r:1

t:2

d:1 .:1I:1

EOF:1

73 104 101 114 100 46

Page 36: HUFFMAN CODING 2

36

I heart data.

PQ:

‘’:2

h:1 e:1

a:3

r:1

t:2

d:1

.:1

I:1

EOF:1

73 104 101 114 100

Page 37: HUFFMAN CODING 2

37

I heart data.

PQ:

‘’:2

h:1 e:1

a:3

r:1

t:2

d:1

.:1 I:1EOF:1

104 101 114 100

Page 38: HUFFMAN CODING 2

38

I heart data.

PQ:

‘’:2

h:1 e:1

a:3

r:1

t:2

d:1.:1 I:1EOF:1

104 101 114

Page 39: HUFFMAN CODING 2

39

I heart data.

PQ:

‘’:2

h:1

e:1

a:3

r:1

t:2

d:1.:1 I:1EOF:1

104 114

Page 40: HUFFMAN CODING 2

40

I heart data.

PQ:

‘’:2

h:1e:1

a:3

r:1

t:2

d:1.:1 I:1EOF:1

114

Page 41: HUFFMAN CODING 2

41

I heart data.

PQ:

‘’:2

h:1e:1

a:3

r:1

t:2

d:1.:1 I:1EOF:1

Page 42: HUFFMAN CODING 2

42

I heart data.

PQ:

‘’:2

h:1e:1

a:3

r:1

t:2

d:1.:1 I:1EOF:1

32 116

Page 43: HUFFMAN CODING 2

43

I heart data.

PQ: ‘’:2h:1e:1

a:3

r:1

t:2

d:1.:1 I:1EOF:1

116

Page 44: HUFFMAN CODING 2

44

I heart data.

PQ: ‘’:2h:1e:1

a:3

r:1 t:2d:1.:1 I:1EOF:1

Page 45: HUFFMAN CODING 2

45

I heart data.

PQ: ‘’:2h:1e:1 a:3r:1 t:2d:1.:1 I:1EOF:1

Page 46: HUFFMAN CODING 2

46

1.count occurrences of each character in a string

2. for each character, create a leaf node to store the character and count (ie. weight)

3.place each leaf node into a priority queue

4.construct the binary trie

5.write header with binary trie information

6.compress string using character codes

Page 47: HUFFMAN CODING 2

47

-merge the two lowest weight trees together-make a new parent node with their combined weight -smaller node on the left, larger on the right

-reinsert new tree back into the queue

-but, what about ties?-when trees have more than one node, break the tie with the ASCII value of the leftmost character

Page 48: HUFFMAN CODING 2

48

I heart data.

PQ: ‘’:2h:1e:1 a:3r:1 t:2d:1.:1 I:1EOF:1

where are the two lowest weight trees?

Page 49: HUFFMAN CODING 2

49

I heart data.

PQ: ‘’:2h:1e:1 a:3r:1 t:2d:1.:1 I:1EOF:1

where are the two lowest weight trees?

Page 50: HUFFMAN CODING 2

50

I heart data.

PQ: ‘’:2h:1e:1 a:3r:1 t:2d:1I:1

.:1EOF:1

2

Page 51: HUFFMAN CODING 2

51

I heart data.

PQ: ‘’:2h:1e:1 a:3r:1 t:2d:1I:1

.:1EOF:1

2

where do we insert this new tree?

Page 52: HUFFMAN CODING 2

52

I heart data.

PQ: ‘’:2h:1e:1 a:3r:1 t:2d:1I:1

.:1EOF:1

2

where do we insert this new tree?

32 116

0

Page 53: HUFFMAN CODING 2

53

I heart data.

PQ: ‘’:2h:1e:1 a:3r:1 t:2d:1I:1

.:1EOF:1

2

Page 54: HUFFMAN CODING 2

54

I heart data.

PQ: ‘’:2h:1e:1 a:3r:1 t:2d:1I:1

.:1EOF:1

2

AND…repeat…

Page 55: HUFFMAN CODING 2

55

I heart data.

PQ: ‘’:2h:1e:1 a:3r:1 t:2

d:1I:1

.:1EOF:1

2

2

Page 56: HUFFMAN CODING 2

56

I heart data.

PQ: ‘’:2h:1e:1 a:3r:1 t:2

.:1EOF:1

2

d:1I:1

2

Page 57: HUFFMAN CODING 2

h:1e:1

2

57

I heart data.

PQ: ‘’:2 a:3r:1 t:2

.:1EOF:1

2

d:1I:1

2

Page 58: HUFFMAN CODING 2

h:1e:1

2

58

I heart data.

PQ: ‘’:2 a:3r:1 t:2

.:1EOF:1

2

d:1I:1

2

Page 59: HUFFMAN CODING 2

h:1e:1

2

59

I heart data.

PQ: ‘’:2 a:3

r:1

t:2

.:1EOF:1

2

d:1I:1

2

Page 60: HUFFMAN CODING 2

60

I heart data.

PQ:

.:1EOF:1

2 r:1

3

h:1e:1

2‘’:2 a:3t:2

d:1I:1

2

Page 61: HUFFMAN CODING 2

61

I heart data.

PQ:

.:1EOF:1

2 r:1

3

h:1e:1

2‘’:2 a:3t:2

d:1I:1

2

Page 62: HUFFMAN CODING 2

d:1I:1

2‘’:2

62

I heart data.

PQ:

.:1EOF:1

2 r:1

3

h:1e:1

2 a:3t:2

Page 63: HUFFMAN CODING 2

d:1I:1

2

4

‘’:2

63

I heart data.

PQ:

.:1EOF:1

2 r:1

3

h:1e:1

2 t:2 a:3

Page 64: HUFFMAN CODING 2

d:1I:1

2

4

‘’:2

64

I heart data.

PQ:

.:1EOF:1

2 r:1

3

h:1e:1

2 a:3t:2

Page 65: HUFFMAN CODING 2

d:1I:1

2

4

‘’:2

65

I heart data.

PQ:

.:1EOF:1

2 r:1

3 a:3

h:1e:1

2 t:2

4

Page 66: HUFFMAN CODING 2

d:1I:1

2

4

‘’:2

66

I heart data.

PQ:

.:1EOF:1

2 r:1

3 a:3

h:1e:1

2 t:2

4

Page 67: HUFFMAN CODING 2

d:1I:1

2

4

‘’:2

67

I heart data.

PQ:

h:1e:1

2 t:2

4

.:1EOF:1

2 r:1

3 a:3

6

Page 68: HUFFMAN CODING 2

d:1I:1

2

4

‘’:2

68

I heart data.

PQ:

h:1e:1

2 t:2

4

.:1EOF:1

2 r:1

3 a:3

6

Page 69: HUFFMAN CODING 2

d:1I:1

2

4

‘’:2

69

I heart data.

PQ:

h:1e:1

2 t:2

4

.:1EOF:1

2 r:1

3 a:3

6

8

Page 70: HUFFMAN CODING 2

70

I heart data.

PQ:

.:1EOF:1

2 r:1

3 a:3

6

d:1I:1

2

4

‘’:2

h:1e:1

2 t:2

4

8

Page 71: HUFFMAN CODING 2

71

I heart data.

.:1EOF:1

2 r:1

3 a:3

6

d:1I:1

2

4

‘’:2

h:1e:1

2 t:2

4

8

14

Page 72: HUFFMAN CODING 2

72

while PQ contains at least two nodes { dequeue two trees, compare leftmost nodes to determine left and right nodes, L and R

node I = new internal node I.left = L I.right = R I.weight = L.weight + R.weight

PQ.enqueue(I) }

//PQ contains only one node root = PQ.dequeue()

Page 73: HUFFMAN CODING 2

73

1.count occurrences of each character in a string

2. for each character, create a leaf node to store the character and count (ie. weight)

3.place each leaf node into a priority queue

4.construct the binary trie

5.write header with binary trie information

6.compress string using character codes

Page 74: HUFFMAN CODING 2

74

-the compressed files needs info to be able to rebuild the same compression tree

-we call this the header

-what information do we need to store to reconstruct the compression tree?

Page 75: HUFFMAN CODING 2

75

1. store character frequencies-for every character -for only non-zero characters… must store pair

2. use a standardized letter frequency-use the same standard for compression and decompression

3. store the tree using a pre-order traversal-more complex, but smallest

Page 76: HUFFMAN CODING 2

76

-storing character frequencies-write each character, followed by the frequency -characters are 1 byte, integers are 4 bytes -so, each character will require 5 bytes to write

-how do we mark the end of the header (and beginning of the compressed string)?

-write out null and 0

-remember, we are going to write out our file as hex

Page 77: HUFFMAN CODING 2

77

character hex value frequency

I 49 1

‘’ 20 2

h 68 1

e 65 1

a 61 3

r 72 1

t 74 2

d 64 1

. 2E 1

EOF 00 1

I heart data.

49 00 00 00 01 20 00 00 00 02 68 00 00 00 01 65 00 00 00 01 61 00 00 00 03 72 00 00 00 01 74 00 00 00 02 64 00 00 00 01 2E 00 00 00 01 00 00 00 00 01 00 00 00 00 00

Page 78: HUFFMAN CODING 2

78

1.count occurrences of each character in a string

2. for each character, create a leaf node to store the character and count (ie. weight)

3.place each leaf node into a priority queue

4.construct the binary trie

5.write header with binary trie information

6.compress string using character codes

Page 79: HUFFMAN CODING 2

79

I heart data.

.:1EOF:1

2 r:1

3 a:3

6

d:1I:1

2

4

‘’:2

h:1e:1

2 t:2

4

8

14

Page 80: HUFFMAN CODING 2

80

-want to create a look-up table to quickly convert from character to bit code

-what data structure can we use for this?

Page 81: HUFFMAN CODING 2

81

character hex value bit-code

I 49 1010

‘’ 20 100

h 68 1101

e 65 1100

a 61 01

r 72 001

t 74 111

d 64 1011

. 2E 0001

EOF 00 0000

I heart data.

Page 82: HUFFMAN CODING 2

82

I heart data.

1010

character hex value bit-code

I 49 1010

‘’ 20 100

h 68 1101

e 65 1100

a 61 01

r 72 001

t 74 111

d 64 1011

. 2E 0001

EOF 00 0000

Page 83: HUFFMAN CODING 2

83

I heart data.

1010 100

character hex value bit-code

I 49 1010

‘’ 20 100

h 68 1101

e 65 1100

a 61 01

r 72 001

t 74 111

d 64 1011

. 2E 0001

EOF 00 0000

Page 84: HUFFMAN CODING 2

84

I heart data.

1010 1001 101

character hex value bit-code

I 49 1010

‘’ 20 100

h 68 1101

e 65 1100

a 61 01

r 72 001

t 74 111

d 64 1011

. 2E 0001

EOF 00 0000

Page 85: HUFFMAN CODING 2

85

I heart data.

1010 1001 1011 100

character hex value bit-code

I 49 1010

‘’ 20 100

h 68 1101

e 65 1100

a 61 01

r 72 001

t 74 111

d 64 1011

. 2E 0001

EOF 00 0000

Page 86: HUFFMAN CODING 2

86

I heart data.

1010 1001 1011 1000 1

character hex value bit-code

I 49 1010

‘’ 20 100

h 68 1101

e 65 1100

a 61 01

r 72 001

t 74 111

d 64 1011

. 2E 0001

EOF 00 0000

Page 87: HUFFMAN CODING 2

87

I heart data.

1010 1001 1011 1000 1001

character hex value bit-code

I 49 1010

‘’ 20 100

h 68 1101

e 65 1100

a 61 01

r 72 001

t 74 111

d 64 1011

. 2E 0001

EOF 00 0000

Page 88: HUFFMAN CODING 2

88

I heart data.

1010 1001 1011 1000 1001 1111 0010 1101 1110 1000 1

character hex value bit-code

I 49 1010

‘’ 20 100

h 68 1101

e 65 1100

a 61 01

r 72 001

t 74 111

d 64 1011

. 2E 0001

EOF 00 0000

Page 89: HUFFMAN CODING 2

89

I heart data.

1010 1001 1011 1000 1001 1111 0010 1101 1110 1000 1

character hex value bit-code

I 49 1010

‘’ 20 100

h 68 1101

e 65 1100

a 61 01

r 72 001

t 74 111

d 64 1011

. 2E 0001

EOF 00 0000

now what?

Page 90: HUFFMAN CODING 2

90

I heart data.

1010 1001 1011 1000 1001 1111 0010 1101 1110 1000 1000 0

character hex value bit-code

I 49 1010

‘’ 20 100

h 68 1101

e 65 1100

a 61 01

r 72 001

t 74 111

d 64 1011

. 2E 0001

EOF 00 0000

Page 91: HUFFMAN CODING 2

91

I heart data.

1010 1001 1011 1000 1001 1111 0010 1101 1110 1000 1000 0

character hex value bit-code

I 49 1010

‘’ 20 100

h 68 1101

e 65 1100

a 61 01

r 72 001

t 74 111

d 64 1011

. 2E 0001

EOF 00 0000

Are we donE?

Page 92: HUFFMAN CODING 2

92

I heart data.

1010 1001 1011 1000 1001 1111 0010 1101 1110 1000 1000 0000

character hex value bit-code

I 49 1010

‘’ 20 100

h 68 1101

e 65 1100

a 61 01

r 72 001

t 74 111

d 64 1011

. 2E 0001

EOF 00 0000

Page 93: HUFFMAN CODING 2

93

I heart data.

1010 1001 1011 1000 1001 1111 0010 1101 1110 1000 1000 0000

character hex value bit-code

I 49 1010

‘’ 20 100

h 68 1101

e 65 1100

a 61 01

r 72 001

t 74 111

d 64 1011

. 2E 0001

EOF 00 0000

now, convert to hex

Page 94: HUFFMAN CODING 2

94

I heart data.

1010 1001 1011 1000 1001 1111 0010 1101 1110 1000 1000 0000

character hex value bit-code

I 49 1010

‘’ 20 100

h 68 1101

e 65 1100

a 61 01

r 72 001

t 74 111

d 64 1011

. 2E 0001

EOF 00 0000

now, convert to hex

A9 B8 9F 2D E8 80

Page 95: HUFFMAN CODING 2

95

character hex value bit-code

I 49 0110

‘’ 20 010

h 68 1101

e 65 1100

a 61 10

r 72 001

t 74 111

d 64 0111

. 2E 0001

EOF 00 0000

I heart data.

49 00 00 00 01 20 00 00 00 02 68 00 00 00 01 65 00 00 00 01 61 00 00 00 03 72 00 00 00 01 74 00 00 00 02 64 00 00 00 01 2E 00 00 00 01 00 00 00 00 01 00 00 00 00 00 A9 B8 9F 2D E8 80

Page 96: HUFFMAN CODING 2

96

character hex value bit-code

I 49 0110

‘’ 20 010

h 68 1101

e 65 1100

a 61 10

r 72 001

t 74 111

d 64 0111

. 2E 0001

EOF 00 0000

I heart data.

49 00 00 00 01 20 00 00 00 02 68 00 00 00 01 65 00 00 00 01 61 00 00 00 03 72 00 00 00 01 74 00 00 00 02 64 00 00 00 01 2E 00 00 00 01 00 00 00 00 01 00 00 00 00 00 A9 B8 9F 2D E8 80

we just went from 13 bytes to 61 bytes.

why should we care about this????

Page 97: HUFFMAN CODING 2

decompression

97

Page 98: HUFFMAN CODING 2

98

what steps do I need to take to decompress this file?

49 00 00 00 01 20 00 00 00 02 68 00 00 00 01 65 00 00 00 01 61 00 00 00 03 72 00 00 00 01 74 00 00 00 02 64 00 00 00 01 2E 00 00 00 01 00 00 00 00 01 00 00 00 00 00 A9 B8 9F 2D E8 80

Page 99: HUFFMAN CODING 2

next time…

99

Page 100: HUFFMAN CODING 2

100

Page 101: HUFFMAN CODING 2

101

-homework-assignment 12 due Tuesday night