Topic 20: Huffman Coding

77
Topic 20: Huffman Coding The author should gaze at Noah, and ... learn, as they did in the Ark, to crowd a great deal of matter into a very small compass. Sydney Smith, Edinburgh Review

description

Topic 20: Huffman Coding. The author should gaze at Noah, and ... learn, as they did in the Ark, to crowd a great deal of matter into a very small compass. Sydney Smith, Edinburgh Review. Compression. Storage of digital information measured in bits and bytes. - PowerPoint PPT Presentation

Transcript of Topic 20: Huffman Coding

Page 1: Topic 20: Huffman Coding

Topic 20: Huffman Coding

The author should gaze at Noah, and ... learn, as they did in the Ark, to crowd a great deal of matter into a very small compass.

Sydney Smith, Edinburgh Review

Page 2: Topic 20: Huffman Coding

Agenda

· Encoding· Compression· Huffman Coding

2

Page 3: Topic 20: Huffman Coding

Encoding

· UT CS· 85 84 32 67 83· 01010101 01010100 00100000 01000011

01010011

· what is a file? how do some OS use file extensions?· open a bitmap in a text editor· open a pdf in word

3

Page 4: Topic 20: Huffman Coding

ASCII - UNICODE

4

Page 5: Topic 20: Huffman Coding

Text File

5

Page 6: Topic 20: Huffman Coding

Text File???

6

Page 7: Topic 20: Huffman Coding

Bitmap File

7

Page 8: Topic 20: Huffman Coding

Bitmap File????

8

Page 9: Topic 20: Huffman Coding

JPEG File

9

Page 10: Topic 20: Huffman Coding

JPEG VS BITMAP

· JPEG File

10

Page 11: Topic 20: Huffman Coding

Encoding Schemes

· "It's all 1s and 0s"· What do the 1s and 0s mean?· 50 12 109· ASCII -> 2ym· Red Blue Green ->

dark teal?

11

Page 12: Topic 20: Huffman Coding

Agenda

· Encoding· Compression· Huffman Coding

12

Page 13: Topic 20: Huffman Coding

Compression

· Compression: Storing the same information but in a form that takes less memory

· lossless and lossy compression· Recall:

13

Page 14: Topic 20: Huffman Coding

Lossy Artifacts

14

Page 15: Topic 20: Huffman Coding

Why Bother?

· Is compression really necessary?

2 Terabytes500 HD, 2 hour movies or 500,000 songs

15

Page 16: Topic 20: Huffman Coding

Little Pipes and Big Pumps

Home Internet Access· 40 Mbps roughly $40 per

month.· 12 months * 3 years * $40

= $1,440· 10,000,000 bits /second

= 1.25 * 106 bytes / sec

CPU Capability· $1,500 for a laptop or

desktop· Intel i7 processor· Assume it lasts 3 years.· Memory bandwidth

25.6 GB / sec= 2.6 * 1010 bytes / sec

· on the order of 5.0 * 1010 instructions / second

16

Page 17: Topic 20: Huffman Coding

Mobile Devices?

Cellular Network· Your mileage may vary …· Mega bits per second· AT&T

17 download, 7 upload

· T-Mobile & Verizon12 download, 7 upload

· 17,000,000 bits per second = 2.125 x 106 bytes per second

http://tinyurl.com/q6o7wan

iPhone CPU· Apple A6 System on a

Chip· Coy about IPS· 2 cores· Rough estimates:

1 x 1010 instructions per second

17

Page 18: Topic 20: Huffman Coding

Little Pipes and Big Pumps

Data In From Network

CPU

18

Page 19: Topic 20: Huffman Coding

Compression - Why Bother?

19

· Apostolos "Toli" Lerios· Facebook Engineer· Heads image storage group· jpeg images already

compressed· look for ways to compress even

more· 1% less space = millions of

dollars in savings

Page 20: Topic 20: Huffman Coding

Agenda

· Encoding· Compression· Huffman Coding

20

Page 21: Topic 20: Huffman Coding

21

Purpose of Huffman Coding

· Proposed by Dr. David A. Huffman – A Method for the Construction of Minimum

Redundancy Codes– Written in 1952

· Applicable to many forms of data transmission– Our example: text files– still used in fax machines, mp3 encoding, others

Page 22: Topic 20: Huffman Coding

22

The Basic Algorithm

· Huffman coding is a form of statistical coding· Not all characters occur with the same

frequency!· Yet in ASCII all characters are allocated the

same amount of space

– 1 char = 1 byte, be it e or x

Page 23: Topic 20: Huffman Coding

23

The Basic Algorithm

· Any savings in tailoring codes to frequency of character?

· Code word lengths are no longer fixed like ASCII or Unicode

· Code word lengths vary and will be shorter for the more frequently used characters

Page 24: Topic 20: Huffman Coding

24

The Basic Algorithm

1. Scan text to be compressed and tally occurrence of all characters.

2. Sort or prioritize characters based on number of occurrences in text.

3. Build Huffman code tree based on prioritized list.

4. Perform a traversal of tree to determine all code words.

5. Scan text again and create new file using the Huffman codes

Page 25: Topic 20: Huffman Coding

25

Building a TreeScan the original text

· Consider the following short text

Eerie eyes seen near lake.

· Count up the occurrences of all characters in the text

Page 26: Topic 20: Huffman Coding

26

Building a TreeScan the original text

Eerie eyes seen near lake.· What characters are present?

E e r i space y s n a r l k .

Page 27: Topic 20: Huffman Coding

27

Building a TreeScan the original text

Eerie eyes seen near lake.· What is the frequency of each character in the

text?

Char Freq. Char Freq. Char Freq. E 1 y 1 k 1 e 8 s 2 . 1 r 2 n 2 i 1 a 2 space 4 l 1

Page 28: Topic 20: Huffman Coding

28

Building a TreePrioritize characters

· Create binary tree nodes with character and frequency of each character

· Place nodes in a priority queue– The lower the occurrence, the higher the

priority in the queue

Page 29: Topic 20: Huffman Coding

29

· The queue after inserting all nodes

· Null Pointers are not shown

Building a Tree

E

1

i

1

y

1

l

1

k

1

.

1

r

2

s

2

n

2

a

2

sp

4

e

8

Page 30: Topic 20: Huffman Coding

30

Building a Tree

· While priority queue contains two or more nodes– Create new node– Dequeue node and make it left subtree– Dequeue next node and make it right subtree– Frequency of new node equals sum of frequency of

left and right children– Enqueue new node back into queue

Page 31: Topic 20: Huffman Coding

31

Building a Tree

E

1

i

1

y

1

l

1

k

1

.

1

r

2

s

2

n

2

a

2

sp

4

e

8

Page 32: Topic 20: Huffman Coding

32

Building a Tree

E

1

i

1

2

y

1

l

1

k

1

.

1

r

2

s

2

n

2

a

2

sp

4

e

8

Page 33: Topic 20: Huffman Coding

33

Building a Tree

E

1

i

1

k

1

l

1

y

1

.

1

a

2

n

2

r

2

s

2

sp

4

e

82

Page 34: Topic 20: Huffman Coding

34

Building a Tree

E

1

i

1

y

1

.

1

a

2

n

2

r

2

s

2

sp

4

e

82

k

1

l

1

2

Page 35: Topic 20: Huffman Coding

35

Building a Tree

E

1

i

1

y

1

.

1

a

2

n

2

r

2

s

2

sp

4

e

8

2

k

1

l

1

2

Page 36: Topic 20: Huffman Coding

36

Building a Tree

E

1

i

1

a

2

n

2

r

2

s

2

sp

4

e

8

2

k

1

l

1

2

y

1

.

1

2

Page 37: Topic 20: Huffman Coding

37

Building a Tree

E

1

i

1

a

2

n

2

r

2

s

2

sp

4

e

8

2

k

1

l

1

2

y

1

.

1

2

Page 38: Topic 20: Huffman Coding

38

Building a Tree

E

1

i

1

r

2

s

2sp

4

e

8

2

k

1

l

1

2

y

1

.

1

2

a

2

n

2

4

Page 39: Topic 20: Huffman Coding

39

Building a Tree

E

1

i

1

r

2

s

2sp

4

e

8

2

k

1

l

1

2

y

1

.

1

2

a

2

n

2

4

Page 40: Topic 20: Huffman Coding

40

Building a Tree

E

1

i

1

sp

4

e

82

k

1

l

1

2

y

1

.

1

2

a

2

n

2

4

r

2

s

2

4

Page 41: Topic 20: Huffman Coding

41

Building a Tree

E

1

i

1

sp

4

e

82

k

1

l

1

2

y

1

.

1

2

a

2

n

2

4

r

2

s

2

4

Page 42: Topic 20: Huffman Coding

42

Building a Tree

E

1

i

1

sp

4

e

8

2

k

1

l

1

2

y

1

.

1

2

a

2

n

2

4

r

2

s

2

4

4

Page 43: Topic 20: Huffman Coding

43

Building a Tree

E

1

i

1

sp

4

e

82

k

1

l

1

2y

1

.

1

2

a

2

n

2

4

r

2

s

2

4 4

Page 44: Topic 20: Huffman Coding

44

Building a Tree

E

1

i

1

sp

4

e

82

k

1

l

1

2

y

1

.

1

2

a

2

n

2

4

r

2

s

2

4 4

6

Page 45: Topic 20: Huffman Coding

45

Building a Tree

E

1

i

1

sp

4

e

82

k

1

l

1

2

y

1

.

1

2a

2

n

2

4

r

2

s

2

4 4 6

What is happening to the characters with a low number of occurrences?

Page 46: Topic 20: Huffman Coding

46

Building a Tree

E

1

i

1

sp

4

e

82

k

1

l

1

2

y

1

.

1

2

a

2

n

2

4

r

2

s

2

4

4 6

8

Page 47: Topic 20: Huffman Coding

47

Building a Tree

E

1

i

1

sp

4

e

82

k

1

l

1

2

r

1

.

1

2

a

2

n

2

4

r

2

s

2

4

4 6 8

Page 48: Topic 20: Huffman Coding

48

Building a Tree

E

1

i

1

sp

4

e

8

2

k

1

l

1

2

y

1

.

1

2

a

2

n

2

4

r

2

s

2

4

46

8

10

Page 49: Topic 20: Huffman Coding

49

Building a Tree

E

1

i

1

sp

4

e

8

2

k

1

l

1

2

y

1

.

1

2a

2

n

2

4

r

2

s

2

4 46

8 10

Page 50: Topic 20: Huffman Coding

50

Building a Tree

E

1

i

1

sp

4

e

8

2

k

1

l

1

2

y

1

.

1

2

a

2

n

2

4

r

2

s

2

4

46

8

10

16

Page 51: Topic 20: Huffman Coding

51

Building a Tree

E

1

i

1

sp

4

e

82

k

1

l

1

2

y

1

.

1

2

a

2

n

2

4

r

2

s

2

4

46

8

10 16

Page 52: Topic 20: Huffman Coding

52

Building a Tree

E

1

i

1

sp

4

e

82

k

1

l

1

2

y

1

.

1

2

a

2

n

2

4

r

2

s

2

4

46 8

1016

26

Page 53: Topic 20: Huffman Coding

53

Building a Tree

E

1

i

1

sp

4

e

82

k

1

l

1

2

y

1

.

1

2

a

2

n

2

4

r

2

s

2

4

46 8

1016

26

• After enqueueing this node there is only one node left in priority queue.

Page 54: Topic 20: Huffman Coding

54

Building a TreeDequeue the single node left in the queue.

This tree contains the new code words for each character.

Frequency of root node should equal number of characters in text.

E

1

i

1

sp

4

e

82

k

1

l

1

2

y

1

.

1

2

a

2

n

2

4

r

2

s

2

4

46 8

1016

26

Eerie eyes seen near lake. 4 spaces,26 characters total

Page 55: Topic 20: Huffman Coding

55

Encoding the FileTraverse Tree for Codes

· Perform a traversal of the tree to obtain new code words

· left, append a 0 to code word· right append a 1 to code word· code word is only completed

when a leaf node is reached

E

1

i

1

sp

4

e

82

k

1

l

1

2

y

1

.

1

2

a

2

n

2

4

r

2

s

2

4

46 8

1016

26

Page 56: Topic 20: Huffman Coding

56

Encoding the FileTraverse Tree for Codes

Char CodeE 0000i 0001k 0010l 0011y 0100. 0101space 011e 10a 1100n 1101r 1110s 1111 E

1

i

1

sp

4

e

82

k

1

l

1

2

y

1

.

1

2

a

2

n

2

4

r

2

s

2

4

46 8

1016

26

Page 57: Topic 20: Huffman Coding

57

Encoding the File

· Rescan text and encode file using new code words

Eerie eyes seen near lake.

Char CodeE 0000i 0001k 0010l 0011y 0100. 0101space 011e 10a 1100n 1101r 1110s 1111

000010111000011001110010010111101111111010110101111011011001110011001111000010100101

Page 58: Topic 20: Huffman Coding

58

Encoding the FileResults

· Have we made things any better?

· 82 bits to encode the text

· ASCII would take 8 * 26 = 208 bits

000010111000011001110010010111101111111010110101111011011001110011001111000010100101

hIf modified code used 4 bits per character are needed. Total bits 4 * 26 = 104. Savings not as great.

Page 59: Topic 20: Huffman Coding

59

Decoding the File

· How does receiver know what the codes are?· Tree constructed for each text file.

– Considers frequency for each file– Big hit on compression, especially for smaller files

· Tree predetermined– based on statistical analysis of text files or file types

Page 60: Topic 20: Huffman Coding

60

Decoding the File· Once receiver has tree it

scans incoming bit stream· 0 go left· 1 go right101000100111100011111111011100001010

A. elk nay sirB. eek a snakeC. eek kin slyD. eek snarl nilE. eel a snarl

E

1

i

1

sp

4

e

82

k

1

l

1

2

y

1

.

1

2

a

2

n

2

4

r

2

s

2

4

46 8

1016

26

Page 61: Topic 20: Huffman Coding

Assignment Hints

· reading chunks not chars· header format· the pseudo eof character· the GUI

61

Page 62: Topic 20: Huffman Coding

Assignment Example

· "Eerie eyes seen near lake." will result in different codes than those shown in slides due to:– adding elements in order to PriorityQueue– required pseudo eof character (PEOF)

62

Page 63: Topic 20: Huffman Coding

Assignment Example

63

Char Freq. Char Freq. Char Freq. E 1 y 1 k 1 e 8 s 2 . 1 r 2 n 2 PEOF 1 i 1 a 2 space 4 l 1

Page 64: Topic 20: Huffman Coding

Assignment Example

64

.1

y1

E1

i1

k1

l1

PEOF1

a2

n2

r2

s2

SP4

e8

Page 65: Topic 20: Huffman Coding

Assignment Example

65

.1

y1

E1

i1

k1

l1

PEOF1

a2

n2

r2

s2

SP4

e8

2

Page 66: Topic 20: Huffman Coding

Assignment Example

66

.1

y1

E1

i1

k1

l1

PEOF1

a2

n2

r2

s2

SP4

e8

2

Page 67: Topic 20: Huffman Coding

Assignment Example

67

.1

y1

E1

i1

k1

l1

PEOF1

a2

n2

r2

s2

SP4

e8

2 2

Page 68: Topic 20: Huffman Coding

Assignment Example

68

.1

y1

E1

i1

k1

l1

PEOF1

a2

n2

r2

s2

SP4

e8

2 2 2

Page 69: Topic 20: Huffman Coding

Assignment Example

69

.1

y1

E1

i1

k1

l1

PEOF1

a2

n2

r2

s2

SP4

e8

2 2 2 3

Page 70: Topic 20: Huffman Coding

Assignment Example

70

.1

y1

E1

i1

k1

l1

PEOF1

a2

n2

r2

s2

SP4

e8

2 2 2 3 4

Page 71: Topic 20: Huffman Coding

Assignment Example

71

.1

y1

E1

i1

k1

l1

PEOF1

a2

n2

r2

s2

SP4

e8

2

2 2 3 4 4

Page 72: Topic 20: Huffman Coding

72

.1

y1

E1

i1

k1

l1 PEOF

1a2

n2

r2

s2

SP4

e8

22 2

3

4 4 4 7

Page 73: Topic 20: Huffman Coding

73

y1

i1

k1

l1

PEOF1

a2

SP4

e8

2 2 3

47

.1

E1

n2

r2

s2

2

4 4

8

Page 74: Topic 20: Huffman Coding

74

y1

i1

k1

l1 PEOF

1a2

SP4

e8

2 23

4 7

.1

E1

n2

r2

s2

2

4 4

8 11

Page 75: Topic 20: Huffman Coding

y1

i1

k1

l1 PEOF

1a2

SP4

e8

2 23

4 7

.1

E1

n2

r2

s2

2

4 4

8

11 16

75

Page 76: Topic 20: Huffman Coding

y1

i1

k1

l1 PEOF

1a2

SP4

e8

2 23

4 7

.1

E1

n2

r2

s2

2

4 4

8

11 16

27

76

Page 77: Topic 20: Huffman Coding

Codes

77

value: 32, equivalent char: , frequency: 4, new code 011value: 46, equivalent char: ., frequency: 1, new code 11110value: 69, equivalent char: E, frequency: 1, new code 11111value: 97, equivalent char: a, frequency: 2, new code 0101value: 101, equivalent char: e, frequency: 8, new code 10value: 105, equivalent char: i, frequency: 1, new code 0000value: 107, equivalent char: k, frequency: 1, new code 0001value: 108, equivalent char: l, frequency: 1, new code 0010value: 110, equivalent char: n, frequency: 2, new code 1100value: 114, equivalent char: r, frequency: 2, new code 1101value: 115, equivalent char: s, frequency: 2, new code 1110value: 121, equivalent char: y, frequency: 1, new code 0011value: 256, equivalent char: ?, frequency: 1, new code 0100