Download - Huffman Coding - Colorado State Universityanderson/cs320/slides/Huffman.pdf · Huffman Coding How can we represent a string of characters with a minimum number of bits? Answered by

Transcript

Huffman CodingHow can we represent a string of

characters with a minimum number

of bits ?

Answered by David Huffman in 1952

when he was a graduate student at MIT .

• Eachcharacter represented by a bitstring code

.

• Codes areof various lengths .

• Codes are prefix codes ( prefix - free ).

No code is

prefixof another

÷¥÷i¥÷xa¥÷Y_bz Prefix cod

How would you represent a string of

Characters 6,

C,

T,

and A ?

Assign shortest codes to most frequentcharacters

.Let's

say frequencies are

G C T A-

- - -

O . 4 0.3 0.2 0.1

One prefix code is

G C T A-

- --

O to 110 I I I

The stringC G TT A CC AT AA

is10 O 110 110 Ill to LO Ill 110 Ill Ill

Canyou decode this ? This binary tree

Ohelps :

ToY

④ to%

to to

How many bits are needed for tooo GST , As ?

G C T A-

- - -

0.4 0.3 0.2 0.1

CodeG : 0.4 .1000 = 400 o

C

:0.3 - 1000 = 300 10

T:

0.2 . 1000 = 200 110

A : Oil . 1000 = COO Ill

Number of bits = 400.1 t 300.2 t 200.3+1003

= 4001-600 t 600 t 300

= 1,900 bits

IWhat if we used a fixed-length code ?Not

Four characters only needs 2 bit each much

G C T A less- - - -

00 01 LO 11

4000 characters = 2. ooo bits)

what if frequencies were

G C T A- - - -

O . 8 O. I 0.05 0.05

O LO I I O C I I

1,000 characters = 800 . I t 100.2+50-3+50.3= 800 t 200 t 150 t 150

= 1,300

fixed length still 2,000

Savings increases with number of characters.

How to construct a Huffman code ?or

How to construct that decoding binary tree ?

Start with a leaf for each character withits frequency .

G C T A-

-- -

O . 4 0.3 0.2 O. I

Merge two nodes with lowest frequency .

Sum their frequencies .

G C

074Is0.3

I IT ATo

Repeat G-0.4 0.6

I \c

OJ 0.3

I IT A

- -

0.2 O. I

1.0

Repeat again GI I

074 0.6

I \C-

0.30.3

I IT A

- -

and assign O to deff0.2 0.1

subtrees and I to right ones.

1.0

4

YG

074 0.6

OOfYC-

0.3O . 3

10

01IT A

- -

0.2 O. I

110Ill

Pseudo code from page 431.

Assume Q is a min - priority queue

of objects containing a frequency and,

if a

leaf,

the character.

In the queue,

the order

is determined by the frequency .

Let C . be the initial objects containingeach character and its frequency .

Each object has attributes . freer. lefta right .

n = ICI

Q = C

for i = I to n - I

2 = new object2. left = x = pop C Q )

2. right =

y =

pop CQ )

Z. frey = X. frog t y . freq

pushCQ,

z )return pop CQ ) # Top node

To encode a string of characters,

a dictionary of Huffman codes,

indexed ( keyed ) by characters,

would

be useful.

How can you construct this dictionaryfrom the tree ?

1.0

4

YG

074 0.6

OOfYC-

0.3O . 3

10

01IT A

- -

0.2 O. I

110Ill

Now,

how would you use this

dictionary to encode a text

String ?,

gcc AT Gj ⇒ 010104

!" 0010

And how would you decode this ?

0101.04! 110010 ⇒

'GCC AT 6C

'

1.0

4

YG

074 0.6

0OfYc

Is o . 3

10

01I

'

T A- -

0.2 O. I

110Ill

Why does the Huffman algorithm guaranteean optimal ( minimum size ) prefix - free code ?

or

Does the greedy choice of merging the two

lowest frequency characters always work ?

Lemma 16.2-

T T is an optimal tree.

×a and b are at maximum

depth .

x andy ane characters

Y with lowest frequency .

I \Swap x and a

,and

a b

yand b

.

T"

a

Cost of codes for T"

Same as cost for T ,

is I s Iimisaiatsoeea:I \x↳y

Details : Assume a. freq E b.freq

x. freq E y . freq

BCT ) is the expected cost of encodings using T .

BCT ) = Eec c. freq . of Cc )

set of 9 ( depth in Tof character c,

all characters or number of bits in c code.

Let T'

be the tree after swapping just x and a.

BCT ) - BCT' )

= Ifc c. freq . of Cc )-

€ cc -freq 'd,

! )

= X. freq . of Cx ) ta. freq . of Ca ) - x. freq . df.cc ) - a. freq . Ipcc)

= X. freq . dy Cx ) ta . freq .

of.ca ) - X. freq - afca ) - a. freq . dyed= @freq -

a. freq) dtcx ) t Ca .frey - X. frey) d,

Ca)

= ( a. freq - x. frey ) C at -

a'eCx ) )

e-ZO nonnegative because nonnegativex is minimum frequency because a is

leaf maximum depthleaf int

swap X and a

f Then swap

So BCT ) - BLT' ) Zo

.

Yana bfWe can show in the same way that BCT ' ) - BCT " )zo

.

So BCT ) - BCT " ) ZO or BCT ) z BCT " ).

But, since we assumed Tis optimal ,

BCT ) s BCT " I.

So,

BCT ) = BCT " ) !

We just showed thata . .

• . .

T "

,formed by merging lowest frequency

characters, is an optimal tree !

Of all possible mergers at each step ,the

Huffman algorithm chooses the one that

incurs the least cost.

Now,

let's prove that the problem of constructing

optimal prefix codes has the optimal substructure

property .

Lemma 16.3.

as before,

let x andy

be the

characters with the two lowest frequencies .

We must show that if T'

is optimal for C - Ex, y 3

then the step of

replacing z with÷÷. .

. ..

. .ae.

resits in1 \

?x y an optimal

tree for C.

BCT ) = ¥ qgjfneg-d.cc) t Cx

. freq ty.fregdcd.cat I )

= BCT ' ) t x .freq t y . freqo -

BC T' ) = BCT ) - x. for -

y . far

T .,÷.g!* eraser

B ( T' ) = BCT ) - x. for -

y . farLet's assume that what we want to be

true,

thatT is optimal for C

i

,is not

true,

and see itthis results in a contradiction.

If T is not optimal for C,

then for tree T"

that is optimal for C,

BC T" ) < BCT )

.

Call T' "

the tree that we get by merging x andy

in

T"

.BLT ' " ) = BC T

" ) - x. freq -

y .freq< BCT ) -

x. fer -

y . freer= BCT ' )

So B CT' ' ' ) L B C T

' )

Contradicting our assumption thatT and T

'are optimal !