Bits and Huffman Encoding
Please get a piece of paper and a pen and put your name and netid on it. Make sure you can turn in it after class without losing
your notes.
How is data stored in a computer?
• Several different ways are possible• In an ASCII-encoded text file (hithero known
as a “text file”), each character consists of 8-bit chunks
• What’s a bit?
Bits Mean What You Want Them To Mean
01001101 01001001 01001011 01000101
M I K E
But why 8-bit chunks? Couldn’t we interpret the same string as 4 bit chunks?
0100 1101 0100 1001 0100 1011 0100 01014 13 4 9 4 11 4 5
With 8 bit chucks there are 28 (256) possible characters. With 4 bit chunks there are 24 (16) possible characters.
What if we want to save memory?
• Could just use fewer bits per chunk, but that limits the number of characters we can represent
• But my document contains a lot more ‘a’s than ‘%’s. What if the code for a was “01” and the code for % was “10000101”
A ProblemMike decides he wants to compress his vast collection of Harry Potter
fanfic (all stored a textfiles). He notices that the character ‘H’ occurs much more frequently in the textfiles than does the character ‘%’, but they both use 8 bits. So he changes the encoding so that ‘H’ is represented by ‘01’ (2 bits) and ‘%’ is represented by ‘1010101010101010’ (16 bits). Assuming all other characters stay at 8 bits (a bit unlikely…but just pretend)…how much more frequent do ‘H’s need to be than ‘%’s for there to be a space savings?
1. There must be more than 3Hs for every %s2. There must be more than 3Hs for every 2%s3. There must be more than 4Hs for every 3%s4. As long as there more Hs than %s you get a savings
• We can get savings if we know what characters occur frequently
• So imagine you look at some set of data with character frequencies. How much can you save?
Consider the string:
• AACCCAAABAADAE
Letter Frequency
A 8
C 3
B 1
D 1
E 1
What’s the best encoding?
Why will this encoding not work?
Letter Frequency Encoding
A 8 0
C 3 1
B 1 10
D 1 01
E 1 11
What is the encoding for AAE?What is the encoding for ADC?
Consider the string:
• AACCCAAABAADAEThe rule: one character’s encoding cannot be the prefix of another. So if A=01, no other encoding can begin with 01
Letter Frequency Encoding
A 8 ?
C 3 ?
B 1 ?
D 1 ?
E 1 ?
Work with your row to find the most efficient possible encoding for each character. I was able to encode the entire string in 25 bits. See if you can do it that efficiently – but be careful with prefixes. Write your encoding, plus your computation of its total length, on your handin sheet.
My Encoding
• AACCCAAABAADAEThe rule: one character’s encoding cannot be the prefix of another. So if A=01, no other encoding can begin with 01
Letter Frequency Encoding
A 8 0
C 3 11
B 1 100
D 1 1010
E 1 1011
There a several possible encodings that can give you 25. I’m pretty sure there’s no more efficient encoding, at least as far as compressing individual characters.
Given a set of frequencies, how should you select the character encoding for
maximum efficiency?
Huffman Encoding
The basic idea
• We will make a “huffman tree”. By repeatedly combining nodes with small frequencies into “meta nodes” with larger frequency
• A letter’s position in the tree will tell us what its encoded form is
Generate a huffman tree and encodings for this circumstance
Follow the tutorial linked off the resources section in Sakai
Letter2 Frequency Encoding
A 12 ?
C 9 ?
B 8 ?
D 8 ?
E 5 ?
F 5 ?
Write the huffman tree and the resultant encoding on your handin sheet.
The Header
• The Magic Number• Counts• Or a tree for extra credit
The Psuedo-EOF character
• A made up character to write to your file
Please turn your sheet in at the back of the room (try for a neat pile)
Top Related