Data Compression Huffman Codes


  • 8/7/2019 Data Compression Huffman Codes

    1/60

    Huffman Codes

    Prof. Ja-Ling Wu

    Department of Computer Science and Information Engineering, National Taiwan University


    Huffman Encoding

    1. Order the symbols according to their probabilities

    alphabet set: S1, S2, …, SN
    prob. of occurrence: P1, P2, …, PN

    the symbols are rearranged so that

    P1 ≥ P2 ≥ … ≥ PN

    2. Apply a contraction process to the two symbols with the smallest probabilities

    replace symbols SN-1 and SN by a hypothetical symbol, say HN-1, that has a prob. of occurrence PN-1 + PN
    the new set of symbols has N-1 members: S1, S2, …, SN-2, HN-1

    3. Repeat step 2 until the final set has only one member.


    The recursive procedure in step 2 can be viewed as the construction of a binary tree, since at each step we are merging two symbols.

    At the end of the recursion, all the symbols S1, S2, …, SN will be leaf nodes of the tree.

    The codeword for each symbol Si is obtained by traversing the binary tree from its root to the leaf node corresponding to Si.
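The three encoding steps can be sketched in a few lines of Python. This is a minimal illustration rather than the lecturer's own code: the function name `huffman_code` and the use of a heap are assumptions, and the seven probabilities are those of the worked example in these slides. The 0/1 labelling of branches is arbitrary, so the individual codewords may differ from the slide while the average length stays the same.

```python
import heapq
from itertools import count

def huffman_code(probs):
    """Return {symbol: codeword} by repeatedly merging the two least probable entries."""
    tiebreak = count()  # unique counter so the heap never has to compare the dict payloads
    heap = [(p, next(tiebreak), {sym: ""}) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, group0 = heapq.heappop(heap)   # smallest probability
        p1, _, group1 = heapq.heappop(heap)   # second smallest
        # The merged (hypothetical) symbol: prepend one more bit to every member.
        merged = {s: "0" + c for s, c in group0.items()}
        merged.update({s: "1" + c for s, c in group1.items()})
        heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))
    return heap[0][2]

# Probabilities of the seven-symbol example (a to g).
probs = {"a": 0.05, "b": 0.2, "c": 0.1, "d": 0.05, "e": 0.3, "f": 0.2, "g": 0.1}
code = huffman_code(probs)
l_ave = sum(probs[s] * len(code[s]) for s in probs)
print(code)
print(round(l_ave, 2))
```

Prepending a bit at every merge is equivalent to reading the tree from the root down to the leaf, which is how the slide describes codeword assignment.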


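    [Figure: Huffman code tree for the example. Leaves a to g; internal nodes h (0.1), i (0.2), j (0.3), k (0.4), l (0.6) and the root (1.0); branches labelled 0 and 1.]

    Symbol   pi     C(w)    li
    a        0.05   10101   5
    b        0.2    01      2
    c        0.1    100     3
    d        0.05   10100   5
    e        0.3    11      2
    f        0.2    00      2
    g        0.1    1011    4

    Reduction of the ordered list (one merge per step):

    Original   Step 1   Step 2   Step 3   Step 4   Step 5
    e 0.3      e 0.3    e 0.3    e 0.3    k 0.4    l 0.6
    b 0.2      b 0.2    b 0.2    j 0.3    e 0.3    k 0.4
    f 0.2      f 0.2    f 0.2    b 0.2    j 0.3
    c 0.1      c 0.1    i 0.2    f 0.2
    g 0.1      g 0.1    c 0.1
    a 0.05     h 0.1
    d 0.05

    The final merge of l and k gives the root (1.0).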


    Average codeword length

    $l_{ave} = \sum_i p_i l_i$

    lave is a measure of the compression ratio.

    In the above example,

    7 symbols → 3 bits fixed-length code representation

    lave(Huffman) = 2.6 bits

    Compression ratio = 3/2.6 = 1.15


    Properties of Huffman codes

    Fixed-length symbols → variable-length codewords: error propagation

    H(s) ≤ lave < H(s) + 1

    H(s) ≤ lave < H(s) + P + 0.086

    where P is the prob. of the most frequently occurring symbol. The equality lave = H(s) is achieved when all symbol probs. are inverse powers of two.

    The Huffman code-tree can be constructed both by
    the bottom-up method (as in the above example)
    the top-down method
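The two bounds can be checked numerically on the seven-symbol example. The snippet below is only a sanity check under the assumption that the example's probabilities and codeword lengths are as listed in the slides; the variable names are illustrative.

```python
import math

# Probabilities and Huffman codeword lengths of the seven-symbol example,
# listed in descending order of probability (e, b, f, c, g, a, d).
probs = [0.3, 0.2, 0.2, 0.1, 0.1, 0.05, 0.05]
lengths = [2, 2, 2, 3, 4, 5, 5]

entropy = -sum(p * math.log2(p) for p in probs)        # H(s) in bits/symbol
l_ave = sum(p * l for p, l in zip(probs, lengths))     # average codeword length
p_max = max(probs)                                     # P, most frequent symbol

print("H(s)  =", round(entropy, 4))
print("lave  =", round(l_ave, 4))
print("H(s) <= lave < H(s) + 1         :", entropy <= l_ave < entropy + 1)
print("lave < H(s) + P + 0.086         :", l_ave < entropy + p_max + 0.086)
```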


    Huffman Decoding

    Bit-Serial Decoding: fixed input bit rate → variable output symbol rate

    (Assume the binary coding tree is available to the decoder.) In practice, the Huffman coding tree can be reconstructed from the symbol-to-codeword mapping table that is known to both the encoder and the decoder.

    Step 1:

    Read the input compressed stream bit-by-bit and traverse the tree until a leaf node is reached.

    Step 2:

    As each bit in the input stream is used, it is discarded. When the leaf node is reached, the Huffman decoder outputs the symbol at the leaf node. This completes the decoding for this symbol.

    Step 3:

    Repeat steps 1 and 2 until all the input is consumed.

    Since the codeword is variable in length, the decoding bit rate is not the same for all symbols.
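A minimal sketch of bit-serial decoding follows. It assumes the symbol-to-codeword table of the seven-symbol example, represents the tree as nested dictionaries, and takes the bit stream as a string of '0'/'1' characters; `build_tree` and `decode_bit_serial` are hypothetical helper names.

```python
def build_tree(code_table):
    """Rebuild the binary code tree (nested dicts) from a symbol-to-codeword table."""
    root = {}
    for symbol, codeword in code_table.items():
        node = root
        for bit in codeword[:-1]:
            node = node.setdefault(bit, {})
        node[codeword[-1]] = symbol          # a leaf stores the symbol itself
    return root

def decode_bit_serial(bits, code_table):
    root = build_tree(code_table)
    node, out = root, []
    for bit in bits:                         # step 1: use one bit, go one level down
        node = node[bit]
        if not isinstance(node, dict):       # step 2: leaf reached, output the symbol
            out.append(node)
            node = root                      # step 3: start again from the root
    return out

CODE = {"a": "10101", "b": "01", "c": "100", "d": "10100",
        "e": "11", "f": "00", "g": "1011"}
stream = "".join(CODE[s] for s in "bead")
print(decode_bit_serial(stream, CODE))
```

Each output symbol costs as many loop iterations as its codeword has bits, which is the variable output rate mentioned above.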


    Lookup-table-Based Decoding: constant decoding rate for all symbols → variable input rate

    The look-up table is constructed at the decoder from the symbol-to-codeword mapping table. If the longest codeword in this table is L bits, then a 2^L-entry lookup table is needed. (Space constraints: for image/video, the longest L = 16 to 20.)

    Look-up table construction:
    Let Ci be the codeword that corresponds to symbol si. Assume Ci has li bits. We form an L-bit address in which the first li bits are Ci and the remaining L − li bits take on all possible combinations of 0 and 1. Thus, for the symbol si, there will be 2^(L − li) addresses. At each entry we form the two-tuple (si, li).


    Decoding Processes:

    1. From the compressed input bit stream, we read L bits into a buffer.

    2. We use the L-bit word in the buffer as an address into the lookup table and obtain the corresponding symbol, say sk. Let the codeword length be lk. We have now decoded one symbol.

    3. We discard the first lk bits from the buffer and append to the buffer the next lk bits from the input, so that the buffer again has L bits.

    4. Repeat steps 2 and 3 until all of the symbols have been decoded.
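The table construction and the four decoding steps can be sketched as follows. This is an illustrative sketch: the bit stream is a Python string, the sliding L-bit buffer is modelled by a read position, the stream is zero-padded so the last window is full, and the number of symbols is passed in explicitly. The code table is that of the seven-symbol example; the function names are assumptions.

```python
def build_lut(code_table):
    """Return (lut, L): a 2^L-entry table of (symbol, codeword length) tuples."""
    L = max(len(c) for c in code_table.values())
    lut = [None] * (1 << L)
    for symbol, codeword in code_table.items():
        free = L - len(codeword)                 # the L - l_i don't-care bits
        base = int(codeword, 2) << free
        for tail in range(1 << free):            # all 2^(L - l_i) completions
            lut[base + tail] = (symbol, len(codeword))
    return lut, L

def decode_lut(bits, code_table, n_symbols):
    lut, L = build_lut(code_table)
    padded = bits + "0" * L                      # keeps the final window L bits wide
    pos, out = 0, []
    for _ in range(n_symbols):
        symbol, length = lut[int(padded[pos:pos + L], 2)]   # step 2: one table access
        out.append(symbol)
        pos += length                            # step 3: drop l_k bits, refill
    return out

CODE = {"a": "10101", "b": "01", "c": "100", "d": "10100",
        "e": "11", "f": "00", "g": "1011"}
lut, L = build_lut(CODE)
print(L, len(lut))
print(decode_lut("01111010110100", CODE, 4))
```

Every symbol costs exactly one table access regardless of its codeword length, at the price of a table with 2^L entries.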


    Memory Efficient and High-Speed Search Huffman Coding, by R. Hashemian, IEEE Trans. Communications, Oct. 1995, pp. 2576-2581

    Due to variable-length coding, the Huffman tree gets progressively sparse as it grows from the root

    → Waste of memory space

    → A lengthy search procedure for locating a symbol

    Ex: if K bits is the length of the longest Huffman code assigned to a set of symbols, the memory size for the symbols may easily reach 2^K words.

    It is desirable to reduce the memory size from the typical value of 2^K to a size proportional to the number of actual symbols.

    → Reduce memory size

    → Quicker access


    Ordering and clustering based Huffman Coding

    groups the codewords (tree nodes) within specified codeword lengths

    Characteristics of the proposed coding scheme:

    1. The search time for more frequent symbols (shorter codes) is substantially reduced compared to less frequent symbols, resulting in an overall faster response.

    2. For long codewords the search for the symbol is also sped up. This is achieved through a specific partitioning technique that groups the code bits in a codeword, and the search for a symbol is conducted by jumping over the groups of bits rather than going through the bits individually.

    3. The growth of the Huffman tree is directed toward one side of the tree.

    → Single-side growing Huffman tree (SGH-tree)


    Ex: H = (S, P), S = {S1, S2, …, Sn}, P = {P1, P2, …, Pn} (P: no. of occurrences)

    For a given source listing H, the table of codeword lengths uniquely groups the symbols into blocks, where each block is specified by its codeword length (CL).

    TABLE I. Reduction Process in the Source List

    s1 48    s1 48    s1 48    s1 48    s1 48    a5 52
    s2 31    s2 31    s2 31    s2 31    s2 31    s1 48
    s3  7    s3  7    a2  8    a3 13    a4 21
    s4  6    s4  6    s3  7    a2  8
    s5  5    s5  5    s4  6
    s6  2    a1  3
    s7  1

    At each step the bottom pair is merged and the new symbol is inserted in descending order.


    Each block of symbols, so defined, occupies one level in the associated Huffman tree.

    CL: codeword length


    Algorithm 1: Creating a Table of Codeword Lengths

    1. The symbols are listed with the probabilities in descending order (the ordering of symbols with equal probabilities is assumed indifferent).
    Next, the pair of symbols at the bottom of the ordered list are merged and as a result a new symbol a1 is created. The symbol a1, with probability equal to the sum of the probabilities of the pair, is then inserted at the proper location in the ordered list.

    To record this operation a codeword length recording (CLR) table is created, which consists of three columns:

    Columns 1 and 2 hold the last pair of symbols before being merged, and column 3, initially empty, is identified as the codeword length (CL) column (Table II).


    In order to make the size of the CLR table small and the hardware design simpler, the new symbol a1 (in general aj) is selected such that its inverse represents the associated table address.

    e.g. For an 8-bit address word,

    a1, the first row in the CLR table, is given the value 1111 1110 (inverse of a1 = 0000 0001)
    a2 is given the value 1111 1101 (inverse of a2 = 0000 0010)

    The sign of a composite symbol (a1, or aj in general) is different from that of an original symbol.

    A dedicated sign (MSB) detector is all that is needed to distinguish between an original symbol and a composite symbol.


    2. Continue applying the same procedure, developed for a single row in step 1, and construct the entire CLR table. Note that Table II contains both the original symbols si and the composite ones aj (carrying opposite signs).

    3. The third column in Table II, designated by CL, is assigned to hold the codeword lengths. To fill up this column we start from the last row in the CLR table and enter 1. This designates the codeword length for both s1 and a5.
    Next, we check the signs of s1 and a5; if positive (MSB = 0) we skip, otherwise the symbol is a composite one (a5) and its binary inverse (0000 0101) is a row address for Table II. We now increment the number in the CL column, and assign the new value (2 in this example) to the CL column in row j (5 in this example), and proceed applying the same operation to the other rows in the CLR table, as we move to the top, until the CL column is completely filled.


    TABLE II. Codeword length recording (CLR) table

    Row No   Si    Si-1   CL
    1        S7    S6     5  (= 4+1)
    2        S5    a1     4  (= 3+1)
    3        S3    S4     4  (= 3+1)
    4        a2    a3     3  (= 2+1)
    5        a4    S2     2  (= 1+1)
    6        S1    a5     1


    Rule:

    (i) If Si or Si-1 is composite: write CL + 1 into the CL column of the row it addresses (the new address).

    (ii) If Si and Si-1 are both original: skip this row and its CL.

    4. Table II indicates that each original symbol in the table has its codeword length (CL) specified.
    Ordering the symbols according to their CL values gives Table III.

    Associated with the so-obtained table of codeword lengths (TOCL) one can actually generate a number of Huffman tables (or Huffman trees), each of which is different in codewords but identical in codeword lengths.
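Algorithm 1 can be sketched in software as below. This is an illustration under stated assumptions, not the hardware scheme itself: original symbols are stored as positive integers (s_i as i) and composite symbols as negative integers (a_j as -j), which stands in for the sign/inverse convention of the slides; the function name `codeword_lengths` is hypothetical. The counts are those of Table I.

```python
def codeword_lengths(counts):
    """counts[i-1] is the occurrence count of s_i. Returns (lengths, CLR table rows)."""
    # Ordered list of (count, id): id > 0 is an original s_id, id < 0 a composite a_|id|.
    items = sorted(((c, i + 1) for i, c in enumerate(counts)), key=lambda t: -t[0])
    clr = []                                   # rows of [first, second, CL]
    while len(items) > 1:
        (c1, x), (c2, y) = items[-2], items[-1]
        items = items[:-2]                     # merge the bottom pair
        clr.append([x, y, None])
        new = (c1 + c2, -len(clr))             # composite a_j addresses CLR row j
        pos = 0
        while pos < len(items) and items[pos][0] >= new[0]:
            pos += 1
        items.insert(pos, new)                 # insert in descending order

    clr[-1][2] = 1                             # last row: codeword length 1
    lengths = [0] * len(counts)
    for first, second, cl in reversed(clr):    # fill CL from the bottom row upward
        for sym in (first, second):
            if sym < 0:                        # composite: pass CL + 1 to its row
                clr[-sym - 1][2] = cl + 1
            else:                              # original: its length is this row's CL
                lengths[sym - 1] = cl
    return lengths, clr

lengths, clr = codeword_lengths([48, 31, 7, 6, 5, 2, 1])
print(lengths)
print([row[2] for row in clr])
```

Because a composite symbol is always created in an earlier row than the one that consumes it, walking the rows from the bottom up guarantees that every row has its CL filled in before it is visited.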


    The single-side growing Huffman table (SGHT), given as a table, and the single-side growing Huffman tree (SGH-tree), shown in Fig. 1, are adopted.


    [Figure: SGH-tree with composite nodes a5, a4, a3, a2, a1.]

    For a uniformly distributed source, the SGH-tree becomes full.


    In general we can write:

    $C_{i+1} = (C_i + 1) \cdot 2^{\,q-p}$

    where p and q are the codeword lengths for si and si+1, respectively, and Sn (associated with Cn) denotes the terminal symbol.

    C1 and Cn have unique forms that are easy to verify:

    $C_1 = 00\ldots0$ and $C_n = 11\ldots1$
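A short sketch of how a single-side growing code can be generated from a sorted list of codeword lengths. It assumes the recurrence C(i+1) = (C(i) + 1) * 2^(q - p), starting from an all-zero C1, where p and q are the lengths of consecutive codewords; this is the usual canonical assignment, and the function name `sgh_codewords` is illustrative. The lengths used are the ones obtained for the seven-symbol source s1 to s7.

```python
def sgh_codewords(lengths):
    """lengths: codeword lengths in non-decreasing order. Returns codewords as bit strings."""
    code = 0                                          # C_1 = 00...0
    words = [format(code, "0{}b".format(lengths[0]))]
    for p, q in zip(lengths, lengths[1:]):
        code = (code + 1) << (q - p)                  # C_{i+1} = (C_i + 1) * 2^(q - p)
        words.append(format(code, "0{}b".format(q)))
    return words

words = sgh_codewords([1, 2, 4, 4, 4, 5, 5])
print(words)
```

When the lengths satisfy the Kraft inequality with equality, the last codeword consists of ones only, so all growth of the tree happens on one side.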


    Super-tree (S-tree)

    [Figure: tree partitioned by the cut lines x-x, y-y and z-z.]

    0 1 2 3 4 5 6 7 8 9 a b c d e f

    0 00 01 02 03 04 05 06 07

    1 08 09 0a 0b 0c 0d 0e 0f 10 11

    2 12 13 14 15 16 17 18 19 1a

    3 1b 1c 1d 1e 1f

    TABLE VII. Memory (RAM) Space Associated with Table VI and Figs. 3 and 4


    Storage Allocation

    For a non-uniform source, the SGH-tree becomes sparse.

    How to
    i) optimize the storage space
    ii) provide quick access to the symbol (data)

    Key idea:

    Break down the SGH-tree into smaller clusters (subtrees) such that the memory efficiency increases.

    The memory efficiency B for a K-level binary Huffman tree:

    $B_K = \frac{\text{No. of effective leaf nodes}}{2^K} \times 100\%$


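    Ex:

    [Figure: SGH-tree with heap-style node numbers. Root 1; node 2 = S1, node 3 = a5; node 6 = S2, node 7 = a4; node 14 = a3, node 15 = a2; node 28 = S3, node 29 = S4, node 30 = S5, node 31 = a1; node 62 = S6, node 63 = S7.]

    Memory efficiency per level:

    Level 1: 2/2^1 : 100%
    Level 2: 3/2^2 : 75%
    Level 3: 4/2^3 : 50%
    Level 4: 6/2^4 : 37%
    Level 5: 7/2^5 : 22%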
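The per-level efficiencies can be reproduced from the codeword lengths alone. The sketch below assumes that the "effective leaf nodes" at level K are the symbols already terminated above level K plus all nodes that sit on level K itself; that reading is an interpretation chosen because it reproduces the numbers of the example, and `memory_efficiency` is a hypothetical helper name.

```python
def memory_efficiency(lengths):
    """Return [B_1, ..., B_K] in percent for a full binary tree with the given leaf depths."""
    depth = max(lengths)
    leaves_at = [lengths.count(k) for k in range(depth + 1)]
    nodes = 1            # nodes on the current level (level 0 holds only the root)
    closed = 0           # symbols that terminated on levels above the current one
    result = []
    for k in range(1, depth + 1):
        nodes = 2 * (nodes - leaves_at[k - 1])       # every internal node has two children
        result.append(100.0 * (closed + nodes) / 2 ** k)
        closed += leaves_at[k]
    return result

# Codeword lengths of s1..s7 in the example: 1, 2, 4, 4, 4, 5, 5
efficiency = memory_efficiency([1, 2, 4, 4, 4, 5, 5])
print(efficiency)
```

The exact values for levels 4 and 5 are 37.5% and 21.875%, which the slide shows as 37% and 22%.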


    Remarks:

    1. The efficiency changes only when we switch to a new level (or equivalently to a new CL), and it decreases as we proceed to the higher levels.

    2. Memory efficiency can be interpreted as a measure of the performance of the system in terms of memory space requirement; and it is directly related to the sparsity of the Huffman tree.

    3. Higher memory efficiency for the top levels (with smaller CL) is a clear indication that partitioning the tree into smaller and less sparse clusters will reduce the memory size. In addition, clustering also helps to reduce the search time for a symbol.

    Definition:

    A cluster (subtree) Ti has minimum memory efficiency (MME) Bi if there is no level in Ti with memory efficiency less than Bi.


    SGH-Tree Clustering

    Given an SGH-tree, as shown in Fig. 2, depending on the MME (or CL) assigned, the tree is partitioned by a cut line, x-x, at the Lth level (L = 4 for the choice of MME = 50%, in this example).

    The first cluster (subtree), as shown in Fig. 3(a), is formed by removing the remainder of the tree beyond the cut line x-x.

    The cluster length is defined to be the maximum path length from the root to a node within the cluster → the cluster length for the first cluster is 4.

    Associated with each cluster a look-up table (LUT) is assigned, as shown at the bottom of Fig. 3(a), to provide the addressing information for the corresponding terminal node (symbol) within the cluster, or beyond.


    To identify other clusters in the tree, we draw more cut lines, y-y and z-z, each L levels apart.

    More clusters are generated, each of which starts from a single node (root of the cluster) and expands until it is terminated either by terminal nodes, or by nodes being intercepted by the next cut line.

    Next, we construct a super-tree (s-tree) corresponding to an SGH-tree. In an s-tree each cluster is represented by a node, and the links connecting these nodes represent the branching nodes in the SGH-tree that are shared between two clusters.

    The super-table (ST) associated with the s-tree is shown at the bottom of the tree.

    Note that the s-tree has 7 nodes, one for each cluster, while its ST has 6 entries. This is because the root cluster a is left out and the table starts from cluster b.


    Entries in the ST and the LUTs

    There are two numbers in each location in the ST:
    the first number identifies the cluster length
    the 2nd number is the offset memory address for that cluster

    Ex: the ST entry for cluster f is 11 (binary) | 2aH (hexa)

    cluster length: 11 + 1 = 100, or 4
    2aH: the starting address of the corresponding LUT in the memory (see table)
    → the cluster f starts at address 2aH in the memory table (i.e. symbol 18)

    Each entry in a LUT is an integer in sign/magnitude format.
    Positive integers, with sign 0, correspond to the nodes existing in the cluster, while negative numbers, with sign 1, represent the nodes being cut by the next cut line.


    The magnitude of a negative integer specifies a location in the ST, for further search.
    For example, consider the cluster C:

    Entry       1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
    Sign        0  0  0  0  0  0  0  0  0  0  0  0  0  1  1  1
    Magnitude   4  4  4  4 10 10 11 11 24 25 26 27 28  3  4  5

    [Figure: subtree of cluster C up to the cut line y-y; the nodes cut by y-y lead on to clusters d, e and f.]


    In the 15th entry of the above LUT we find 1/4 as the sign/magnitude at that location.

    The negative sign (1) indicates that we have to move to another cluster, and 4 refers to the 4th entry in the ST, which corresponds to cluster e and contains the numbers 01 and 26H.

    The first number, 01, indicates that the cluster length is 01 + 1 = 10 (or 2), and the 2nd number shows the starting address of cluster e in the memory.

    A positive entry in a LUT indicates that the symbol is already found, and the magnitude comprises three pieces of information:
    i) location of the symbol in the memory
    ii) the pertinent codeword
    iii) the codeword length


    Read the positive integer in binary form, and identify the most significant 1 bit (MS1B). The position of the MS1B in the binary number specifies the codeword length; the rest of the bits on the right side of the MS1B give the codeword, cj, and the magnitude of cj is, in fact, the relative address of the symbol in the memory (Table).


    For example

    LUT for Fig. 3(a):

    Entry   Sign   Magnitude
     1      0       4
     2      0       4
     3      0       4
     4      0       4
     5      0       5
     6      0       5
     7      0       5
     8      0       5
     9      0      24
    10      0      25
    11      0      26
    12      0      27
    13      0      28
    14      0      29
    15      1       1
    16      1       2

    Entry 14 (address dH) holds 0/29:

    sign 0 → the symbol is found
    29 = 0 0 0 1 1 1 0 1 (bit positions 7 6 5 4 3 2 1 0)
    the MS1B (bit 4) is an indication of the tree depth of the node: cluster length (codeword length) = 4
    the bits to the right of the MS1B, 1101 = dH, give the memory location: symbol 07 is located at 1101 = dH (see Table)
    29)H = f)H + d)H: memory location (address) if the memory table is offset by f

    The 1st row of the table:

    0 1 2 3 4 5 6 7 8 9 a b c d e f
    00 01 02 03 04 05 06 07
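Splitting a positive LUT magnitude at its most significant 1 bit takes two integer operations. The helper below is a sketch with a hypothetical name, `split_lut_magnitude`; it returns the codeword length and the codeword value, the latter being the relative memory address of the symbol.

```python
def split_lut_magnitude(magnitude):
    """Split a positive LUT magnitude into (codeword length, codeword value)."""
    cl = magnitude.bit_length() - 1        # bit position of the MS1B
    codeword = magnitude - (1 << cl)       # the bits to the right of the MS1B
    return cl, codeword

for m in (4, 5, 29):
    cl, cw = split_lut_magnitude(m)
    print(m, "-> CL =", cl, ", codeword =", format(cw, "0{}b".format(cl)))
```

For the magnitude 29 this gives a codeword length of 4 and the codeword 1101, matching the entry discussed above.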


    Huffman Decoding

    The decoding procedure basically starts by receiving an L-bit code cj, where L is the length of the top cluster in the associated SGH-tree (or the SGHT). This L-bit code cj is then used as the address to the associated look-up table [Fig. 3(a)].

    Example:
    1. Received codeword: 01100 1011

    The first L = 4 bits, 0110 = 6, are used as an address to the LUT given in Fig. 3(a). The content of the table at this location is 0/5, as the sign/magnitude.

    MSB = 0 → the symbol is located in this cluster

    5 = 0000101; the MS1B is at location 2 → CL = 2. Next to the MS1B we have 01, which represents the codeword

    → the symbol (01H) is found at the address 01 in the memory (Table)


    2. Received codeword: 1101 10011

    The first 4 bits are 1101 = d)H → at the d-th location of the LUT (Fig. 3(a)) we get 0/29.

    MSB = 0 → the symbol is in this cluster
    29 = 00011101; the bits to the right of the MS1B are 1101
    The address to the memory is 1101 = d)H, and the symbol is found to be 07.

    3. Received codeword: 1111 1111 1111 0010

    The first 4 bits are 1111; MSB = 1 → the symbol is not in cluster a, and we refer to location 2 in the ST, assigned to cluster C.

    The ST entry is 11)2 | 14)H:
    11)2: the cluster length is 11 + 1 = 100, or 4
    14)H: the offset for the symbol memory designated for cluster C, i.e. the memory addressing (table) of cluster C starts at 14)H


    Our next move is to select another slice (of length 4) from the bit stream, which is 1111 again. From the 15th location of the LUT for cluster C we get 1/5.

    → The symbol is not in cluster C
    → Location 5 of the ST → cluster f

    The content is 11)2 | 2a)H:
    11 + 1 = 100, or 4: the cluster length
    2a)H: the offset of the memory location of cluster f

    The symbol is not located yet; we have to choose another 4-bit slice from the bit stream: 1111.

    We move to the 15th location in the LUT for cluster f. Here we find 1/6 and refer to the 6th item in the ST. The data here is read 00)2 and 3a)H:

    00 → the CL of cluster g is 00 + 1 = 01, or 1
    3a)H → the memory location offset of cluster g


    Remarks:

    1. For highly probable symbols with short codewords (4 bits or less) the search for the symbol is very fast, and is completed in the first try. For longer codewords, however, the search time grows almost in proportion to the codeword length.

    If CL is the codeword length and L is the maximum level selected for each cluster (L = 4 in the above example), then the search time is closely proportional to 1 + CL/L.

    2. Increasing L:

    i. Decreases the search time → speeds up decoding

    ii. Grows the memory space requirement

    → Trade-off


    3. Augment S1 by Q to form a new set W. Construct an optimal prefix code for W using the design procedure for unconstrained-length Huffman codewords.
    → codewords cs for symbols in the set S1
    → codeword cq for the symbol Q; cq is the shortened prefix code for symbols in S2

    If li is the length of the i-th codeword of S1, then

    $\max_{s_i \in S_1} l_i \le \max_{s_i \in S_1} \left\lceil \log_2 \frac{1}{p_i} \right\rceil \le L$

    (⌈x⌉ denotes the smallest integer larger than or equal to x)


    Encoding: input message string m1, m2, …, mk

    For all mi ∈ S1, output the corresponding codeword from $C_{s_1}$.

    For all mi ∈ S2, output the codeword cq followed by an L-bit fixed-length binary representation for mi. (Actually, one can use fewer than L bits, since if there are $N_{s_2}$ symbols in S2 and $N_{s_2} < N$, then the fixed-length binary representation for each mi is $\lceil \log_2 N_{s_2} \rceil$ bits.)


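    Let lsh be the average codeword length for the shortened codes, and lw be the average codeword length for W.

    $H(w) = \sum_{i \in S_1} p_i \log_2 \frac{1}{p_i} + q \log_2 \frac{1}{q}$

    $H(s) = \sum_{i \in S_1} p_i \log_2 \frac{1}{p_i} + \sum_{i \in S_2} p_i \log_2 \frac{1}{p_i}$

    $H(w) \le l_w < H(w) + 1$

    $l_{sh} = l_w + qL$

    $H(w) + qL \le l_{sh} < H(w) + 1 + qL$

    Since $p_i < 2^{-L}$ for every symbol in S2, $\sum_{i \in S_2} p_i \log_2 \frac{1}{p_i} > qL$, and therefore

    $l_{sh} < H(s) + 1 + q \log_2 \frac{1}{q}$, to be compared with $H(s) \le l_{ave} < H(s) + 1$

    The worst-case increase in the average codeword length for the shortened code is $q \log_2 \frac{1}{q}$ bits per symbol (this function attains a maximum value of $\frac{\log_2 e}{e} \approx 0.53$ at $q = 1/e$).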


    Example:

    Symbol Si   pi       li   Codeword
     0          0.2820   2    11
     1          0.2786   2    10
     2          0.1419   3    011
     3          0.1389   3    010
     4          0.0514   4    0011
     5          0.0513   4    0010
     6          0.0153   5    00011
     7          0.0153   5    00010
     8          0.0072   6    000011
     9          0.0068   6    000010
    10          0.0038   7    0000011
    11          0.0032   7    0000010
    12          0.0019   7    0000001
    13          0.0013   8    00000001
    14          0.0007   9    000000001
    15          0.0004   9    000000000

    $l_{ave} = \sum_{i=0}^{15} p_i l_i = 2.694$ bits

    The longest codeword is 9 bits → a 2^9 = 512-entry table is needed for lookup-table-based decoding.

    Now suppose only a 128-entry lookup table can be permitted.

    → 7-bit shortened Huffman code


    Code construction

    1. $S_1 = \{s_0, s_1, \ldots, s_7\}$, $S_2 = \{s_8, s_9, \ldots, s_{15}\}$, since $P_i < \frac{1}{128}$ for i = 8 to 15.

    $q = \sum_{i=8}^{15} P_i = 0.0253$

    Symbol     S0       S1       S2       S3       S4       S5       S6       S7       Q
    Codeword   11       01       101      100      0011     0010     00011    00010    0000
    Prob.      0.2820   0.2786   0.1419   0.1389   0.0514   0.0513   0.0153   0.0153   0.0253


    Symbol i   pi       li   Codeword   Additional
     0         0.2820    2   11
     1         0.2786    2   01
     2         0.1419    3   101
     3         0.1389    3   100
     4         0.0514    4   0011
     5         0.0513    4   0010
     6         0.0153    5   00011
     7         0.0153    5   00010
     8         0.0072   11   0000       0001000
     9         0.0068   11   0000       0001001
    10         0.0038   11   0000       0001010
    11         0.0032   11   0000       0001011
    12         0.0019   11   0000       0001100
    13         0.0013   11   0000       0001101
    14         0.0007   11   0000       0001110
    15         0.0004   11   0000       0001111


    2. For all symbols in S2 we need a prefix code of 4 bits followed by a 7-bit representation for the specific symbol in S2.

    $l_{sh} = \sum_{i=0}^{7} p_i l_i + 11 \sum_{i=8}^{15} p_i = 2.8057$ bits
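The figures of this example can be recomputed directly from the tables. The snippet is a plain arithmetic check; the list names are illustrative, and the codeword lengths are copied from the two code tables of the example (unconstrained Huffman code, and the code for W with a 4-bit prefix cq).

```python
p = [0.2820, 0.2786, 0.1419, 0.1389, 0.0514, 0.0513, 0.0153, 0.0153,
     0.0072, 0.0068, 0.0038, 0.0032, 0.0019, 0.0013, 0.0007, 0.0004]
huffman_len = [2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 7, 8, 9, 9]   # unconstrained code
w_len = [2, 2, 3, 3, 4, 4, 5, 5]                                  # lengths for S0..S7 in W
cq_len = 4                                                        # length of the prefix cq
L = 7                                                             # 128-entry lookup table

threshold = 2.0 ** -L
s2 = [i for i, pi in enumerate(p) if pi < threshold]   # symbols moved into S2
q = sum(p[i] for i in s2)

l_ave = sum(pi * li for pi, li in zip(p, huffman_len))
l_sh = sum(p[i] * w_len[i] for i in range(len(w_len))) + (cq_len + L) * q

print("S2 =", s2)
print("q    =", round(q, 4))
print("lave =", round(l_ave, 4))
print("lsh  =", round(l_sh, 4))
```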


    Decoding:

    1. We first construct a lookup table as described above.

    2. From the input bit stream, we fetch bits into a buffer until the buffer has 7 bits. We access the lookup table location, using the 7 bits as an address. This lookup table location contains (mk, lk).

    3. The first lk bits in the buffer are discarded by shifting the buffer contents to the left by lk bit positions.

    If mk ≠ Q, then mk ∈ {S0, S1, …, S7}, and thus we have correctly decoded this symbol.

    If mk = Q, additional bits from the input bit stream are needed for decoding. We fetch lk bits from the bit stream to fill up the buffer. The buffer now contains the binary representation for one of the symbols S8, S9, …, S15, and thus we have correctly decoded a symbol from S2.

    4. Repeat steps 2 and 3 until the complete message has been decoded.
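The two-level scheme can be sketched end to end with the shortened code of the example. The sketch makes several simplifying assumptions: symbols are the integers 0 to 15, the bit stream is a string, the 7-bit buffer is modelled by a read position with zero padding at the end, and the number of symbols to decode is known. The names `encode`, `build_lut` and `decode` are illustrative.

```python
L = 7   # lookup table address width (128 entries)
W_CODE = {0: "11", 1: "01", 2: "101", 3: "100", 4: "0011", 5: "0010",
          6: "00011", 7: "00010", "Q": "0000"}

def encode(message):
    out = []
    for m in message:
        if m in W_CODE:                                   # symbol of S1
            out.append(W_CODE[m])
        else:                                             # symbol of S2: cq + L-bit index
            out.append(W_CODE["Q"] + format(m, "0{}b".format(L)))
    return "".join(out)

def build_lut():
    lut = [None] * (1 << L)
    for sym, cw in W_CODE.items():
        free = L - len(cw)
        base = int(cw, 2) << free
        for tail in range(1 << free):
            lut[base + tail] = (sym, len(cw))
    return lut

def decode(bits, n_symbols):
    lut = build_lut()
    padded = bits + "0" * L
    pos, out = 0, []
    for _ in range(n_symbols):
        sym, length = lut[int(padded[pos:pos + L], 2)]    # first-level lookup
        pos += length
        if sym == "Q":                                    # second level: raw L-bit index
            sym = int(padded[pos:pos + L], 2)
            pos += L
        out.append(sym)
    return out

message = [0, 9, 7, 15, 1]
bits = encode(message)
print(bits)
print(decode(bits, len(message)))
```

A symbol of S2 needs a second read after the table access, which is why the output symbol rate is not constant for this scheme.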


    The key disadvantage of two-level decoding is that we cannot guarantee a constant symbol rate at the Huffman decoder output.

    Lookup table size (entries)   Worst case lsh − lave (bits/symbol)
     16                           0.4213
     32                           0.2326
     64                           0.2326
    128                           0.1342
    256                           0.0731
    512                           0.0338
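The worst-case column appears to follow from the bound q·log2(1/q), where q is the total probability of the symbols whose probability falls below 2^(-L) for a table of 2^L entries. The sketch below evaluates that expression for the 16-symbol example; treating the column as this bound is an inference from the numbers, and `worst_case_increase` is a hypothetical helper name.

```python
import math

p = [0.2820, 0.2786, 0.1419, 0.1389, 0.0514, 0.0513, 0.0153, 0.0153,
     0.0072, 0.0068, 0.0038, 0.0032, 0.0019, 0.0013, 0.0007, 0.0004]

def worst_case_increase(probs, L):
    """q * log2(1/q), with q the total probability of the symbols below 2^-L."""
    q = sum(pi for pi in probs if pi < 2.0 ** -L)
    return q * math.log2(1.0 / q) if q > 0 else 0.0

table = {2 ** L: round(worst_case_increase(p, L), 4) for L in range(4, 10)}
for size, increase in table.items():
    print(size, increase)
```

The 32-entry and 64-entry rows coincide because no symbol probability lies between 1/64 and 1/32, so the set S2 is the same in both cases.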

    Constrained-length Huffman codes: prefix-free, constant output decoding rate with a table-lookup decoder

    For a maximum codeword length of L bits, we define a threshold T = 2^(-L)

    Sort si, i = 1, 2, …, N, so that pk ≥ pk+1
    For each pi, if pi < T, set pi = T

    Design the codebook using the modified pi values and the unconstrained-length Huffman code table design approach.

    Since pi is restricted to be at least 2^(-L), no codeword length will exceed L bits.

    Codeword length ∝ 1/pi: not guaranteed. This is due to the fact that some of the probabilities were set to the threshold T and hence the ordering among the probabilities is obscured.

    → Reordered: rearranging is done by simply sorting the codeword lengths in ascending order of magnitude and associating this sorted list with the corresponding list of codewords.
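The thresholding procedure can be sketched as follows, using the 16-symbol example with L = 7. The Huffman length computation is a generic heap-based helper written for this sketch (`huffman_lengths` and `constrained_lengths` are assumed names); the clipped weights no longer sum to one, which does not matter for the merging procedure.

```python
import heapq
from itertools import count

def huffman_lengths(weights):
    """Codeword length of each entry after standard Huffman merging."""
    tiebreak = count()
    heap = [(w, next(tiebreak), [i]) for i, w in enumerate(weights)]
    heapq.heapify(heap)
    lengths = [0] * len(weights)
    while len(heap) > 1:
        w0, _, g0 = heapq.heappop(heap)
        w1, _, g1 = heapq.heappop(heap)
        for i in g0 + g1:                 # every member of a merged group gets one more bit
            lengths[i] += 1
        heapq.heappush(heap, (w0 + w1, next(tiebreak), g0 + g1))
    return lengths

def constrained_lengths(probs, L):
    """probs sorted in descending order; returns reordered codeword lengths."""
    T = 2.0 ** -L
    clipped = [max(pi, T) for pi in probs]        # raise small probabilities to T
    return sorted(huffman_lengths(clipped))       # shortest lengths to the most probable

p = [0.2820, 0.2786, 0.1419, 0.1389, 0.0514, 0.0513, 0.0153, 0.0153,
     0.0072, 0.0068, 0.0038, 0.0032, 0.0019, 0.0013, 0.0007, 0.0004]
lengths = constrained_lengths(p, 7)
l_ave = sum(pi * li for pi, li in zip(p, lengths))
print(lengths)
print(round(l_ave, 4))
```

Sorting the lengths before pairing them with the descending probabilities is the reordering step: it cannot increase the average length, and it leaves the set of codeword lengths (hence the Kraft sum) unchanged.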


    Constrained-length Huffman codes: single-layer decoding

    Symbol i   pi       l   Codeword   Reordered l   Reordered codeword
     0         0.2820   2   11         2             11
     1         0.2786   2   01         2             01
     2         0.1419   3   101        3             101
     3         0.1389   3   100        3             100
     4         0.0514   4   0010       4             0010
     5         0.0513   4   0001       4             0001
     6         0.0153   6   001100     6             001100
     7         0.0153   6   001101     6             001101
     8         0.0072   7   0011110    6             000010
     9         0.0068   7   0011111    6             000011
    10         0.0038   7   0011100    6             000000
    11         0.0032   7   0011101    6             000001
    12         0.0019   6   000010     7             0011110
    13         0.0013   6   000011     7             0011111
    14         0.0007   6   000000     7             0011100
    15         0.0004   6   000001     7             0011101

    lave                    2.7308                   2.7141


    Constrained-length Huffman codes: the Voorhis method [1974, IEEE Trans. IT]: near-optimum codeword lengths

    Given symbols S1, S2, …, SN with probabilities P1, P2, …, PN, and P1 ≥ P2 ≥ … ≥ PN:

    Determine code lengths l1, l2, …, lN that minimize $\sum_{i=1}^{N} p_i l_i$ subject to the constraints 1 ≤ li ≤ L. For uniquely decodable codes, we also require that $\sum_{i=1}^{N} 2^{-l_i} \le 1$. The resulting codeword lengths will be such that 1 ≤ l1 ≤ l2 ≤ … ≤ lN ≤ L.

    The i-th codeword is the first li bits of the fraction computed by $\sum_{k=1}^{i-1} 2^{-l_k}$.

    For an N-symbol alphabet, if L = ⌈log2 N⌉ + d, then this method has a complexity of O(dN^2).


    Symbol Si   Voorhis code
     0          11
     1          10
     2          011
     3          010
     4          0011
     5          0010
     6          00011
     7          00010
     8          0000111
     9          0000110
    10          0000101
    11          0000100
    12          0000011
    13          0000010
    14          0000001
    15          0000000

    lave        2.7045
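Given the codeword lengths, the codewords themselves follow from the cumulative-fraction rule: the i-th codeword is the first li bits of the binary fraction formed by summing 2^(-lk) over the preceding codewords. The sketch below applies that rule to the lengths read off the Voorhis table; it does not implement the length optimization itself. The codewords in the table appear to be the bitwise complement of what the rule produces, which is an observation made here rather than something stated in the slides; `codewords_from_lengths` and `complement` are illustrative names.

```python
from fractions import Fraction

def codewords_from_lengths(lengths):
    """lengths: non-decreasing, satisfying the Kraft inequality."""
    words, acc = [], Fraction(0)
    for l in lengths:
        value = int(acc * 2 ** l)                  # first l bits of the binary fraction acc
        words.append(format(value, "0{}b".format(l)))
        acc += Fraction(1, 2 ** l)
    return words

def complement(word):
    return "".join("1" if b == "0" else "0" for b in word)

lengths = [2, 2, 3, 3, 4, 4, 5, 5, 7, 7, 7, 7, 7, 7, 7, 7]
words = codewords_from_lengths(lengths)
print(words)
print([complement(w) for w in words])
```

Complementing every bit keeps the code prefix-free, so either form can be used as long as encoder and decoder agree.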


    Home Work:

    1. Consider codes that satisfy the suffix condition, which says that no codeword is a suffix of any other codeword. Show that a suffix condition code is uniquely decodable, and show that the minimum average length over all codes satisfying the suffix condition is the same as the average length of the Huffman code for that random variable.

    Suffix code

    2. Suppose that X = i with probability Pi, i = 1, 2, …, m. Let li be the number of binary symbols in the codeword associated with X = i, and let ci denote the cost per letter of the codeword when X = i. Thus the average cost C of the description of X is

    $C = \sum_{i=1}^{m} p_i c_i l_i$


    a) Minimize C over all l1, l2, …, lm such that $\sum_i 2^{-l_i} \le 1$. Ignore any implied integer constraints on li. Exhibit the minimizing l1*, l2*, …, lm* and the associated minimum value C*.

    b) How would you use the Huffman code procedure to minimize C over all uniquely decodable codes? Let CHuffman denote this minimum. Show that

    $C^* \le C_{Huffman} \le C^* + \sum_{i=1}^{m} p_i c_i$

    Huffman codes with costs.


    3. A computer generates a number X according to a known prob. mass function p(x), x ∈ {1, 2, …, 100}. The player asks arbitrary Yes-No questions sequentially until X is determined. If he is right (i.e., X is determined), he receives a prize of value v(x).

    a) How should the player proceed to maximize his expected winnings? What is his expected return?

    b) Continuing (a), what if v(x) is fixed, but p(x) can be chosen by the computer (and then announced to the player)? The computer wishes to minimize the player's expected return. What should p(x) be? What is the expected return to the player?

    The game of Hi-Lo.


    4. Although the codeword lengths of an optimal variable-length code are complicated functions of the message probabilities {p1, p2, …, pm}, it can be said that less probable symbols are encoded into longer codewords. Suppose that the message probabilities are given in decreasing order p1 ≥ p2 ≥ … ≥ pm.

    a) Prove that for any binary Huffman code, if the most probable message symbol has probability p1 > 2/5, then that symbol must be assigned a codeword of length 1.

    b) Prove that for any binary Huffman code, if the most probable message symbol has probability p1