UNIT II TEXT COMPRESSION

Slide 2: Outline

Compression techniques: run-length coding, Huffman coding, adaptive Huffman coding, arithmetic coding, Shannon-Fano coding, dictionary techniques, and the LZW family of algorithms.

Slide 3: Introduction

Compression: the process of coding that will effectively reduce the total number of bits needed to represent certain information. (Fig. 1: A general data compression scheme.)

Slide 4: Data compression methods

Data compression implies sending or storing a smaller number of bits. Although many methods are used for this purpose, in general these methods can be divided into two broad categories: lossless and lossy methods.

Slide 5: Introduction (contd.)

If the compression and decompression processes induce no information loss, then the compression scheme is lossless; otherwise, it is lossy. The compression ratio is B0/B1, where B0 is the number of bits before compression and B1 is the number of bits after compression. In general, we would like any codec (encoder/decoder scheme) to have a compression ratio much larger than 1.0. The higher the compression ratio, the better the lossless compression scheme, as long as it is computationally feasible.

Slide 6: Basics of Information Theory

What is entropy? Entropy is a measure of the number of specific ways in which a system may be arranged, commonly understood as a measure of the disorder of a system. For an information source S with symbol probabilities p_i, the entropy is H(S) = sum_i p_i * log2(1/p_i) bits per symbol. As an example, if the information source S is a gray-level digital image, each s_i is a gray-level intensity ranging from 0 to 2^k - 1, where k is the number of bits used to represent each pixel in an uncompressed image. The entropy of this image gives the minimum average number of bits per pixel needed to represent the image after lossless compression.

Slide 7: Run-Length Coding

RLC is one of the simplest forms of data compression. The basic idea is that if the information source has the property that symbols tend to form continuous groups, then such a symbol and the length of its group can be coded. Consider a screen containing plain black text on a solid white background: there will be many long runs of white pixels in the blank space, and many short runs of black pixels within the text. Take a hypothetical single scan line, with B representing a black pixel and W representing white:

    WWWWWBWWWWBBBWWWWWWBWWW

Applying the run-length encoding (RLE) data compression algorithm to this scan line gives:

    5W1B4W3B6W1B3W

The run-length code represents the original 23 characters in only 14.

Slide 8: Variable-Length Coding

Variable-length coding (VLC) is one of the best-known entropy coding methods. Here, we will study the Shannon-Fano algorithm, Huffman coding, and adaptive Huffman coding.

Slide 9: Shannon-Fano Algorithm

To illustrate the algorithm, let us suppose the symbols to be coded are the characters in the word HELLO. The frequency count of the symbols is:

    Symbol  H  E  L  O
    Count   1  1  2  1

The encoding steps of the Shannon-Fano algorithm can be presented in the following top-down manner:
1. Sort the symbols according to the frequency count of their occurrences.
2. Recursively divide the symbols into two parts, each with approximately the same number of counts, until all parts contain only one symbol.
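The two steps above can be expressed as a short recursive routine. The following Python sketch is not from the slides; it is a minimal illustration that assumes ties in the split are broken toward the earlier split point, which reproduces the HELLO tree described on the next slide.

    # Minimal Shannon-Fano sketch (illustrative, not from the slides).
    # Sort symbols by count, then recursively split into two parts with
    # approximately equal total counts; the left part gets bit '0',
    # the right part bit '1'.
    def shannon_fano(freqs):
        symbols = sorted(freqs, key=lambda sc: sc[1], reverse=True)
        codes = {}

        def split(group, prefix):
            if len(group) == 1:
                codes[group[0][0]] = prefix or "0"
                return
            total = sum(count for _, count in group)
            best_i, best_diff, running = 1, None, 0
            for i in range(1, len(group)):
                running += group[i - 1][1]
                diff = abs(total - 2 * running)   # imbalance of this split point
                if best_diff is None or diff < best_diff:
                    best_i, best_diff = i, diff   # earliest split wins ties
            split(group[:best_i], prefix + "0")
            split(group[best_i:], prefix + "1")

        split(symbols, "")
        return codes

    print(shannon_fano([("H", 1), ("E", 1), ("L", 2), ("O", 1)]))
    # {'L': '0', 'H': '10', 'E': '110', 'O': '111'}  -- matches Table 7.1 below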
Slide 10: Shannon-Fano Algorithm (contd.)

A natural way of implementing the above procedure is to build a binary tree. As a convention, let us assign bit 0 to the left branches and 1 to the right branches. Initially, the symbols are sorted as LHEO. As Fig. 7.3 shows, the first division yields two parts: L with a count of 2, denoted as L:(2); and H, E and O with a total count of 3, denoted as H, E, O:(3). The second division yields H:(1) and E, O:(2). The last division is E:(1) and O:(1).

Slide 11: Fig. 7.3: Coding tree for HELLO by Shannon-Fano. (Li & Drew)

Slide 12: Table 7.1: Result of performing Shannon-Fano on HELLO (Li & Drew)

    Symbol  Count  log2(1/p_i)  Code  # of bits used
    L       2      1.32         0     2
    H       1      2.32         10    2
    E       1      2.32         110   3
    O       1      2.32         111   3
    TOTAL # of bits: 10

Slide 13: Fig. 7.4: Another coding tree for HELLO by Shannon-Fano. (Li & Drew)

Slide 14: Another result of performing Shannon-Fano on HELLO (see Fig. 7.4) (Li & Drew)

    Symbol  Count  log2(1/p_i)  Code  # of bits used
    L       2      1.32         00    4
    H       1      2.32         01    2
    E       1      2.32         10    2
    O       1      2.32         11    2
    TOTAL # of bits: 10

Slide 15: Shannon-Fano Algorithm: Analysis

The Shannon-Fano algorithm delivers satisfactory coding results for data compression, but it was soon outperformed and overtaken by the Huffman coding method. The Huffman algorithm requires prior statistical knowledge about the information source, and such information is often not available. This is particularly true in multimedia applications, where future data is unknown before its arrival, as for example in live (or streaming) audio and video. Even when the statistics are available, the transmission of the symbol table could represent heavy overhead. The solution is adaptive Huffman coding, in which statistics are gathered and updated dynamically as the data stream arrives.

Slide 16: Lossless Compression

In lossless data compression, the integrity of the data is preserved. The original data and the data after compression and decompression are exactly the same because, in these methods, the compression and decompression algorithms are exact inverses of each other: no part of the data is lost in the process. Redundant data is removed in compression and added back during decompression. Lossless compression methods are normally used when we cannot afford to lose any data.

Slide 17: Run-length encoding

Run-length encoding is probably the simplest method of compression. It can be used to compress data made of any combination of symbols. It does not need to know the frequency of occurrence of symbols and can be very efficient if the data is represented as 0s and 1s. The general idea behind this method is to replace consecutive repeating occurrences of a symbol by one occurrence of the symbol followed by the number of occurrences. The method can be even more efficient if the data uses only two symbols (for example 0 and 1) in its bit pattern and one symbol is more frequent than the other.

Slide 18: Run-length encoding example (figure)

Slide 19: Run-length encoding for two symbols (figure)

Slide 20: Huffman coding

Huffman coding assigns shorter codes to symbols that occur more frequently and longer codes to those that occur less frequently. For example, imagine we have a text file that uses only five characters (A, B, C, D, E). Before we can assign bit patterns to each character, we assign each character a weight based on its frequency of use. In this example, assume that the frequency of the characters is as shown in Table 15.1.
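Table 15.1 itself does not survive in this transcript, so the following Python sketch uses assumed example frequencies for A-E; it is only a minimal illustration of how a Huffman code can be built with a priority queue, not the slides' own construction.

    # Minimal Huffman code construction (illustrative; the weights below are
    # assumed, not necessarily the Table 15.1 values).
    import heapq
    import itertools

    def huffman_codes(freqs):
        counter = itertools.count()   # tie-breaker so heap tuples compare cleanly
        # each heap entry: (subtree weight, tie-breaker, {symbol: code-so-far})
        heap = [(w, next(counter), {sym: ""}) for sym, w in freqs.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            w1, _, c1 = heapq.heappop(heap)      # two lowest-weight subtrees
            w2, _, c2 = heapq.heappop(heap)
            merged = {s: "0" + code for s, code in c1.items()}        # left branch: 0
            merged.update({s: "1" + code for s, code in c2.items()})  # right branch: 1
            heapq.heappush(heap, (w1 + w2, next(counter), merged))
        return heap[0][2]

    weights = {"A": 17, "B": 12, "C": 12, "D": 27, "E": 32}   # assumed weights
    codes = huffman_codes(weights)
    for sym in sorted(codes):
        print(sym, codes[sym])
    # The more frequent characters (D, E) receive 2-bit codes, while the
    # rarer B and C receive 3-bit codes.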
Slide 21: Huffman coding (Table 15.1 and the corresponding tree; figure)

Slide 22: Final tree and code

A character's code is found by starting at the root and following the branches that lead to that character. The code itself is the bit value of each branch on the path, taken in sequence. (Figure: final tree and code.)

Slide 23: Encoding

Let us see how to encode text using the code for our five characters. Figure 15.6 shows the original and the encoded text. (Figure: Huffman encoding.)

Slide 24: Decoding

The recipient has a very easy job in decoding the data it receives. The figure shows how decoding takes place. (Figure: Huffman decoding.)

Slide 25: Adaptive Huffman Coding

Extended Huffman coding (covered in the book) groups symbols together. In adaptive Huffman coding, statistics are gathered and updated dynamically as the data stream arrives.

    ENCODER
    -------
    Initial_code();
    while not EOF {
        get(c);
        encode(c);
        update_tree(c);
    }

    DECODER
    -------
    Initial_code();
    while not EOF {
        decode(c);
        output(c);
        update_tree(c);
    }

Slide 26: Adaptive Coding

Motivations: the previous algorithms (both Shannon-Fano and Huffman) require statistical knowledge, which is often not available (e.g., live audio and video). Even when it is available, it can be a heavy overhead, and higher-order models incur still more overhead. For example, a 255-entry probability table would be required for an order-0 model; an order-1 model would require 255 such probability tables (an order-1 model considers the probabilities of occurrences of pairs of symbols). The solution is to use adaptive algorithms. Adaptive Huffman coding is one such mechanism that we will study; the idea of adaptiveness is, however, applicable to other compression algorithms as well.

Slide 27: Adaptive Coding (contd.)

    ENCODER
    Initialize_model();
    do {
        c = getc(input);
        encode(c, output);
        update_model(c);
    } while (c != eof);

    DECODER
    Initialize_model();
    while ((c = decode(input)) != eof) {
        putc(c, output);
        update_model(c);
    }

The key is that both the encoder and the decoder use exactly the same Initialize_model and update_model routines.

Slide 28: The Sibling Property

The node numbers are assigned in such a way that:
1. A node with a higher weight will have a higher node number.
2. A parent node will always have a higher node number than its children.
In a nutshell, the sibling property requires that the nodes (internal and leaf) are arranged in order of increasing weights. The update procedure swaps nodes that violate the sibling property. Nodes in violation of the sibling property are identified using the notion of a block: all nodes that have the same weight are said to belong to one block.

Slide 29: Flowchart of the update procedure

(Flowchart, summarized.) On the first appearance of a symbol, the NYT node gives birth to a new NYT node and a new external node for the symbol; the weights of the new external node and of the old NYT node are incremented, the node numbers are adjusted, and the procedure continues up the tree from the old NYT node. If the symbol is already in the tree, the procedure goes to the symbol's external node. Moving up toward the root: if the current node's number is not the maximum in its block, it is switched with the highest-numbered node in the block; the node weight is incremented; if the node is not the root, the procedure moves to its parent and repeats, stopping at the root.

The Huffman tree is initialized with a single node, known as the Not-Yet-Transmitted (NYT) or escape code. This code is sent every time a new character, one not yet in the tree, is encountered, followed by the ASCII encoding of the character. This allows the decompressor to distinguish between a code and a new character. The procedure also creates a new node for the character and a new NYT node from the old NYT node. The root node will have the highest node number because it has the highest weight.
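As a small executable companion to the sibling property, here is a Python sketch (not part of the slides) that checks the two conditions: listed by node number, the weights must be non-decreasing, and every parent must have a higher node number than its children. The node numbers and weights below are those shown for the example tree on the next slide (Slide 30); the parent links are an inferred, consistent reconstruction of that figure.

    # Sibling-property check (illustrative sketch).
    # Each node is a tuple (node_number, weight, parent_number); the root's
    # parent is None.
    def satisfies_sibling_property(nodes):
        ordered = sorted(nodes, key=lambda n: n[0])
        weights = [w for _, w, _ in ordered]
        # 1. In order of node number, weights must be non-decreasing.
        if any(weights[i] > weights[i + 1] for i in range(len(weights) - 1)):
            return False
        # 2. A parent always has a higher node number than its children.
        return all(parent is None or parent > number
                   for number, _, parent in nodes)

    # Node numbers and weights from the example tree on Slide 30;
    # parent links are inferred for illustration.
    example = [
        (0, 0, 4),     # NYT
        (1, 2, 4),     # B
        (2, 2, 5),     # C
        (3, 2, 5),     # D
        (4, 2, 6),
        (5, 4, 6),
        (6, 6, 8),
        (7, 10, 8),    # E
        (8, 16, None)  # root
    ]
    print(satisfies_sibling_property(example))   # True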
Slide 30: Example

Counts (number of occurrences): B:2, C:2, D:2, E:10. (Figures: the initial Huffman tree, consisting only of the NYT node #0, and an example Huffman tree after some symbols have been processed, with nodes numbered #0-#8 in accordance with the sibling property; the root has weight 16.)

Slide 31: Example (contd.)

Counts: A:1, B:2, C:2, D:2, E:10. (Figure: the Huffman tree after the first appearance of symbol A; the old NYT node has given birth to a new NYT node and the external node A with weight 1.)

Slide 32: Increment

Counts: A:1+1, B:2, C:2, D:2, E:10. (Figure: an increment in the count for A propagates up to the root.)

Slide 33: Swapping

Counts: A:2+1, B:2, C:2, D:2, E:10. Another increment in the count for A results in a swap: nodes 1 and 5 are swapped. (Figures: the tree before and after the swap.)

Slide 34: Swapping (contd.)

Counts: A:3+1, B:2, C:2, D:2, E:10. Another increment in the count for A propagates up. (Figure.)

Slide 35: Swapping (contd.)

Counts: A:4+1, B:2, C:2, D:2, E:10. Another increment in the count for A causes a swap of a sub-tree: nodes 5 and 6 are swapped. (Figure.)

Slide 36: Swapping (contd.)

Counts: A:4+1, B:2, C:2, D:2, E:10. Further swapping is needed to fix the tree: nodes 8 and 9 are swapped. (Figure.)

Slide 37: Swapping (contd.)

Counts: A:5, B:2, C:2, D:2, E:10. (Figure: the tree after the updates are complete.)

Slide 38: Arithmetic Coding

Arithmetic coding is based on the concept of interval subdividing. In arithmetic coding, a source ensemble is represented by an interval between 0 and 1 on the real number line. Each symbol of the ensemble narrows this interval; as the interval becomes smaller, the number of bits needed to specify it grows. Arithmetic coding assumes an explicit probabilistic model of the source. It uses the probabilities of the source messages to successively narrow the interval used to represent the ensemble. A high-probability message narrows the interval less than a low-probability message, so that high-probability messages contribute fewer bits to the coded ensemble.

Slide 39: Arithmetic Coding: Description

In the following discussion, we use M as the size of the alphabet of the data source, N[x] as symbol x's probability, and Q[x] as symbol x's cumulative probability (i.e., Q[i] = N[0] + N[1] + ... + N[i]). Assuming we know the probabilities of each symbol of the data source, we can allocate to each symbol an interval with a width proportional to its probability, such that the intervals do not overlap. This can be done by using the cumulative probabilities as the two ends of each interval: the two ends of symbol x's interval are Q[x-1] and Q[x], and symbol x is said to own the range [Q[x-1], Q[x]).
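The interval allocation described above is easy to compute. The following Python sketch (not from the slides) builds the [Q[x-1], Q[x]) range owned by each symbol from a probability table; the probabilities used are the ones from the encoder example on Slide 42.

    # Build the half-open interval [Q[x-1], Q[x]) owned by each symbol,
    # using cumulative probabilities as the interval ends.
    def build_intervals(probs):
        intervals = {}
        low = 0.0
        for symbol, p in probs:     # symbols in a fixed, agreed-upon order
            intervals[symbol] = (low, low + p)
            low += p
        return intervals

    # Probabilities taken from the encoder example on Slide 42.
    probs = [("A", 0.4), ("B", 0.3), ("C", 0.2), ("D", 0.1)]
    for sym, (lo, hi) in build_intervals(probs).items():
        print(sym, lo, hi)
    # A owns [0.0, 0.4), B owns [0.4, 0.7), C owns [0.7, 0.9), D owns [0.9, 1.0)
    # (floating-point rounding may make the printed ends differ slightly)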
Slide 40: Arithmetic Coding: Encoder

We begin with the interval [0, 1) and subdivide it iteratively. For each symbol read, the current interval is divided according to the probabilities of the alphabet, and the sub-interval corresponding to that symbol becomes the new current interval. The procedure continues until all symbols in the message have been processed. Since the symbols' intervals do not overlap, each possible message is assigned a unique interval, and we can represent the message by the interval's two ends [L, H). In fact, any single value inside the interval is enough as the encoded code, and usually the left end L is selected.

Slide 41: Arithmetic Coding Algorithm

    L = 0.0; H = 1.0;
    while ((x = getc(input)) != EOF) {
        R = (H - L);
        H = L + R * Q[x];
        L = L + R * Q[x-1];
    }
    Output(L);

R is the interval range, and H and L are the two ends of the current code interval. x is the new symbol to be encoded. L and H are initialized to 0 and 1, respectively.

Slide 42: Arithmetic Coding: Encoder example

    Symbol, x   Probability, N[x]   [Q[x-1], Q[x])
    A           0.4                 [0.0, 0.4)
    B           0.3                 [0.4, 0.7)
    C           0.2                 [0.7, 0.9)
    D           0.1                 [0.9, 1.0)

String: BCAB. Starting from [0, 1), the interval narrows step by step: after B it is [0.4, 0.7), after C it is [0.61, 0.67), after A it is [0.61, 0.634), and after the final B it is [0.6196, 0.6268). Code sent: 0.6196.

Slide 43: Decoding Algorithm

When decoding, the code v is placed on the current code interval to find the symbol x such that Q[x-1] <= v < Q[x].

Slide 45: Dictionary-Based Compression

The compression algorithms we have studied so far use a statistical model to encode single symbols; compression comes from encoding symbols into bit strings that use fewer bits. Dictionary-based algorithms do not encode single symbols as variable-length bit strings; they encode variable-length strings of symbols as single tokens. The tokens form an index into a phrase dictionary, and if the tokens are smaller than the phrases they replace, compression occurs. Dictionary-based compression is easier to understand because it uses a strategy that programmers are familiar with: using indexes into databases to retrieve information from large amounts of storage (think of telephone numbers or postal codes).

Slide 46: Dictionary-Based Compression: Example

Consider the Random House Dictionary of the English Language, Second Edition, Unabridged. Using this dictionary, the string

    A good example of how dictionary based compression works

can be coded as:

    1/1 822/3 674/4 1343/60 928/75 550/32 173/46 421/2

Coding uses the dictionary as a simple lookup table: each word is coded as x/y, where x gives the page in the dictionary and y gives the number of the word on that page. The dictionary has 2,200 pages with fewer than 256 entries per page, so x requires 12 bits and y requires 8 bits, i.e., 20 bits (2.5 bytes) per word. Using ASCII coding, the above string requires 48 bytes, whereas our encoding requires only 20.
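To make the x/y coding idea concrete, here is a toy Python sketch; the miniature "dictionary" of pages below is invented for illustration (it is not the Random House dictionary), so the byte counts it prints will not match the figures quoted on the slide.

    # Toy dictionary-based coder (illustrative only; the page layout is made up).
    toy_dictionary = {            # page number -> ordered list of words on that page
        1:  ["a", "an", "and"],
        5:  ["based", "basic"],
        7:  ["compression", "computer"],
        12: ["dictionary", "digital"],
        20: ["example", "extra"],
        33: ["good", "great"],
        41: ["how", "house"],
        55: ["of", "off"],
        60: ["works", "world"],
    }

    def encode(text):
        tokens = []
        for word in text.lower().split():
            for page, words in toy_dictionary.items():
                if word in words:
                    tokens.append((page, words.index(word) + 1))  # y is 1-based
                    break
            else:
                raise ValueError(f"'{word}' is not in the toy dictionary")
        return tokens

    text = "A good example of how dictionary based compression works"
    tokens = encode(text)
    print(tokens)
    # As on the slide, assume 12 bits for the page and 8 bits for the word
    # number: 20 bits (2.5 bytes) per token, versus 1 byte per ASCII character.
    print("token bytes:", len(tokens) * 2.5, "  ascii bytes:", len(text))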