Chapter 7 Special Section Focus on Data Compression.

29
Chapter 7 Special Section Focus on Data Compression

Transcript of Chapter 7 Special Section Focus on Data Compression.

Page 1: Chapter 7 Special Section Focus on Data Compression.

Chapter 7 Special Section

Focus on Data Compression

Page 2: Chapter 7 Special Section Focus on Data Compression.

2

7A Objectives

• Understand the essential ideas underlying data compression.

• Become familiar with the different types of compression algorithm.

• Be able to describe the most popular data compression algorithms in use today and know the applications for which each is suitable.

Page 3: Chapter 7 Special Section Focus on Data Compression.

3

• Data compression is important to storage systems because it allows more bytes to be packed into a given storage medium than when the data is uncompressed.

• Some storage devices (notably tape) compress data automatically as it is written, resulting in less tape consumption and significantly faster backup operations.

• Compression also reduces file transfer time, saving time and communications bandwidth.

7A.1 Introduction

Page 4: Chapter 7 Special Section Focus on Data Compression.

4

• A good metric for compression is the compression factor (or compression ratio) given by:

• If we have a 100KB file that we compress to 40KB, we have a compression factor of:

7A.1 Introduction

Page 5: Chapter 7 Special Section Focus on Data Compression.

5

• Compression is achieved by removing data redundancy while preserving information content.

• The information content of a group of bytes (a message) is its entropy. – Data with low entropy permit a larger compression ratio than

data with high entropy.• Entropy, H, is a function of symbol frequency. It is the

weighted average of the number of bits required to encode the symbols of a message:

H= -P(x) log2P(xi)

7A.1 Introduction

Page 6: Chapter 7 Special Section Focus on Data Compression.

6

• The entropy of the entire message is the sum of the individual symbol entropies.

-P(x) log2P(xi)

• The average redundancy for each character in a

message of length l is given by:

P(x) li - -P(x) log2P(xi)

7A.1 Introduction

Page 7: Chapter 7 Special Section Focus on Data Compression.

7

• Consider the message: HELLO WORLD!– The letter L has a probability of 3/12 = 1/4 of appearing in

this message. The number of bits required to encode this symbol is -log2(1/4) = 2.

• Using our formula, -P(x) log2P(xi), the average

entropy of the entire message is 3.022. – This means that the theoretical minimum number of bits per

character is 3.022.

• Theoretically, the message could be sent using only 37 bits. (3.022 12 = 36.26)

7A.1 Introduction

Page 8: Chapter 7 Special Section Focus on Data Compression.

8

• The entropy metric just described forms the basis for statistical data compression.

• Two widely-used statistical coding algorithms are Huffman coding and arithmetic coding.

• Huffman coding builds a binary tree from the letter frequencies in the message.– The binary symbols for each character are read directly

from the tree.• Symbols with the highest frequencies end up at the

top of the tree, and result in the shortest codes.

7A.2 Statistical Coding

Page 9: Chapter 7 Special Section Focus on Data Compression.

9

• The process of building the tree begins by counting the occurrences of each symbol in the text to be encoded.

HIGGLETY PIGGLTY POP

THE DOG HAS EATEN THE MOP

THE PIGS IN A HURRY

THE CATS IN A FLURRY

HIGGLETY PIGGLTY POP

7A.2 Statistical Coding

Page 10: Chapter 7 Special Section Focus on Data Compression.

10

• Next, place the letters and their frequencies into a forest of trees that each have two nodes: one for the letter, and one for its frequency.

7A.2 Statistical Coding

Page 11: Chapter 7 Special Section Focus on Data Compression.

11

• We start building the tree by joining the nodes having the two lowest frequencies.

7A.2 Statistical Coding

Page 12: Chapter 7 Special Section Focus on Data Compression.

12

• And then we again join the nodes with two lowest frequencies.

7A.2 Statistical Coding

Page 13: Chapter 7 Special Section Focus on Data Compression.

13

• And again ....

7A.2 Statistical Coding

Page 14: Chapter 7 Special Section Focus on Data Compression.

14

• Here is our finished tree.

7A.2 Statistical Coding

Page 15: Chapter 7 Special Section Focus on Data Compression.

15

This is the code derived from this tree.

7A.2 Statistical Coding

Page 16: Chapter 7 Special Section Focus on Data Compression.

16

• The second type of statistical coding, arithmetic coding, partitions the real number interval between 0 and 1 into segments according to symbol probabilities.– An abbreviated algorithm for this process is given in the

text.• Arithmetic coding is computationally intensive and

it runs the risk of causing divide underflow.• Variations in floating-point representation among

various systems can also cause the terminal condition (a zero value) to be missed.

7A.2 Statistical Coding

Page 17: Chapter 7 Special Section Focus on Data Compression.

17

• For most data, statistical coding methods offer excellent compression ratios.

• Their main disadvantage is that they require two passes over the data to be encoded. – The first pass calculates probabilities, the second encodes

the message.

• This approach is unacceptably slow for storage systems, where data must be read, written, and compressed within one pass over a file.

7A.2 Statistical Coding

Page 18: Chapter 7 Special Section Focus on Data Compression.

18

• Ziv-Lempel (LZ) dictionary systems solve the two-pass problem by using values in the data as a dictionary to encode itself.

• The LZ77 compression algorithm employs a text window in conjunction with a lookahead buffer. – The text window serves as the dictionary. If text is found in

the lookahead buffer that matches text in the dictionary, the location and length of the text in the window is output.

7A.3 LZ Dictionary Systems

Page 19: Chapter 7 Special Section Focus on Data Compression.

19

• The LZ77 implementations include PKZIP and IBM’s RAMAC RVA 2 Turbo disk array.– The simplicity of LZ77 lends itself well to a hardware

implementation.

• LZ78 is another dictionary coding system. • It removes the LZ77 constraint of a fixed-size

window. Instead, it creates a trie as the data is read.• Where LZ77 uses pointers to locations in a

dictionary, LZ78 uses pointers to nodes in the trie.

7A.3 LZ Dictionary Systems

Page 20: Chapter 7 Special Section Focus on Data Compression.

20

• GIF compression is a variant of LZ78, called LZW, for Lempel-Ziv-Welsh.

• It improves upon LZ78 through its efficient management of the size of the trie.

• Terry Welsh, the designer of LZW, was employed by the Unisys Corporation when he created the algorithm, and Unisys subsequently patented it.

• Owing to royalty disputes, development of another algorithm PNG, was hastened.

7A.4 GIF and PNG Compression

Page 21: Chapter 7 Special Section Focus on Data Compression.

21

• PNG employs two types of compression, first a Huffman algorithm is applied, which is followed by LZ77 compression.

• The advantage that GIF holds over PNG, is that GIF supports multiple images in one file.

• MNG is an extension of PNG that supports multiple images in one file.

• GIF, PNG, and MNG are primarily used for graphics compression. To compress larger, photographic images, JEPG is often more suitable.

7A.4 GIF and PNG Compression

Page 22: Chapter 7 Special Section Focus on Data Compression.

22

• Photographic images incorporate a great deal of information. However, much of that information can be lost without objectionable deterioration in image quality.

• With this in mind, JPEG allows user-selectable image quality, but even at the “best” quality levels, JPEG makes an image file smaller owing to its multiple-step compression algorithm.

• It’s important to remember that JPEG is lossy, even at the highest quality setting. It should be used only when the loss can be tolerated.

The JPEG algorithm is illustrated on the next slide.

7A.5 JPEG Compression

Page 23: Chapter 7 Special Section Focus on Data Compression.

23

7A.5 JPEG Compression

Page 24: Chapter 7 Special Section Focus on Data Compression.

24

7A.6 MP3 Compression

• MP3 provides a good example of the tremendous power of applied mathematics and

algorithms.

– MP3 singularly changed the way that the world buys and enjoys music.

• The algorithm is not new: Karlheinz Brandenburg formulated the basis for MP3 in 1987 while a

doctoral student at Erlangen-Nuremberg University.

• A refined version was adopted by the Moving Picture Experts Group (MPEG) for the audio

component of movies.

Page 25: Chapter 7 Special Section Focus on Data Compression.

25

7A.6 MP3 Compression• MP3 is one of several related audio compression standards.

• MP3 is officially known as MPEG-1 Audio Layer III or MPEG-1 Part 3.

• Adopted by ISO as ISO/IEC 11172-3-1993.

• MP3 takes as its input a pulse-code modulation (PCM) stream sampled at 44.1kHz.

• Uncompressed, a 3-minute song would require 32MB of storage!

The next slide illustrates PCM encoding.

Page 26: Chapter 7 Special Section Focus on Data Compression.

26

7A.6 MP3 Compression

As the sampling rate

increases, fidelity improves,

but much more data is

created.

Page 27: Chapter 7 Special Section Focus on Data Compression.

27

• MP3 relies on psychoacoustic coding: the use of imperfections in human hearing.– There is no point in encoding sounds humans

can’t hear.• For the same reason, sounds at the margins of

hearing can tolerate more lossiness than those in the mid-ranges.

• Bandpass filterbanks allow certain frequencies and omit others. Scalefactor bands determine where lossiness can be tolerated in the output.

7A.6 MP3 Compression

The next slide illustrates the MP3 encoding process.

Page 28: Chapter 7 Special Section Focus on Data Compression.

28

7A.6 MP3 Compression

Note: The MP3 standard specifies only the algorithm. The

implementation is left to the hardware and software

companies.

Page 29: Chapter 7 Special Section Focus on Data Compression.

29

• Two approaches to data compression are statistical data compression and dictionary systems.

• Statistical coding requires two passes over the input, dictionary systems require only one.

• LZ77 and LZ78 are two popular dictionary systems.

• GIF, PNG, MNG, and JPEG are used for image compression; MP3 (and others) for audio.

• JPEG is lossy, so its use is not suited for all types of images.

Section 7A Conclusion