Lecture 1 introduction (Data Compression)

October 12, 2015
[email protected]

Transcript of Lecture 1 introduction (Data Compression)

Page 1: Lecture 1 introduction (Data Compression)


Page 2: Lecture 1 introduction (Data Compression)

Contents

Introduction
What is information?
Define data compression
History of compression technologies
Why do we still need compression?
Why is it possible to compress data?
Lossless vs. lossy compression
Compression performance
Intuitive compression
How many bits per symbol?
Information theory
Entropy

Page 3: Lecture 1 introduction (Data Compression)


Helpful Knowledge

• Algorithm Design and Analysis

• Probability

[email protected]

Resources

Textbooks

• Khalid Sayood, Introduction to Data Compression, Fourth Edition, Morgan Kaufmann Publishers, 2012.

• David Salomon, Data Compression: The Complete Reference, Fourth Edition, Springer-Verlag London Limited, 2007.

Papers and Sections from Books

Page 4: Lecture 1 introduction (Data Compression)

Introduction

Data can be characters in a text file, numbers that are samples of speech or image waveforms, or sequences of numbers that are generated by other processes.

Examples of the kinds of data you typically want to compress include:

• text
• source code
• arbitrary files
• images
• video
• audio data
• speech

Obviously these data are fairly different in terms of data volume, data structure, intended usage, etc.

Page 5: Lecture 1 introduction (Data Compression)

Introduction

Representation of data is a combination of information and redundancy.

Information is the portion of data that must be preserved permanently in its original form in order to correctly interpret the meaning or purpose of the data.

Redundancy is that portion of data that can be removed when it is not needed, or can be reinserted to interpret the data when needed. Most often, the redundancy is reinserted in order to regenerate the original data in its original form.

Page 6: Lecture 1 introduction (Data Compression)

What is Information?

o Analog data
  – Also called continuous data
  – Represented by real numbers (or complex numbers)

o Digital data
  – Finite set of symbols {a1, a2, ..., am}
  – All data represented as sequences (strings) in the symbol set
  – Example: {a, b, c, d, r} → abracadabra
  – Digital data can be an approximation to analog data

Page 7: Lecture 1 introduction (Data Compression)

Symbols

o Roman alphabet plus punctuation

o ASCII – 128 symbols (256 in extended ASCII)

o Binary – {0, 1}
  – 0 and 1 are called bits
  – All digital information can be represented efficiently in binary
  – {a, b, c, d} has a fixed-length representation of 2 bits per symbol (see the sketch below)

Page 8: Lecture 1 introduction (Data Compression)

Define Data Compression

Data compression is essentially a redundancy reduction technique.

Data compression is the art of reducing the number of bits needed to store or transmit data.

Data compression is the process of converting an input data stream (the source stream, or the original raw data) into another data stream (the output, the bit-stream, or the compressed stream) that has a smaller size.

Page 9: Lecture 1 introduction (Data Compression)

History of compression technologies

• 1st century B.C.: steganography.

• 19th century: Morse and Braille alphabets.

• 1950s: compression technologies exploiting statistical redundancy are developed – bit patterns of varying length are used to represent individual symbols according to their relative frequency.

• 1970s: dictionary algorithms are developed – symbol sequences are mapped to shorter indices using dictionaries.

• 1970s: with the ongoing digitization of telephone lines, telecommunication companies became interested in procedures for getting more channels onto a single wire.

• Early 1980s: fax transmission over analog telephone lines.

• 1980s: the first applications involving digital images appear on the market; the "digital revolution" starts with compressing audio data.

• 1990s: video broadcasting, video on demand, etc.

Page 10: Lecture 1 introduction (Data Compression)

Why do we still need compression?

The reason we need data compression is that more and more of the information that we generate and use is in digital form – consisting of numbers represented by bytes of data.

Compression technology is employed:

• to use storage space efficiently,
• to save on transmission capacity,
• to save on transmission time,
• to reduce computation.

Basically, it is all about saving resources and money.

Page 11: Lecture 1 introduction (Data Compression)

Why is it possible to compress data?

Compression is enabled by statistical and other properties of most data types; however, data types exist which cannot be compressed, e.g. various kinds of noise or encrypted data. Compression-enabling properties are:

• Statistical redundancy: in non-compressed data, all symbols are represented with the same number of bits independent of their relative frequency (fixed-length representation).

• Correlation: adjacent data samples tend to be equal or similar (e.g. think of images or video data). There are different types of correlation:
  – positive correlation
  – negative correlation
  – perfect correlation

In addition, in many data types there is a significant amount of irrelevancy, since the human brain is not able to process and/or perceive the entire amount of data. As a consequence, such data can be omitted without degrading perception.

Furthermore, some data contain more abstract properties which are independent of time, location, and resolution and can be described very efficiently (e.g. fractal properties).

Page 12: Lecture 1 introduction (Data Compression)


• A digital compression system requires two algorithms: compression of data at the source (encoding), and decompression at the destination (decoding).

• For stored multimedia data, compression is usually done once at storage time at the server, and decoding is done in real time upon viewing.

Page 13: Lecture 1 introduction (Data Compression)


Data Compression Methods

Data compression is about storing and sending a smaller number of bits.

There are two major categories of methods to compress data: lossless and lossy methods.

Page 14: Lecture 1 introduction (Data Compression)


Lossless vs. lossy compression

Lossy methods are used for compressing images and video files (our eyes cannot distinguish subtle changes, so lossy data is acceptable). These methods are cheaper and require less time and space.

Page 15: Lecture 1 introduction (Data Compression)


Lossless compression techniques, as their name implies, involve no loss of information. If data have been losslessly compressed, the original data can be recovered exactly from the compressed data. Lossless compression is generally used for applications that cannot tolerate any difference between the original and reconstructed data.

In lossless methods, the original data and the data after compression and decompression are exactly the same. Redundant data is removed during compression and added back during decompression.

Lossless methods are used when we cannot afford to lose any data: legal and medical documents, computer programs.

Page 16: Lecture 1 introduction (Data Compression)


Let X denote the original data and X̂ the reconstruction after decompression.

• Lossless compression: X̂ = X
  – Also called entropy coding or reversible coding.

• Lossy compression: X̂ ≠ X
  – Also called irreversible coding.

Page 17: Lecture 1 introduction (Data Compression)


Compression performance

A very logical way of measuring how well a compression algorithm compresses a given set of data is to look at the ratio of the number of bits required to represent the data before compression to the number of bits required to represent the data after compression.

Another way of reporting compression performance is to provide the average number of bits required to represent a single sample. This is generally referred to as the rate.

Compression ratio = (size of the output stream) / (size of the input stream)

Compression factor = (size of the input stream) / (size of the output stream)

Page 18: Lecture 1 introduction (Data Compression)


Speed: when evaluating data compression algorithms, speed is always stated in terms of uncompressed data handled per second. For streaming audio and video, decompression must at least keep up with the real-time playback rate.

Energy: there has been little research on the amount of energy used by compression algorithms. In some sensor networks, the purpose of compression is to save energy: by spending a little CPU energy compressing the data so that there are fewer bytes to transmit, we save energy in the radio – it can be turned on less often, for shorter periods of time, or both.

Page 19: Lecture 1 introduction (Data Compression)


Latency

Latency refers to a short period of delay (usually measured in milliseconds) between when an audio signal enters a system and when it emerges from it. Compression adds two kinds of latency – compression latency and decompression latency – both of which add to the end-to-end latency.

Space: sometimes a programmer needs to know how much RAM the algorithm needs to run.

Page 20: Lecture 1 introduction (Data Compression)

Intuitive Compression

Braille code

The Braille code consists of groups (or cells) of 3 × 2 dots each, embossed on thick paper. Each of the 6 dots in a group may be flat or raised, implying that the information content of a group is equivalent to 6 bits, resulting in 64 possible groups.

Page 21: Lecture 1 introduction (Data Compression)


Irreversible Text Compression

Sometimes it is acceptable to "compress" text by simply throwing away some information. This is called irreversible text compression, or compaction. The decompressed text will not be identical to the original, so such methods are not general purpose; they can only be used in special cases.

Page 22: Lecture 1 introduction (Data Compression)


Ad Hoc Text Compression

Here are some simple, intuitive ideas for cases where the compression must be reversible (lossless).

If the text contains many spaces but they are not clustered, they may be removed and their positions indicated by a bit-string that contains a 0 for each text character that is not a space and a 1 for each space. Thus, the text

Here are some ideas

is encoded as the bit-string

0000100010000100000

followed by the text

Herearesomeideas

Page 23: Lecture 1 introduction (Data Compression)


Packing

Since ASCII codes are essentially 7 bits long, the text may be compressed by writing 7 bits per character instead of 8 on the output stream. This may be called packing. The compression ratio is, of course, 7/8 = 0.875.

Dictionary data (or any list sorted lexicographically) can be compressed using the concept of front compression. This is based on the observation that adjacent words in such a list tend to share some of their initial characters. A word can therefore be compressed by dropping the n characters it shares with its predecessor in the list and replacing them with n.

a            a
aardvark     1ardvark
aback        1back
abaft        3ft
abandon      3ndon
abandoning   7ing
abasement    3sement
abandonment  3ndonment
abash        3sh
abated       3ted
abate        5

Page 24: Lecture 1 introduction (Data Compression)

How many Bits Per Symbol?

• Suppose we have n symbols. How many bits (as a function of n) are needed to represent a symbol in binary?
  – First try n a power of 2.

Discussion: Non-Powers of Two

• Can we do better than a fixed-length representation for non-powers of two? (See the worked note below.)
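As a worked note (my own summary, not from the slides): a fixed-length binary code for an alphabet of n symbols needs

b(n) = \lceil \log_2 n \rceil \quad\text{bits, e.g.}\quad b(4) = 2,\; b(5) = \lceil 2.32 \rceil = 3,\; b(26) = \lceil 4.70 \rceil = 5 .

For n that is not a power of two, a fixed-length code therefore wastes a fraction of a bit on every symbol, which is one motivation for the variable-length codes suggested by the entropy arguments below.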

Page 25: Lecture 1 introduction (Data Compression)

Information Theory

• Developed by Shannon in the 1940s and 1950s.

• Attempts to explain the limits of communication using probability theory.

Information theory uses the term entropy as a measure of how much information is encoded in a message. The word entropy was borrowed from thermodynamics, and it has a similar meaning: the higher the entropy of a message, the more information it contains. The entropy of a symbol is defined as the negative logarithm of its probability. To determine the information content of a message in bits, we express the entropy using the base-2 logarithm (base e or base 10 can also be used); for a symbol x with probability P(x), the information content is −log₂ P(x).

Page 26: Lecture 1 introduction (Data Compression)

Entropy

The entropy of the message (flow of information) is its amount of uncertainty; it increases when the message is closer to random, and decreases when it is less random.

We define the entropy of a random variable X, taking values in the alphabet 𝒳 with probabilities P(x), as

H(X) = − Σ_{x ∈ 𝒳} P(x) log₂ P(x)

• The base-2 logarithm measures the entropy in bits. The intuition is that entropy describes the "compressibility" of the source.

• H is the average number of bits required to code a symbol, given that all we know is the probability distribution of the symbols.

• H is the Shannon lower bound on the average number of bits needed to code a symbol in this "source model".
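A minimal Python sketch of this definition (my own illustration, not from the slides): the entropy in bits of a probability distribution, skipping zero-probability symbols.

from math import log2

def entropy(probs):
    # H = -sum p * log2(p) over symbols with p > 0
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))       # 1.0 bit  (fair coin)
print(entropy([1.0]))            # 0.0 bits (no uncertainty)
print(entropy([1/8, 1/4, 5/8]))  # ~1.3 bits (the {a, b, c} example below)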

Page 27: Lecture 1 introduction (Data Compression)


The entropy H(Fi) of any particular letter Fi in a file F is

H(Fi) = − log₂ P(Fi)

where P(Fi) is the probability (relative frequency) of that letter. (This is the number of bits required to represent that letter using an entropy coder.)

The entropy H(F) of the entire file is the sum of the entropies of the letters in the file,

H(F) = Σi H(Fi)

(the number of bits required to represent the entire file is the sum of the number of bits required to represent each letter in that file).

Entropy is a measure of unpredictability. Understanding entropy not only helps you understand data compression, but can also help you choose good passwords and avoid easily guessed passwords.
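A minimal Python sketch of these two formulas (my own illustration, not from the slides): sum −log₂ P(letter) over every letter occurrence in a byte string.

from collections import Counter
from math import log2

def file_entropy_bits(data: bytes) -> float:
    counts = Counter(data)
    n = len(data)
    # each of the c occurrences of a letter costs -log2(c / n) bits
    return sum(-c * log2(c / n) for c in counts.values())

print(file_entropy_bits(b"abracadabra"))  # ~22.4 bits for this 11-byte string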

Page 28: Lecture 1 introduction (Data Compression)


In terms of the number n of unique possible letters (any particular letter in the file is one of a list of n possible letters x1, …, xn, any of which may occur 0, 1, or possibly N times), the same sum can be grouped by letter:

H(F) = − Σ_{i=1..n} count(xi) · log₂ P(xi)

Page 29: Lecture 1 introduction (Data Compression)


Example 1. Let X be uniform on {1, 2, …, 16}. Intuitively, we need 4 bits to represent the values of X. The entropy is H(X) = log₂ 16 = 4 bits.

Example 2. 8 horses in a race with winning probabilities.

Example 3. {a, b, c} with P(a) = 1/8, P(b) = 1/4, P(c) = 5/8:

– inf(a) = log₂(8) = 3
– inf(b) = log₂(4) = 2
– inf(c) = log₂(8/5) ≈ 0.678

• Receiving an "a" carries more information than receiving a "b" or a "c".

Page 30: Lecture 1 introduction (Data Compression)


Theorem (Source Coding Theorem). Roughly speaking, H(X) is the minimum rate at which we can compress a random variable X and recover it fully.

Data with low entropy permit a larger compression ratio than data with high entropy.

• Consider the message: HELLO WORLD!
  – The letter L has a probability of 3/12 = 1/4 of appearing in this message. The number of bits required to encode this symbol is −log₂(1/4) = 2.

• Using our formula, H = −Σ P(xi) log₂ P(xi), the average entropy of the entire message is 3.022 bits per character.
  – This means that the theoretical minimum number of bits per character is 3.022.

• Theoretically, the message could be sent using only 37 bits (3.022 × 12 = 36.26).
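A quick Python check of these numbers (my own, not from the slides):

from collections import Counter
from math import log2

msg = "HELLO WORLD!"
counts = Counter(msg)
probs = [c / len(msg) for c in counts.values()]

h = -sum(p * log2(p) for p in probs)
print(round(h, 3))                     # 3.022 bits per character
print(round(h * len(msg), 2))          # 36.26 bits for the whole message
print(-log2(counts["L"] / len(msg)))   # 2.0 bits for the letter L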
