Information Theory & Source Coding

UCCN2043 – Lecture Notes

2.0 Information Theory and Source Coding

Claude Shannon laid the foundation of information theory in 1948. His paper "A Mathematical Theory of Communication", published in the Bell System Technical Journal, is the basis for the telecommunications developments that have taken place over the last five decades.

A good understanding of the concepts proposed by Shannon is a must for every budding telecommunications professional. In this chapter, Shannon's contributions to the field of modern communications will be studied.

2.1 Requirements of a Communication System

In any communication system, there is an information source that produces information in some form, and an information sink that absorbs it. The communication medium connects the source and the sink. The purpose of a communication system is to transmit the information from the source to the sink without errors.

However, the communication medium always introduces some errors because of noise. The fundamental requirement of a communication system is therefore to transmit the information without errors in spite of the noise.

2.1.1 The Communication System

The block diagram of a generic communication system is shown in Figure 2.1. The information source produces symbols (such as English letters, speech, video, etc.) that are sent through the transmission medium by the transmitter. The communication medium introduces noise, and so errors are introduced in the transmitted data. At the receiving end, the receiver decodes the data and gives it to the information sink.

Figure 2.1: Generic communication system


As an example, consider an information source that produces two symbols A and B. The transmitter codes the data into a bit stream; for example, A can be coded as 1 and B as 0. The stream of 1s and 0s is transmitted through the medium. Because of noise, a 1 may become a 0 or a 0 may become a 1 at random places, as illustrated below:

Symbols produced:    A B B A A A B A B A
Bit stream produced: 1 0 0 1 1 1 0 1 0 1
Bit stream received: 1 0 0 1 1 1 1 1 0 1

At the receiver, one bit is received in error. How can we ensure that the received data is made error free? Shannon provides the answer. The communication system given in Figure 2.1 can be expanded as shown in Figure 2.2.

Figure 2.2: Generic communication system as proposed by Shannon.

As proposed by Shannon, the communication system consists of a source encoder, channel encoder and modulator at the transmitting end, and a demodulator, channel decoder and source decoder at the receiving end.

In the block diagram shown in Figure 2.2, the information source produces the symbols, which are coded using two types of coding (source encoding and channel encoding) and then modulated and sent over the medium.

At the receiving end, the modulated signal is demodulated, and the inverse operations of channel encoding and source encoding (channel decoding and source decoding) are performed. The information is then presented to the information sink. Each block is explained below.

Information source: The information source produces the symbols. If the information source is, for example, a microphone, the signal is in analog form. If the source is a computer, the signal is in digital form (a set of symbols).


Source encoder: The source encoder converts the signal produced by the information source into a data stream. If the input signal is analog, it can be converted into digital form using an analog-to-digital converter. If the input to the source encoder is a stream of symbols, it can be converted into a stream of 1s and 0s using some type of coding mechanism. For instance, if the source produces the symbols A and B, A can be coded as 1 and B as 0. Shannon's source coding theorem tells us how to do this coding efficiently.

Source encoding is done to reduce the redundancy in the signal. Source coding techniques can be divided into lossless encoding techniques and lossy encoding techniques. In lossless coding, no information is lost: when we compress our computer files using a compression utility (for instance, WinZip), there is no loss of information. In lossy coding, some information is lost while doing the source coding; as long as the loss is not significant, we can tolerate it. When an image is converted into JPEG format, the coding is lossy because some information is lost. Most of the techniques used for voice, image, and video coding are lossy coding techniques.

Note: The compression utilities we use to compress data files use lossless encoding techniques. JPEG image compression is a lossy technique because some information is lost.

Channel encoder: If we have to decode the information correctly even when errors are introduced in the medium, we need to add some additional bits to the source-encoded data so that this additional information can be used to detect and correct the errors. This process of adding bits is done by the channel encoder. Shannon's channel coding theorem tells us how to achieve this.

Modulation: Modulation is the process of transforming the signal so that it can be transmitted through the medium. We will discuss the details of modulation in a later chapter.

Demodulator: The demodulator performs the inverse operation of the modulator.

Channel decoder: The channel decoder analyzes the received bit stream and detects and corrects the errors, if any, using the additional data introduced by the channel encoder.

Source decoder: The source decoder converts the bit stream back into the actual information. If analog-to-digital conversion is done at the source encoder, digital-to-analog conversion is done at the source decoder. If the symbols are coded into 1s and 0s at the source encoder, the bit stream is converted back into the symbols by the source decoder.

Information sink: The information sink absorbs the information.

The block diagram given in Figure 2.2 is the most important diagram for all communication engineers. We will devote separate chapters to each of the blocks in this diagram.
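
To make the division of labour between the blocks concrete, here is a minimal Python sketch of the Figure 2.2 chain. The function names (source_encode, channel_encode, noisy_channel, and so on) and the 5% bit-error probability are illustrative assumptions; modulation and demodulation are omitted, and channel coding is left as a pass-through placeholder until a real code is introduced in section 2.4.2.

import random

# Minimal sketch of the Figure 2.2 transmit/receive chain.
# Symbol mapping follows the text: A -> 1, B -> 0.

def source_encode(symbols):
    return [1 if s == "A" else 0 for s in symbols]

def channel_encode(bits):
    return list(bits)                     # placeholder: no redundancy added yet

def noisy_channel(bits, error_prob=0.05):
    return [b ^ 1 if random.random() < error_prob else b for b in bits]

def channel_decode(bits):
    return list(bits)                     # placeholder: nothing to correct yet

def source_decode(bits):
    return ["A" if b == 1 else "B" for b in bits]

message = list("ABBAAABABA")
received = source_decode(channel_decode(noisy_channel(channel_encode(source_encode(message)))))
print("sent:    ", "".join(message))
print("received:", "".join(received))     # may differ: no error correction yet

Because the channel-coding step adds no redundancy here, any bit flipped by the channel reaches the sink as a wrong symbol; the rest of the chapter is about closing exactly that gap.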


2.2 Entropy of an Information Source

What is information? How do we measure information? These are fundamental questions to which Shannon provided the answers. We can say that we have received some information if there is a "decrease in uncertainty."

Consider an information source that produces two symbols A and B. The source has sent A, B, B, A, and now we are waiting for the next symbol. Which symbol will it produce? If it produces A, the uncertainty that was there during the waiting period is gone, and we say that "information" has been produced. Note that we are using the term "information" from a communication theory point of view; it has nothing to do with the "usefulness" of the information.

Shannon proposed a formula to measure information. The information measure is called the entropy of the source. If a source produces N symbols, and all the symbols are equally likely to occur, the entropy of the source is given by

H = log2 N bits/symbol

For example, assume that a source produces the English letters (in this chapter, we will refer to the English letters A to Z and the space, totaling 27, as symbols), and that all these symbols are produced with equal probability. In such a case, the entropy is

H = log2 27 = 4.75 bits/symbol

The information source may not produce all the symbols with equal probability. For instance, in English the letter "E" has the highest frequency (and hence the highest probability of occurrence), and the other letters occur with different probabilities. In general, if a source produces the ith symbol with a probability of P(i), the entropy of the source is given by

H = − Σi P(i) log2 P(i)   bits/symbol

If a large body of English text is analyzed, and the probabilities of all symbols (or letters) are obtained and substituted into this formula, the entropy is

H = 4.07 bits/symbol

Note: Consider the following sentence: "I do not knw wheter this is undrstandble." In spite of the fact that a number of letters are missing in this sentence, you can make out what the sentence is. In other words, there is a lot of redundancy in English text.

This value is called the first-order approximation of the entropy of the information source.


In English, there is a dependence of one letter on the previous letter. For instance, the letter 'Q' is almost always followed by the letter 'U'. If we consider the probabilities of two symbols taken together (aa, ab, ac, ad, ... ba, bb, and so on), it is called the second-order approximation. So, in the second-order approximation, we have to consider the probabilities of 'digrams' (two symbols taken together). The second-order entropy of a source producing English letters can be worked out to be

H = 3.36 bits/symbol

The third-order entropy of a source producing English letters can be worked out to be

H = 2.77 bits/symbol

(that is, when the statistics of three-letter combinations are taken into account, each letter requires on average only 2.77 bits). As we consider higher orders, the entropy goes down.

As another example, consider a source that produces four symbols with probabilities of 1/2, 1/4, 1/8, and 1/8, where all symbols are independent of each other. The entropy of the source is

H = 1/2·log2 2 + 1/4·log2 4 + 1/8·log2 8 + 1/8·log2 8 = 0.5 + 0.5 + 0.375 + 0.375 = 7/4 bits/symbol.
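
As a quick check of the entropy formula, the short Python sketch below (the helper name entropy is an illustrative choice) reproduces the two values worked out above.

import math

def entropy(probabilities):
    """First-order entropy in bits/symbol: H = -sum(p * log2 p)."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Four-symbol source from the text: expected H = 7/4 = 1.75 bits/symbol
print(entropy([1/2, 1/4, 1/8, 1/8]))     # 1.75

# 27 equally likely symbols (A-Z plus space): expected H = log2(27)
print(entropy([1/27] * 27))              # 4.754...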

2.3 Channel Capacity

Shannon introduced the concept of channel capacity, the limit at which data can be transmitted through a medium. The errors in the transmission medium depend on the energy of the signal, the energy of the noise, and the bandwidth of the channel.

Conceptually, if the bandwidth is high, we can pump more data into the channel. If the signal energy is high, the effect of noise is reduced. According to Shannon, the channel capacity, the bandwidth and the signal and noise powers are related by the formula

C = W log2(1 + S/N)

where

C is the channel capacity in bits per second (bps)
W is the bandwidth of the channel in Hz
S/N is the signal-to-noise power ratio (SNR). SNR is generally measured in dB using the formula

S/N (dB) = 10 log10 [ Signal Power (W) / Noise Power (W) ]

The value of the channel capacity obtained using this formula is the theoretical maximum. As an example, consider a voice-grade line for which W = 3100 Hz and SNR = 30 dB (i.e., the signal-to-noise power ratio is 1000:1):

C = 3100 × log2(1 + 1000) ≈ 3100 × log2(1000) ≈ 30,894 bps
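
A small Python sketch of this calculation, assuming a helper function shannon_capacity of our own naming:

import math

def shannon_capacity(bandwidth_hz, snr_db):
    """Shannon channel capacity C = W * log2(1 + S/N), with SNR given in dB."""
    snr_linear = 10 ** (snr_db / 10)      # convert dB back to a power ratio
    return bandwidth_hz * math.log2(1 + snr_linear)

# Voice-grade line from the text: W = 3100 Hz, SNR = 30 dB (ratio 1000:1)
print(shannon_capacity(3100, 30))         # ~30,898 bps; the 30,894 bps quoted
                                          # above uses the approximation log2(1000)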


So, we cannot transmit data at a rate faster than this value over a voice-grade line. An important point to note is that in the above formula Shannon assumes only thermal noise. To increase C, can we simply increase W? No, because increasing the bandwidth admits more noise as well, and the SNR is reduced. To increase C, can we simply raise the signal power to improve the SNR? No, because higher signal power introduces additional noise, called inter-modulation noise. The entropy of an information source and the channel capacity are the two important concepts on which Shannon based his theorems.

2.4 Shannon’s Theorems

In a digital communication system, the aim of the designer is to convert any information into a digital signal, pass it through the transmission medium and, at the receiving end, reproduce the digital signal exactly. To achieve this objective, there are two important requirements:

1. To code any type of information into digital format. Note that the world is analog: voice signals are analog, images are analog. We need to devise mechanisms to convert analog signals into digital format. If the source produces symbols (such as A, B), we also need to convert these symbols into a bit stream. This coding has to be done efficiently, so that the smallest number of bits is required.

2. To ensure that the data sent over the channel is not corrupted. We cannot eliminate the noise introduced on the channel, and hence we need to introduce special coding techniques to overcome the effect of noise.

These two aspects were addressed by Claude Shannon in his classic paper "A Mathematical Theory of Communication", published in 1948 in the Bell System Technical Journal, which laid the foundation of information theory. Shannon addressed them through his source coding theorem and channel coding theorem.

Shannon's source coding theorem addresses how the symbols produced by a source can be encoded efficiently. Shannon's channel coding theorem addresses how to encode the data to overcome the effect of noise.

2.4.1 Source Coding Theorem

The source coding theorem states that "the number of bits required to uniquely describe an information source can be approximated to the information content as closely as desired."

Again consider the source that produces the English letters. The information content, or entropy, is 4.07 bits/symbol. According to Shannon's source coding theorem, the symbols can be coded in such a way that, on average, 4.07 bits are required per symbol. But what should the coding technique be? Shannon does not tell us!

Shannon's theory only puts a limit on the minimum number of bits required. This is a very important limit; communication engineers have struggled to achieve it for the last 50 years.


Consider a source that produces two symbols A and B with equal probability.

Symbol  Probability  Code Word
A       0.5          1
B       0.5          0

The two symbols can be coded as above: A is represented by 1 and B by 0. We require 1 bit/symbol.

Now consider a source that produces the same two symbols, but instead of coding A and B directly, we code pairs of symbols: AA, AB, BA and BB. Suppose these pairs occur with the probabilities shown below, with the associated code words:

Symbol  Probability  Code Word
AA      0.45         0
AB      0.45         10
BA      0.05         110
BB      0.05         111

The strategy in assigning the code words is that symbols with high probability are given short code words and symbols with low probability are given long code words.

Note: Assigning short code words to high-probability symbols and long code words to low-probability symbols results in efficient coding.

In this case, the average number of bits required per symbol can be calculated using the formula

L = Σi P(i) L(i)   bits/symbol

where

P(i) = probability of the ith code word
L(i) = length of the ith code word

For this example,

L = (1 × 0.45 + 2 × 0.45 + 3 × 0.05 + 3 × 0.05) = 1.65 bits/symbol

The entropy of this source can be calculated to be 1.469 bits/symbol.

So, if the source produces the symbols in the following sequence:

A A B A B A A B B B

then, coding the pairs AA, BA, BA, AB, BB, source coding gives the bit stream

0 110 110 10 111


This encoding scheme, on average, requires 1.65 bits/symbol. If we code the symbols directly, without taking the probabilities into consideration, the coding scheme would be

AA 00
AB 01
BA 10
BB 11

and we would require 2 bits/symbol. The encoding mechanism that takes the probabilities into consideration is therefore the better coding technique. The theoretical limit on the number of bits/symbol is the entropy, which is 1.469 bits/symbol. The entropy of the source also determines the minimum channel capacity needed to carry its output.
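
The following short Python sketch (variable names are illustrative) verifies these figures for the pair code above: the average codeword length of 1.65 bits/symbol and the entropy of 1.469 bits/symbol.

import math

pair_code = {"AA": "0", "AB": "10", "BA": "110", "BB": "111"}
pair_prob = {"AA": 0.45, "AB": 0.45, "BA": 0.05, "BB": 0.05}

# Average codeword length L = sum(P(i) * L(i))
avg_length = sum(pair_prob[s] * len(pair_code[s]) for s in pair_code)
print(avg_length)                         # 1.65 bits/symbol

# Entropy H = -sum(P(i) * log2 P(i)) of the pair probabilities
entropy = -sum(p * math.log2(p) for p in pair_prob.values())
print(round(entropy, 3))                  # 1.469 bits/symbol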

As we consider higher-order entropies, we can reduce the bits/symbol further and perhaps approach the limit set by Shannon.

Based on this theory, it is estimated that English text cannot be compressed to less than about 1.5 bits/symbol, even with sophisticated coders and decoders.

This theorem provides the basis for coding information (text, voice, video) into the minimum possible number of bits for transmission over a channel. More details of source coding are covered in section 2.5.

The source coding theorem states that "the number of bits required to uniquely describe an information source can be approximated to the information content as closely as desired."

2.4.2 Channel Coding Theorem

Shannon's channel coding theorem states that "the error rate of data transmitted over a bandwidth-limited noisy channel can be reduced to an arbitrarily small amount if the information rate is lower than the channel capacity."

This theorem is the basis for error-correcting codes, using which we can achieve error-free transmission. Again, Shannon only specified that by using 'good' coding mechanisms we can achieve error-free transmission; he did not specify what the coding mechanism should be! According to Shannon, channel coding may introduce additional delay in transmission but, using appropriate coding techniques, we can overcome the effect of channel noise.


Consider the example of a source producing the symbols A and B, where A is coded as 1 and B as 0.

Symbols produced: A B B A B
Bit stream:       1 0 0 1 0

Now, instead of transmitting this bit stream directly, we can transmit the bit stream

111 000 000 111 000

that is, we repeat each bit three times. Now, let us assume that the received bit stream is

101 000 010 111 000

Two errors were introduced in the channel. But we can still decode the data correctly at the receiver, because the receiver knows that each bit is transmitted three times, and so the second bit should be 1 and the eighth bit should be 0. This is error correction. This code is called a rate 1/3 error-correcting code. Codes that can correct errors in this way are called Forward Error Correcting (FEC) codes.
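
A minimal Python sketch of this rate 1/3 repetition code, applied to the received stream above (function names are illustrative):

def repeat_encode(bits, n=3):
    """Rate 1/n repetition code: transmit each bit n times."""
    return [b for b in bits for _ in range(n)]

def majority_decode(bits, n=3):
    """Decode by majority vote over each group of n received bits."""
    return [1 if sum(bits[i:i + n]) > n // 2 else 0 for i in range(0, len(bits), n)]

sent = repeat_encode([1, 0, 0, 1, 0])
print(sent)                               # [1,1,1, 0,0,0, 0,0,0, 1,1,1, 0,0,0]

received = [1,0,1, 0,0,0, 0,1,0, 1,1,1, 0,0,0]   # the received stream above (two bits flipped)
print(majority_decode(received))          # [1, 0, 0, 1, 0] -> A B B A B

The majority vote corrects any single bit error within each group of three, which is why the two isolated errors above do not reach the sink.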

Ever since Shannon published his historic paper, there has been a tremendous amount of research into error-correcting codes. We will discuss error detection and correction in Lesson 5, "Error Detection and Correction".

For all these 50 years, communication engineers have struggled to achieve the theoretical limits set by Shannon, and they have made considerable progress. Take the case of the line modems we use for transmission of data over telephone lines. The evolution of line modems from V.26 (2400 bps data rate, 1200 Hz bandwidth), through V.27 (4800 bps, 1600 Hz) and V.32 (9600 bps, 2400 Hz), to V.34 (28,800 bps, 3400 Hz) indicates the progress in source coding and channel coding techniques built on Shannon's theory.

Shannon's channel coding theorem states that "the error rate of data transmitted over a bandwidth-limited noisy channel can be reduced to an arbitrarily small amount if the information rate is lower than the channel capacity."

Note: Source coding is used mainly to reduce the redundancy in the signal, whereas channel coding is used to introduce redundancy to overcome the effect of noise.


2.5 Source Coding (Ian Glover 9.1)

There may be a number of reasons for wishing to change the form of a digital signal, as supplied by an information source, prior to transmission. In the case of English language text, for example, we start with a data source consisting of about 40 distinct symbols (the letters of the alphabet, integers and punctuation). In principle, we could transmit such text using a signal alphabet consisting of 40 distinct voltage waveforms. This would constitute an M-ary system with M = 40 unique signals. It may be, however, that for one or more of the following reasons this approach is inconvenient, difficult or impossible:

• The transmission channel may be physically unsuited to carrying such a large number of distinct signals.
• The relative frequencies (chances of occurrence) with which different source symbols occur vary widely. This has the effect of making the transmission inefficient in terms of the time it takes and/or the bandwidth it requires.
• The data may need to be stored and/or processed in some way before transmission. This is most easily achieved using binary electronic devices as the storage and processing elements.

For all these reasons, sources of digital information are almost always converted as soon as possible into binary form, i.e. each symbol is encoded as a binary word.

After appropriate processing, the binary words may then be transmitted directly as either:

• baseband signals, or
• bandpass signals (after going through a modulation process)

or re-coded into another multi-symbol alphabet (in which case it is unlikely that the transmitted symbols map directly onto the original source symbols).

2.5.1 Variable Length Source Coding (Ian Glover 9.5)

We are generally interested in finding a more efficient code which represents the same information using fewer digits on average. This results in different lengths of codeword being used for different symbols. The problem with such variable length codes is in recognizing the start and end of the symbols.

2.5.2 Decoding Variable Length Codewords (Ian Glover 9.5.2)

The following properties need to be considered when attempting to decode variable length

codewords:.


(A) Unique Decoding

This is essential if the received message is to have only a single possible meaning. Consider an M = 4 symbol alphabet with symbols represented by binary digits as follows:

A = 0
B = 01
C = 11
D = 00

If we receive the bit sequence 0011, we cannot tell whether the transmission was D, C or A, A, C. This code is not, therefore, uniquely decodable.
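
A tiny illustrative check of this ambiguity in Python (the encode helper is hypothetical): two different messages produce the same received bits.

code = {"A": "0", "B": "01", "C": "11", "D": "00"}

def encode(symbols):
    return "".join(code[s] for s in symbols)

# Two different messages map to the same bit sequence, so decoding is ambiguous.
print(encode("DC"))     # 0011
print(encode("AAC"))    # 0011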

(B) Instantaneous Decoding

Now consider an M = 4 symbol alphabet with the following binary representation:

A = 0
B = 10
C = 110
D = 111

This code can be decoded instantaneously, using the decision tree shown in Figure 2.3 below, since no complete codeword is a prefix of a longer codeword.

Figure 2.3: Algorithm for decision-tree decoding and an example of a practical code tree.


This is in contrast to the previous example, where A is a prefix of both B and D. The present code is also a 'comma code', since the digit 0 indicates the end of a codeword, except for the all-ones word D, whose length is known. Note that, to ensure we achieve the desired decoding properties, we are restricted in the number of codewords available with small numbers of bits.

Using the representation:

A = 0
B = 01
C = 011
D = 111

the code is identical to the example just given but with the bits time-reversed. It is thus still uniquely decodable, but no longer instantaneous, since earlier codewords are now prefixes of later ones.
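
As an illustration of why the prefix-free property matters, the following Python sketch (the function name is an assumption of ours) decodes the comma code A = 0, B = 10, C = 110, D = 111 instantaneously, emitting a symbol as soon as the accumulated bits match a codeword:

prefix_code = {"0": "A", "10": "B", "110": "C", "111": "D"}

def prefix_decode(bits):
    """Instantaneous decoding: emit a symbol as soon as a codeword is recognized."""
    symbols, current = [], ""
    for bit in bits:
        current += bit
        if current in prefix_code:                 # no codeword is a prefix of another,
            symbols.append(prefix_code[current])   # so this match is always final
            current = ""
    return "".join(symbols)

print(prefix_decode("0101100111"))   # 0 | 10 | 110 | 0 | 111 -> "ABCAD"

The time-reversed code above cannot be decoded this way: after reading "0" the decoder cannot tell whether it has seen A or only the start of B or C.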

2.5.3 Variable Length Coding (Ian Glover 9.6)

Assume an M = 8 symbol source A, ..., H having the following probabilities of symbol occurrence:

m     A     B     C     D     E     F     G     H
P(m)  0.10  0.18  0.40  0.05  0.06  0.10  0.07  0.04

The source entropy is given by:

H = − Σm P(m) log2 P(m) = 2.55 bits/symbol

If the symbols are each allocated 3 bits, comprising all the binary patterns between 000 and 111, then, since the maximum entropy of an eight-symbol source is log2 8 = 3 bits/symbol, the source efficiency is:

η_source = 2.55 / 3 × 100% = 85%

Shannon–Fano coding, in which we allocate fewer bits to the regularly used or highly probable messages, as these are transmitted more often, is more efficient. The less probable messages can then be given the longer, less efficient bit patterns. This yields an improvement in efficiency compared with that obtained before source coding was applied.

The improvement is not as great, however, as that obtainable with another variable length coding scheme, namely Huffman coding.


Huffman Coding

The Huffman coding algorithm comprises two steps, reduction and splitting, which can be summarised by the following instructions:

(A) Reduction (referring to Figure 2.4)

1) List the symbols in descending order of probability.
2) Reduce the two least probable symbols to one symbol with probability equal to their combined probability.
3) Reorder the symbols in descending order of probability at each stage.
4) Repeat the reduction step until only two symbols remain.

(B) Splitting (referring to Figure 2.5)

1) Assign 0 and 1 to the two final symbols and work backwards.
2) Expand or lengthen the code to cope with each successive split and, at each stage, distinguish between the two split symbols by appending a further 0 and 1 respectively to the codewords.

The result of Huffman encoding (Figures 2.4 and 2.5) of the symbols A, ..., H in the previous example is to allocate the symbols codewords as follows:

m         C     B     A     F      G      E      D      H
P(m)      0.40  0.18  0.10  0.10   0.07   0.06   0.05   0.04
Codeword  1     001   011   0000   0100   0101   00010  00011

The average code length is now given as:

L = 1(0.40) + 3(0.18 + 0.10) + 4(0.10 + 0.07 + 0.06) + 5(0.05 + 0.04) = 2.61 bits/symbol

and the code efficiency is:

η_code = H / L × 100% = 2.55 / 2.61 × 100% = 97.7%

Note that Huffman codes are formulated to minimise the average codeword length. They do not necessarily possess error detection properties, but they are uniquely and instantaneously decodable, as defined in section 2.5.2.
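
For comparison with the tabular reduction and splitting procedure of Figures 2.4 and 2.5, here is a rough heap-based Python sketch that builds an equivalent Huffman code for the eight-symbol source. Ties may be broken differently from the figures, so individual codewords can differ from the table above, but the average codeword length still comes out as 2.61 bits/symbol.

import heapq
import itertools

def huffman_code(probabilities):
    """Build a Huffman code and return a {symbol: codeword} mapping."""
    counter = itertools.count()     # tie-breaker so equal probabilities never compare dicts
    heap = [(p, next(counter), {s: ""}) for s, p in probabilities.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)            # two least probable subtrees
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, next(counter), merged))
    return heap[0][2]

P = {"A": 0.10, "B": 0.18, "C": 0.40, "D": 0.05,
     "E": 0.06, "F": 0.10, "G": 0.07, "H": 0.04}
code = huffman_code(P)
avg_len = sum(P[s] * len(code[s]) for s in P)
print(code)
print(round(avg_len, 2))            # 2.61 bits/symbol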


Figure 2.4: Huffman coding of an eight-symbol alphabet – reduction step.

Figure 2.5: Huffman coding – allocation of codewords to the eight-symbol alphabet (splitting step).