UNIT-I
INFORMATION THEORY
1. INTRODUCTION
Communication theory deals with systems for transmitting information from one point to another.
Fig 1: Communication system
2. UNCERTAINTY, INFORMATION and ENTROPY
– Any information source produces an output that is random in nature. So the source output
is modeled as a discrete random variable S, which takes symbols from the alphabet
S = { s0, s1, ……… sK-1 }
with probabilities P(S = sk) = pk, where k = 0, 1, …., K-1
– Therefore, this set of probabilities must satisfy the condition
∑_{k=0}^{K−1} pk = 1
– The symbols emitted by the source during successive signaling intervals are statistically
independent. The source having this property is known as discrete memoryless source.
– The information associated with an event can be described in 3 ways:
o Uncertainty (before the event S = sk occurs)
o Surprise (while the event is occurring)
o Information gain (after the occurrence of the event)
– The amount of information is related to the inverse of the probability of the occurrence.
– Information Gain or self information: The amount of information gained after observing
the event S = sk, which occurs with the probability pk is termed as,
I (sk) =log(1/ pk) = - log pk
– The Units of information I (sk) are determined by the base of the logarithm, which is
usually selected as 2 or e.
When the base is 2 – units are in bits
When the base is e – units are in nats (natural units)
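As a quick numerical sketch (Python; the example probabilities are hypothetical), the self-information follows directly from this definition:

```python
import math

def self_information(p, base=2):
    """Self-information I(s) = -log(p) of an event with probability p.
    base=2 gives bits; base=math.e gives nats."""
    return -math.log(p, base)

# A certain event carries no information; rarer events carry more.
print(self_information(1.0))    # zero bits
print(self_information(0.5))    # 1 bit
print(self_information(0.125))  # 3 bits
```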
2.1 Properties of Information:
– Consider the event S = sk, describing the emission of symbol sk by the source with probability pk.
1. I (sk) =0 , for pk =1
Outcome of an event is known before it occurs, therefore no information is gained.
2. I (sk) ≥ 0 for 0≤ pk ≤ 1
The occurrence of an event S = sk either provides some or no information.
3. I(sk) > I(si) for pk < pi
(i.e), the less probable an event is, the more information we gain when it occurs.
4. I (sk, si) = I(sk) + I(si) if sk & si are statistically independent.
2.2 ENTROPY :
– It is a measure of average information content per source symbol.
– Denoted by H(S)
H(S) = E[I(sk)]
= −∑_{k=0}^{K−1} pk log2 pk bits/symbol
– The quantity H(S) is called the entropy of a discrete memoryless source with source letter S.
2.2.1 Properties of Entropy:
The entropy H(S) of such a source is bounded as follows:
0 ≤ H(S) ≤ log2 K
where K is the radix (number of symbols) of the source alphabet.
1. H(S)=0, if & only if pk =1 , for some k
pk = 0 , otherwise
This Lower bound on entropy corresponds to no uncertainty.
2. H(S) = log2 K, if & only if pk =1/ K , for all k .
This upper bound on entropy corresponds to maximum uncertainty.
2.2.2 Entropy of binary memoryless source:
– Consider a binary source for which symbol 0 occurs with probability p0 & symbol 1 with probability p1= 1- p0.
– Successive symbols emitted by the source are memoryless (i.e) statistically independent.
– The entropy of source is ,
H(s) = -p0 log2 (p0 ) - p1 log2 ( p1)
= −p0 log2 (p0) − (1 − p0) log2 (1 − p0) bits
– Here H(p0) = −p0 log2 (p0) − (1 − p0) log2 (1 − p0) is called the entropy function; when p0 = 0, the entropy H(S) = 0.
Fig 2: Entropy Function
– From the graph we observe that,
1. When p0 = 0, then H(s) = 0
2. When p0 = 1, then H(s) = 0
3. The entropy H(s) attains its maximum value, Hmax = 1 bit, when p1 = p0 = ½
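These observations can be checked numerically. The sketch below (Python) computes H(S) for an arbitrary distribution, and the binary entropy function as a special case:

```python
import math

def entropy(probs):
    """H(S) = -sum(pk log2 pk) in bits/symbol; terms with pk = 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def binary_entropy(p0):
    """Entropy function H(p0) of a binary memoryless source."""
    return entropy([p0, 1 - p0])

print(binary_entropy(0.0))   # no uncertainty
print(binary_entropy(0.5))   # maximum uncertainty: 1 bit
print(entropy([0.25] * 4))   # equiprobable K = 4 source: log2 K = 2 bits
```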
2.2.3 Extension of a Discrete Memoryless Source
– The nth extension of a source treats blocks of n successive source symbols as single symbols. The entropy of the extended source, H(Sn), is equal to n times H(S), the entropy of the original source:
H(Sn) = nH(S)
3. INFORMATION RATE:
– If the source is emitting symbols at a fixed rate of rs symbols/sec, then the average
source information rate R is defined as
R = rs · H(S) bits/sec.
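For example, assuming a hypothetical source that emits rs = 1000 symbols/sec over four equiprobable symbols, the rate follows directly (Python):

```python
import math

def information_rate(rs, probs):
    """R = rs * H(S) in bits/sec for a source emitting rs symbols/sec."""
    h = -sum(p * math.log2(p) for p in probs if p > 0)  # entropy in bits/symbol
    return rs * h

# Hypothetical source: 1000 symbols/sec, four equiprobable symbols (H = 2 bits).
print(information_rate(1000, [0.25, 0.25, 0.25, 0.25]))  # 2000.0 bits/sec
```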
4. SOURCE CODING THEOREM:
– The problem in communication is the efficient representation of data generated by a source.
– The process by which the representation is accomplished is called source encoding.
– The device that performs the representation is called source encoder.
– The primary motivation is the compression of data due to efficient representation of the
symbols.
– Requirement of source encoder:
1. The code words produced by the encoder are in binary form
2. The source code is uniquely decodable
Fig 3: Source Encoding
– Consider a discrete memoryless source whose output sk is converted by the source encoder
into blocks of 0s and 1s, denoted by bk.
– Let the binary code word assigned to the symbol sk by the encoder have length lk,
measured in bits.
– The average code word length L̄ of the source encoder is defined as
L̄ = ∑_{k=0}^{K−1} pk lk
– The parameter L̄ represents the average number of bits per source symbol used in the source
encoding process.
– Let Lmin denote the minimum possible value of L̄. The code efficiency of the source
encoder is then defined as
η = Lmin / L̄
With L̄ ≥ Lmin, we clearly have η ≤ 1. The source encoder is said to be efficient when η approaches unity.
– The minimum value Lmin is determined by Shannon's First theorem (source coding
theorem).
– Shannon's First theorem:
o Given a discrete memoryless source of entropy H(S), the average code word length
L̄ for any distortionless source encoding scheme is bounded as
L̄ ≥ H(S)
o According to the source coding theorem, the entropy H(S) represents a
fundamental limit on the average number of bits / source symbol.
Thus Lmin = H(S), and the efficiency of a source encoder is
η = H(S) / L̄
Types of source coding:
1. Fixed length code (FLC):
Ex: To represent the 26 letters of the English alphabet using bits, each letter can be uniquely
represented using 5 bits (2^5 = 32 > 26).
Disadvantage: FLC does not provide an efficient way of representation (coding). Since
it gives equal importance to both frequently used letters and rarely used letters.
2. Variable length code (VLC):
Frequently used source symbols – assign short code
Rarely used source symbols – assign long code
Ex: Morse code is an example of a variable length code.
- Letters and numerals are encoded into dots (.) and dashes (−)
- For E (the most frequent letter): .
- For Q (a rare letter): − − . −
3. Data compaction (Prefix code):
– It is used to remove redundant information from the signal prior to transmission.
– This is achieved by assigning short descriptions to the most frequent outcomes of the
source output.
– Source-coding schemes that are used in data compaction are e.g. prefix coding, Huffman
coding, and Lempel-Ziv coding.
– Prefix Coding:
• Prefix of code word : Any sequence made up of the initial part of the code word
• Prefix code: A code in which no code word is the prefix of any other code word .
• The prefix code must satisfy the Kraft – McMillan inequality.
• Kraft – McMillan inequality: Suppose a prefix code has been constructed for a discrete memoryless source with
source alphabet { s0, s1, ……… sK-1 } and
probabilities { p0, p1, ……… pK-1 },
and the code word for symbol sk has length lk, k = 0, 1, …., K-1. Then the code word
lengths of the code always satisfy a certain inequality known as the Kraft –
McMillan inequality:
∑_{k=0}^{K−1} 2^(−lk) ≤ 1
where the 2 refers to the radix (number of symbols) of the binary alphabet.
Example:
Source symbol | Probability of occurrence | Code I | Code II | Code III
s0 | 0.5 | 0 | 0 | 0
s1 | 0.25 | 1 | 10 | 01
s2 | 0.125 | 00 | 110 | 011
s3 | 0.125 | 11 | 111 | 0111
Fig 4: Decision Tree
From the above table we observe that,
1. Code I violates the Kraft – McMillan inequality; therefore it cannot be a prefix code.
2. The Kraft – McMillan inequality is satisfied by both codes II and III; but only code II is a prefix code.
Kraft – McMillan inequality for the above example:
Code I: 2^−1 + 2^−1 + 2^−2 + 2^−2 = 1.5 > 1 (violated)
Code II: 2^−1 + 2^−2 + 2^−3 + 2^−3 = 1 ≤ 1 (satisfied)
Code III: 2^−1 + 2^−2 + 2^−3 + 2^−4 = 0.9375 ≤ 1 (satisfied)
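These sums can be checked with a few lines of Python (the code words are those of the table above):

```python
def kraft_sum(codewords):
    """Kraft-McMillan sum: sum over code words of 2^(-lk)."""
    return sum(2 ** -len(w) for w in codewords)

code_I   = ["0", "1", "00", "11"]
code_II  = ["0", "10", "110", "111"]
code_III = ["0", "01", "011", "0111"]

print(kraft_sum(code_I))    # 1.5    > 1: inequality violated
print(kraft_sum(code_II))   # 1.0    <= 1: satisfied (and code II is a prefix code)
print(kraft_sum(code_III))  # 0.9375 <= 1: satisfied, yet code III is not a prefix code
```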
Instantaneous codes:
– Prefix coding has an important feature that it is always uniquely decodable.
– Prefix codes can also be referred to as instantaneous codes, meaning that the
decoding process is achieved immediately.
– Given a discrete memoryless source of entropy H(S), the average code word length L̄
for any distortionless source encoding scheme is bounded as
H(S) ≤ L̄ < H(S) + 1
– For the extended source code,
H(Sn) ≤ L̄n < H(Sn) + 1
nH(S) ≤ L̄n < nH(S) + 1
5. SHANNON- FANO’S CODING:
– It is built on top-down approach.
Procedure:
1. List the source symbols in order of decreasing probability.
2. Partition this ensemble into two almost equiprobable groups.
3. Assign 0 to one group and 1 to other group. These form the starting code symbols of the
code.
4. Repeat the steps 2 & 3 on each of the subgroups, until the subgroups contain only one
source symbol to determine the succeeding code symbol of the code words.
5. For convenience, a code tree may be constructed and codes read off directly.
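The procedure above can be sketched recursively (Python; this assumes the symbol list is already sorted by decreasing probability, and splits where the running sum is closest to half):

```python
def shannon_fano(symbols):
    """symbols: list of (name, probability) sorted by decreasing probability.
    Returns {name: codeword} built by recursive equiprobable splitting."""
    codes = {name: "" for name, _ in symbols}

    def split(group):
        if len(group) <= 1:
            return
        total = sum(p for _, p in group)
        # Choose the split point whose running sum is closest to total/2.
        acc, best_i, best_diff = 0.0, 1, float("inf")
        for i in range(1, len(group)):
            acc += group[i - 1][1]
            diff = abs(total - 2 * acc)
            if diff < best_diff:
                best_i, best_diff = i, diff
        for name, _ in group[:best_i]:
            codes[name] += "0"
        for name, _ in group[best_i:]:
            codes[name] += "1"
        split(group[:best_i])
        split(group[best_i:])

    split(symbols)
    return codes

ensemble = [("x1", 0.25), ("x2", 0.25), ("x3", 0.125), ("x4", 0.125),
            ("x5", 0.0625), ("x6", 0.0625), ("x7", 0.0625), ("x8", 0.0625)]
print(shannon_fano(ensemble))
```

For this ensemble the splits are exactly equiprobable at every level, so the code words come out as 00, 01, 100, 101, 1100, 1101, 1110, 1111.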
Problem:
1. Apply the Shannon-Fano encoding procedure for the following ensemble:
X = { x1, x2, ……… x8 } with probabilities P = { 0.25, 0.25, 0.125, 0.125, 0.0625, 0.0625,
0.0625, 0.0625 }
Solution:
The ensemble is partitioned repeatedly into two equiprobable groups, assigning 0 to one group and 1 to the other at each step (steps 1-4). The resulting code words are listed in the table below.
Source symbol | Probability of occurrence (pk) | Code word | Code word length (lk)
x1 | 0.25 | 00 | 2
x2 | 0.25 | 01 | 2
x3 | 0.125 | 100 | 3
x4 | 0.125 | 101 | 3
x5 | 0.0625 | 1100 | 4
x6 | 0.0625 | 1101 | 4
x7 | 0.0625 | 1110 | 4
x8 | 0.0625 | 1111 | 4
(i) Average code word length of the source encoder:
L̄ = ∑_{k=1}^{8} pk lk
= 0.5 + 0.5 + 0.375 + 0.375 + 0.25 + 0.25 + 0.25 + 0.25
= 2.75 bits/symbol
(ii) Average information content per source symbol (entropy):
H(S) = ∑_{k=1}^{8} pk log2 (1/pk)
= 2 [0.25 log2 (1/0.25)] + 2 [0.125 log2 (1/0.125)] + 4 [0.0625 log2 (1/0.0625)]
= 2.75 bits/symbol
(iii) Efficiency:
η = H(S) / L̄ = 2.75 / 2.75 = 1
6. HUFFMAN CODING:
– Each symbol of a given alphabet is assigned a sequence of bits according to the symbol probability.
– The Huffman tree is built by a bottom-up approach.
Procedure:
1. Calculate the probability of the list of symbols .
2. Source symbols are listed in order of decreasing probability.
3. The two source symbols of lowest probability are assigned a 0 and a 1. This step is referred to as a
splitting stage.
4. These two source symbols are combined into a new source symbol with probability equal to
the sum of the two original probabilities and it is placed in the list according to its new
value.
5. Recursively apply steps 3 and 4, until each symbol has become a corresponding code leaf on
a tree.
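The bottom-up merging can be sketched with a priority queue (Python; `heapq` is the standard-library heap). The exact bits assigned depend on tie-breaking, but the average code word length does not:

```python
import heapq
import itertools

def huffman(probs):
    """probs: {symbol: probability}. Returns {symbol: codeword}, built by
    repeatedly merging the two least probable entries (bottom-up)."""
    counter = itertools.count()  # tie-breaker so the heap never compares dicts
    heap = [(p, next(counter), {s: ""}) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)
        p2, _, c2 = heapq.heappop(heap)
        # Prepend 0 to one group's code words and 1 to the other's, then merge.
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, next(counter), merged))
    return heap[0][2]

probs = {"s0": 0.4, "s1": 0.2, "s2": 0.2, "s3": 0.1, "s4": 0.1}
codes = huffman(probs)
avg_len = sum(p * len(codes[s]) for s, p in probs.items())
print(codes, avg_len)  # average length is 2.2 bits/symbol
```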
Problem:
The five symbols of the alphabet of a discrete memoryless source and their probabilities are {s0,
s1, s2, s3, s4} and {0.4, 0.2, 0.2, 0.1, 0.1} respectively. Compute the codewords of Huffman Code.
Also compute the entropy of the source.
Solution:
One possible Huffman code (the bit assignments depend on how ties are broken, but the average length does not) is
s0 → 00, s1 → 10, s2 → 11, s3 → 010, s4 → 011
L̄ = ∑ pk lk = 0.4(2) + 0.2(2) + 0.2(2) + 0.1(3) + 0.1(3) = 2.2 bits/symbol
H(S) = ∑ pk log2 (1/pk) = 2.12193 bits/symbol
η = H(S) / L̄ = 2.12193 / 2.2 = 0.96
The average code word length satisfies the source coding bound
H(S) ≤ L̄ < H(S) + 1
2.12 ≤ 2.2 < 3.12
6.1 Properties of Huffman Coding
Huffman coding uses longer codewords for symbols with smaller probabilities and shorter codewords for symbols that often occur.
The two longest codewords differ only in the last bit.
The codewords are prefix codes and uniquely decodable.
It should satisfy Shannon's first theorem (source coding theorem): H(S) ≤ L̄ < H(S) + 1
6.2 Extended Huffman Coding:
We can encode a group of symbols together and get better performance.
It should satisfy Shannon's first theorem: H(Sn) ≤ L̄n < H(Sn) + 1, i.e. H(S) ≤ L̄n/n < H(S) + 1/n.
Problem:
Consider the source with alphabet A = {a1, a2, a3} and the probabilities p(a1) = 0.8,
p(a2) = 0.02, p(a3) = 0.18.
Solution:
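As a numerical sketch (Python, standard library only), the code below builds Huffman code lengths for the given alphabet and for its second-order extension (all nine symbol pairs), showing how grouping symbols improves the coding efficiency:

```python
import heapq
import itertools
import math

def huffman_lengths(probs):
    """Return Huffman code word lengths for a list of probabilities."""
    counter = itertools.count()  # tie-breaker for equal probabilities
    heap = [(p, next(counter), [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    while len(heap) > 1:
        p1, _, g1 = heapq.heappop(heap)
        p2, _, g2 = heapq.heappop(heap)
        for i in g1 + g2:  # every symbol in the merged group gains one bit
            lengths[i] += 1
        heapq.heappush(heap, (p1 + p2, next(counter), g1 + g2))
    return lengths

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

single = [0.8, 0.02, 0.18]
pairs = [p * q for p in single for q in single]  # second-order extension

for name, dist, n in [("single symbols", single, 1), ("symbol pairs", pairs, 2)]:
    avg = sum(p * l for p, l in zip(dist, huffman_lengths(dist))) / n
    print(name, "L/n =", round(avg, 4), "efficiency =", round(entropy(single) / avg, 4))
```

For single symbols L̄ = 1.2 bits/symbol (η ≈ 0.68); coding pairs brings L̄/2 down to about 0.86 bits per original symbol (η ≈ 0.95), illustrating the benefit of extension.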
7. JOINT AND CONDITIONAL ENTROPY
7.1 Joint Entropy:
– The joint entropy H(X,Y) of a pair of discrete random variables (X,Y) with a joint probability distribution P(X,Y) is defined as
H(X,Y) = −∑_{j} ∑_{k} p(xj, yk) log2 p(xj, yk)
7.2 Conditional Entropy:
– It is defined as the amount of information gained by the transmitter when the state of the receiver is known.
– Equivalently, it is the amount of uncertainty remaining about the channel input after the channel output has been observed.
– The mutual information (Section 7.3) is the average amount of information that you get about X from observing the value of Y.
– To measure the uncertainty of X remaining after observing the channel output Y, the conditional entropy H(X|Y) is defined as
H(X|Y) = −∑_{j} ∑_{k} p(xj, yk) log2 p(xj | yk)
7.3 MUTUAL INFORMATION (M.I):
– The difference H(X) – H(X|Y) represents the uncertainty about the channel input that is
resolved by observing the channel output.
– Therefore the mutual information is termed as,
I(X;Y) = H(X) – H(X|Y)
Similarly, I(Y;X) = H(Y) – H(Y|X)
H(X) - is the entropy of the channel input X
H(X|Y) – is the conditional entropy of the channel input X after observing the channel
output Y
7.3.1 Properties of Mutual information :
Property 1 : The mutual information of a channel is symmetric; (i.e)
I(X;Y) = I(Y;X)
Proof:
Starting from
I(X;Y) = H(X) − H(X|Y)
and substituting H(X) = ∑_{j} p(xj) log2 [1/p(xj)] together with the conditional entropy H(X|Y), we obtain
I(X;Y) = ∑_{j} ∑_{k} p(xj, yk) log2 [ p(xj | yk) / p(xj) ]
From Bayes' rule for conditional probabilities,
p(xj | yk) / p(xj) = p(yk | xj) / p(yk)
Substituting this into the previous expression gives
I(X;Y) = ∑_{j} ∑_{k} p(xj, yk) log2 [ p(yk | xj) / p(yk) ] = H(Y) − H(Y|X) = I(Y;X)
Hence proved.
Property 2: The mutual information is always nonnegative, (i.e) I(X;Y) ≥ 0
Proof:
From the conditional probability, p(xj | yk) = p(xj, yk) / p(yk); substituting this into the expression for I(X;Y), we get
I(X;Y) = ∑_{j} ∑_{k} p(xj, yk) log2 [ p(xj, yk) / ( p(xj) p(yk) ) ]
By applying the fundamental inequality ∑_{k} pk log2 (qk / pk) ≤ 0 directly, we obtain
I(X;Y) ≥ 0
I(X;Y) ≥ 0 means we cannot lose information, on the average, by observing the output of a
channel.
I(X;Y) = 0 means the channel input and output are statistically independent.
Property 3: The mutual information of a channel is related to the joint entropy of the
channel input and channel output by,
I(X;Y) = H(X) + H(Y) – H(X,Y)
where H(X,Y) is the joint entropy,
H(X,Y) = −∑_{j} ∑_{k} p(xj, yk) log2 p(xj, yk)
7.4 Chain Rule:
– The relationship between joint and conditional entropy is given as,
H(X,Y) = H(X) + H(Y|X)
H(Y,X) = H(Y) + H(X|Y)
Proof:
Since the joint probability factors as p(x,y) = p(x) p(y|x), taking −log2 of both sides and averaging over the joint distribution gives
−∑∑ p(x,y) log2 p(x,y) = −∑∑ p(x,y) log2 p(x) − ∑∑ p(x,y) log2 p(y|x)
H(X,Y) = H(X) + H(Y|X)
The second identity follows in the same way from p(x,y) = p(y) p(x|y).
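The chain rule and the mutual-information identities can be verified numerically from any joint distribution; a sketch in Python (the joint distribution here is hypothetical, chosen so that X and Y are independent):

```python
import math

def H(probs):
    """Entropy in bits of a probability list."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def joint_quantities(p_xy):
    """p_xy[j][k] = p(xj, yk). Returns (H(X), H(Y), H(X,Y), H(X|Y), I(X;Y))."""
    p_x = [sum(row) for row in p_xy]
    p_y = [sum(col) for col in zip(*p_xy)]
    h_xy = H([p for row in p_xy for p in row])
    h_x, h_y = H(p_x), H(p_y)
    h_x_given_y = h_xy - h_y   # chain rule: H(X,Y) = H(Y) + H(X|Y)
    mi = h_x - h_x_given_y     # I(X;Y) = H(X) - H(X|Y)
    return h_x, h_y, h_xy, h_x_given_y, mi

# Hypothetical joint distribution over two binary variables (independent case).
p_xy = [[0.25, 0.25],
        [0.25, 0.25]]
print(joint_quantities(p_xy))  # I(X;Y) = 0 since X and Y are independent
```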
8. DISCRETE MEMORYLESS CHANNELS:
– A discrete memoryless channel is a statistical model with an input of X and output of Y
which is a noisy version of X (here both are random variables)
– In each time slot the channel accepts an input symbol X selected from an alphabet {x0, x1, ...., xJ-1} and emits an output symbol Y from an alphabet {y0, y1, ...., yK-1}.
– The channel is said to be “discrete” when both of the alphabets have finite sizes.
– It is said to be “memoryless” when the current output symbol depends only on the current
input symbol and not on any of the previous ones.
Fig 3: Discrete memoryless channel
Input alphabet X = {x0, x1, ...., xJ-1} Output alphabet Y = {y0, y1, ...., yK-1}
Transition Probabilities:
p(yk | xj) = P(Y = yk | X = xj) for all j and k
0 ≤ p(yk | xj) ≤ 1 for all j and k
– Also the input alphabet X and output alphabet Y need not have the same size.
– For a discrete memoryless channel, it is convenient to arrange the various transition probabilities of the
channel in the form of a matrix as follows:
P = [ p(y0|x0) p(y1|x0) ⋯ p(yK−1|x0)
p(y0|x1) p(y1|x1) ⋯ p(yK−1|x1)
⋮
p(y0|xJ−1) p(y1|xJ−1) ⋯ p(yK−1|xJ−1) ]
– The J-by-K matrix P is called the channel matrix or transition matrix.
– The fundamental property of the channel matrix P is that the sum of the elements along any row
of the matrix is always equal to 1:
∑_{k=0}^{K−1} p(yk | xj) = 1, for all j
– The joint probability distribution of the random variables X and Y is given by
p(xj, yk) = P(X = xj, Y = yk)
= P(Y = yk | X = xj) P(X = xj)
= p(yk | xj) p(xj)    (8.1)
– The marginal probability distribution of the output random variable Y is obtained by
averaging out the dependence of p(xj, yk) on xj, as shown by
p(yk) = P(Y = yk)
= ∑_{j=0}^{J−1} P(Y = yk | X = xj) P(X = xj)
= ∑_{j=0}^{J−1} p(yk | xj) p(xj), for k = 0, 1, ...., K-1    (8.2)
– The probabilities p(xj) for j = 0, 1, ……. J-1 are known as the a priori probabilities of the
various input symbols.
– Equation 8.2 states that, given the a priori probabilities p(xj) and the channel matrix of
transition probabilities p(yk | xj), we may calculate the output probabilities p(yk).
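Equation 8.2 is just a matrix-vector product; a sketch in Python (the channel matrix below is a hypothetical binary symmetric channel with error probability 0.1):

```python
def output_distribution(priors, channel_matrix):
    """p(yk) = sum_j p(yk | xj) p(xj); channel_matrix[j][k] = p(yk | xj)."""
    K = len(channel_matrix[0])
    return [sum(priors[j] * channel_matrix[j][k] for j in range(len(priors)))
            for k in range(K)]

# Hypothetical binary symmetric channel, error probability 0.1.
P = [[0.9, 0.1],
     [0.1, 0.9]]
print(output_distribution([0.5, 0.5], P))  # equiprobable inputs give equiprobable outputs
print(output_distribution([0.8, 0.2], P))  # approximately [0.74, 0.26]
```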
9. CHANNEL CAPACITY:
– The capacity of a channel is defined as the intrinsic ability of the channel to convey information.
– The channel capacity of a discrete memoryless channel is the maximum of the mutual information
I(X;Y) in any single use of the channel, where the maximization is over all possible input
probability distributions { p(xj) } on X.
– Channel capacity is denoted by C:
C = max_{p(xj)} I(X;Y)    (9.1)
– The channel capacity C is measured in bits per channel use, or bits per transmission.
9.1 BINARY SYMMETRIC CHANNEL:
– It is the special case of the discrete memoryless channel with J=K=2.
– The channel has two input symbols (x0 = 0, x1 = 1) and two output symbols (y0 = 0, y1 =1 )
– The channel is symmetric because the probability of receiving a 1 if a 0 is sent is the same
as the probability of receiving a 0 if a 1 is sent .
– Conditional probability of error is denoted by p.
Fig 3: Transition Probability diagram of Binary symmetric channel
Channel Capacity for Binary Symmetric channel:
– Consider the binary symmetric channel which is described by the transition probability
diagram fig 3.
– This diagram is defined by the conditional probability of error p.
– The entropy H(X) is maximized when the channel input probability p(x0)=p(x1)= 1/2.
– The mutual information I(X;Y) is similarly maximized, so that the capacity can be written as
C = I(X;Y) evaluated at p(x0) = p(x1) = 1/2
– From fig 3, p(y0 | x1) = p(y1 | x0) = p and
p(y0 | x0) = p(y1 | x1) = 1 − p
– Substituting these channel transition probabilities into the mutual information
I(X;Y) = ∑_{j} ∑_{k} p(xj) p(yk | xj) log2 [ p(yk | xj) / p(yk) ]
with J = K = 2, and then setting the input probabilities p(x0) = p(x1) = 1/2 in accordance with
equation 9.1,
– the capacity of the binary symmetric channel is
C = 1 + p log2 p + (1 − p) log2 (1 − p)    (9.2)
– By using the entropy function
H(p) = −p log2 p − (1 − p) log2 (1 − p)
equation 9.2 can be reduced to
C = 1 − H(p)
– Thus the channel capacity of the binary symmetric channel is C = 1 − H(p).
– The Channel capacity C varies with the probability of error p as shown in fig 4.
Observations:
When p=0, the channel is noise free. (i.e) the channel capacity C attains its maximum
value of 1 bit per channel use, which is exactly the information in each channel input. At
this value of p, the entropy function H(p) attains its minimum value of zero.
When p=1/2 due to noise, the channel capacity C attains its minimum value of 0,
whereas H(p) attains its maximum value of one. In such a case the channel is said to be
useless.
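Both observations are endpoints of C = 1 − H(p), which is easy to evaluate (Python sketch):

```python
import math

def bsc_capacity(p):
    """C = 1 - H(p) for a binary symmetric channel with error probability p."""
    if p in (0.0, 1.0):
        return 1.0  # no residual uncertainty (p = 1 just inverts every bit)
    return 1.0 + p * math.log2(p) + (1 - p) * math.log2(1 - p)

print(bsc_capacity(0.0))   # 1.0 -- noise-free channel
print(bsc_capacity(0.5))   # 0.0 -- useless channel
print(bsc_capacity(0.11))  # roughly 0.5 bits per channel use
```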
9.2 CHANNEL CODING THEOREM ( SHANNON’S SECOND THEOREM):
– The design goal of channel coding is to increase the resistance of the communication
systems to channel noise.
– Channel coding consists of mapping the incoming data sequence into a channel input
sequence and inverse mapping the channel output sequence into an output data sequence,
so that the channel noise of the system is minimized.
– Mapping and inverse mapping operations are performed by encoders and decoders.
– The channel encoder and decoder should be designed to optimize the overall reliability of a
communication system.
– Block Codes: The message sequence is divided into sequential blocks, each k bits long.
– Code Rate: Each k-bit block is mapped into an n-bit block by the channel coder, where
n > k. The ratio r = k/n is called the code rate,
where k is the message block length, n is the code block length, and r is less than unity.
– A discrete memoryless source has a source alphabet S and entropy H(S) bits/source symbol.
The source emits one symbol every Ts seconds. Hence the average information rate
of the source is H(S)/Ts bits/second.
– The discrete memoryless channel has a channel capacity equal to C bits per use of the channel.
– The channel is capable of being used once every Tc seconds. Hence the channel capacity
per unit time is C/Tc bits/ seconds, which represents the maximum rate of information
transfer over the channel.
– The channel coding theorem for a discrete memoryless channel is stated in 2 parts:
1. If H(S)/Ts ≤ C/Tc, where C/Tc is called the critical rate, the source output can be
transmitted over the channel and be reconstructed with an arbitrarily small probability of
error.
2. If H(S)/Ts > C/Tc, it is not possible to transmit the source output over
the channel with an arbitrarily small probability of error.
C = channel capacity, Ts and Tc = source and channel signaling intervals, H(S) = entropy.
Drawbacks:
It does not show us how to construct a good code.
It does not give a precise result for the probability of symbol error after decoding the channel
output.
11. SHANNON LIMIT:
– Shannon showed that any communications channel, such as a telephone line, a radio band, or a
fiber-optic cable, can be characterized by two factors:
1. bandwidth 2. noise
– Bandwidth is the range of electronic, optical or electromagnetic frequencies that can be used
to transmit a signal;
– Noise is anything that can disturb that signal.
– Given a channel with particular bandwidth and noise characteristics, Shannon showed how
to calculate the maximum rate at which data can be sent without error.
– This rate is called the Shannon limit.
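For a band-limited channel with additive white Gaussian noise, the rate in question is given by the Shannon-Hartley formula C = B log2(1 + S/N); a sketch (the telephone-line numbers below are hypothetical):

```python
import math

def shannon_limit(bandwidth_hz, snr_linear):
    """Shannon-Hartley capacity C = B * log2(1 + S/N) in bits/sec.
    snr_linear is the signal-to-noise power ratio (not in dB)."""
    return bandwidth_hz * math.log2(1 + snr_linear)

# Hypothetical telephone line: 3 kHz bandwidth, 30 dB SNR (power ratio 1000).
print(shannon_limit(3000, 1000))  # roughly 29.9 kbit/s
```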