
Part 7 – A few bits of Information Theory

What do we mean by “Information”? & how do we measure it?

“They were brusque to the world of manners, they had built their hope of heaven on the binary system and the computer, 1 and 0, Yes and No – ”
Norman Mailer, The Armies of the Night


Some motivating background

At the centre of most computational activity there will (eventually) arise a need to send “data” (eg text) to another party. Consider sending an e-mail:

a. We hit a keypad on a keyboard (eg ’S’)

b. This action results in the S being added and stored in the text.

c. The message itself will be sent through some “channel”: eg a mobile-network frequency, optical cable, coaxial cable, copper wire, etc.

d. On arrival the message is processed (ie the content is determined).

e. The text is then “displayed” in some form.


What problems arise?

a. The characters (key presses) have to be encoded (usually using binary).

b. When a message is sent through a channel:

P1. How much time will it take? (“transmission rate”)

P2. What if there is interference? (“noisy” channels)

Information Theory

Information Theory is the field of Computer Science that studies questions such as these. It offers robust answers to the following:

I1. How to define information content and uncertainty in information.

I2. What (if anything) limits speed and data compression.

I3. Is it always possible reliably to transmit data irrespective of noise?

I4. What steps can be taken to ensure data is received correctly?

In the final part of the module we give a very basic overview of Information Theory.


The Shannon Paradigm for Communication

[Diagram: the Shannon communication model. SOURCE (eg keyboard) feeds ENCODE (eg modulator), which feeds the CHANNEL (eg cable); the channel output goes to DECODE (eg demodulator) and then to the DESTINATION (eg display). NOISE / INTERFERENCE (eg static, washing machine, vacuum etc) acts on the channel.]

The model depicted is from C. E. Shannon. A Mathematical Theory of Communication. Bell System Technical Journal, 27:379–423, 623–656, July, October 1948. Shannon’s paper is the origin of Information Theory as a scientific discipline: it formulates the key issues of interest and the basic supporting framework, and derives fundamental results in the field.


Information & Uncertainty

“The fundamental problem of communication is that of reproducing at one point either exactly or approximately a message selected at another point.” C.E. Shannon, op. cit., p. 379

What is information?

When we communicate (in speech, texting, or (proper) writing) the basic units are words. A word is just a (recognised by some authority) sequence of symbols from an alphabet. Thus, at the most elementary level:

C1. “communication” involves “sending” “information”

C2. “information” is provided as a sequence of symbols.

In order effectively to communicate, the receiver must be able to recognise individual symbols in the stream of symbols that have been sent.


Information and the element of “surprise”

Basic view of symbol stream

[Diagram: a SOURCE emits a symbol stream (eg ACXQYYZTQEFOAPJSXQ...) which flows to the DESTINATION.]

The receiver will see a stream of symbols: some will be “more common” than others; some symbols will be quite rare. For example, in written English the letters ’E’ and ’T’ occur often; the letters ’Z’ and ’J’, however, are rarer. In total a receiver will be “less surprised” to see some symbols and “more surprised” to see others.


Surprise is information gained
Information gained is uncertainty reduced

In English the letter ’Q’ is almost always followed by the letter ’U’. When a receiver sees the letter ’Q’ in an input stream of text, “almost surely” the next symbol seen will be ’U’. Very few words in English begin with the letter ’X’. So if a symbol for the space character is seen (’ ’) a receiver would be “surprised” if the following symbol is an ’X’.

In summary we have an alphabet, S = {s_1, s_2, ..., s_n}, of n symbols. A message is a sequence of symbols. We have two questions:

Q1. How do we measure the “uncertainty” (equivalently “degree of surprise” or “information content”)?

Q2. How should we “encode” messages (using binary) to achieve desirable properties?


Messages as “sequences of events”

In a stream of symbols (as noted earlier) we might expect to see some symbols “very frequently” and others “only rarely”. Thus we can think of attaching a probability, P_i, to each symbol s_i. When the receiver sees s_i, “information” has been “gained”. This is because, from a state of being “unsure” what the next symbol would be, the observer’s “uncertainty” has decreased.

$$u_i = -\log_2 P_i$$

informally “the decrease in uncertainty when s_i is seen”, ie “the information gained from seeing s_i”.

Very small example

If we have an alphabet in which one symbol (s_1 say) is the only symbol that ever appears then P_1 = 1, so that u_1 = 0. There is no uncertainty (only s_1 will appear) and no increase in information (if anything is seen it can only be s_1).
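A tiny sketch of this quantity in Python (the example probabilities are illustrative, not taken from the slides):

from math import log2

def surprisal(p: float) -> float:
    """u = -log2(p): the information gained, in bits, on seeing a symbol of probability p."""
    return -log2(p)

print(surprisal(1.0))    # 0.0 -- a certain symbol carries no information
print(surprisal(0.5))    # 1.0 -- one bit
print(surprisal(0.125))  # 3.0 -- rarer symbols are more "surprising"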


Shannon’s Uncertainty Formula & Limits on Source Coding

What is meant by “Source Coding”?

The symbols we use to compose messages cannot be “sent directly”: eg alphabetic symbols need to be encoded as binary sequences. Various standards (ASCII, UNICODE etc) are used for this purpose. The (general) source coding problem has three principal questions. For a given source (stream of symbols from an alphabet):

Q1. Is it possible to compress the number of bits in the data stream? If this is possible:

Q2. On average, how many bits are needed to represent any symbol?

Q3. What are good algorithms that maximise compression?

A high level of compression increases “throughput” (more data sent in shorter time) but “compression” must allow the receiver to decode the text (uncompress) “easily”.


The Uncertainty Formula & Source Entropy

Symbols appearing “independently”

Recall that we have introduced P_i as “the probability of seeing s_i in a stream”. For the “simplified” development being given we assume that these are “independent”: ie should s_i appear at time T, the probability of seeing s_j at time T + 1 is still P_j. Shannon derives (NOT “defines”) the “Uncertainty in a data source from alphabet S” as

$$H(S) = -\sum_{i=1}^{n} P_i \log_2 P_i$$

H(S) is sometimes referred to as the entropy of the source S. Entropy depends on the distribution P, not on the alphabet S.

The Source Coding Theorem

For L(S) (“average number of bits to encode a symbol”): L(S) ≥ H(S).


Some examples

S = {A, B, C, D}. If P_i = 0.25 for each symbol (ie each is equally likely to be seen) then

H(S) = −4 × 0.25 × log2(0.25) = −log2(0.25) = 2

We need at least two bits per symbol on average. We could use

A = 00 ; B = 01 ; C = 10 ; D = 11

With this scheme, sending a stream of 104857600 characters (100 Mega) requires 200 Mbits. What if P_A = 0.5, P_B = 0.25, P_C = P_D = 0.125? We could use the same coding scheme as before. We know, however, that the entropy is now:

(−0.5 log2 0.5) + (−0.25 log2 0.25) + 2(−0.125 log2 0.125) = 1.75
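As a quick check of these two calculations, a minimal Python sketch (the probabilities are the ones above):

from math import log2

def entropy(probs):
    """H(S) = -sum of P_i * log2(P_i) over the distribution."""
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([0.25, 0.25, 0.25, 0.25]))   # 2.0 bits per symbol
print(entropy([0.5, 0.25, 0.125, 0.125]))  # 1.75 bits per symbol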


Exploiting the reduction in entropy

What if we now use the code

A = 1 ; B = 01 ; C = 000 ; D = 001

Given the symbol probabilities, in our 100 Mega stream we “expect” there to be:
104857600/2 = 52428800 As (50 Megabits);
104857600/4 = 26214400 Bs (2 × 26214400, ie 50 Megabits);
26214400 Cs and Ds (3 × 26214400 ∼ 75 Megabits).
In total we expect to transmit 175 Megabits.
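A short sketch reproducing this expected-length calculation (the code and probabilities are those given above):

probs = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}
code  = {"A": "1", "B": "01", "C": "000", "D": "001"}
stream_len = 104857600  # 100 Mega symbols

expected_bits = sum(stream_len * p * len(code[s]) for s, p in probs.items())
print(expected_bits)               # 183500800 bits, ie 175 * 2**20 = 175 Megabits
print(expected_bits / stream_len)  # 1.75 bits per symbol, matching the entropy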

In order to be effective the estimate of the relative frequency of different symbols must be accurate. For a “large” data file this can be determined exactly, so allowing data compression with no loss of information. It is assumed, of course, that the receiver will be informed of (or will know) the code scheme used.


A Bit more on Source Coding Schemes

The Source Coding Theorem gives a lower bound on the number of bits that will be needed, on average, to encode a symbol.

Coding schemes

A binary code, C, is a set of codewords

C = {c_1, c_2, ..., c_t}

We think of each individual alphabet symbol s being assigned (coded as) a particular codeword C(s) = c_i in C.


Codes and their properties

Unique Decipherability

A code C is uniquely decipherable (ud) if every binary sequence formed by concatenating codewords from C decodes to exactly one sequence of alphabet symbols from S.

Instantaneous Codes

A code is instantaneous if, as soon as any codeword c_i in C has been received in full, this uniquely determines the symbol it encodes.

Prefix Codes

A code C is a prefix code if, given any two distinct codewords c_i and c_j in C, it is NOT the case that c_i is identical to the first sequence of characters in c_j (and vice versa). In formal language: no codeword c_i is a prefix of another codeword c_j.


Examples

C = {0, 01, 00}
C is not a prefix code: 0 is a prefix of 00.
C is not ud: the sequence 000 could be either 0 followed by 00 or 00 followed by 0.
C is not instantaneous: the sequence 0 could be the codeword 0 or the start of the codeword 01.

C = {0, 10}
C is instantaneous, ud, and a prefix code.
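A minimal Python sketch of the prefix-code check for these two examples (the helper name is illustrative, not from the slides):

def is_prefix_code(codewords):
    """True iff no codeword is a prefix of another, distinct codeword."""
    return not any(c1 != c2 and c2.startswith(c1)
                   for c1 in codewords for c2 in codewords)

print(is_prefix_code({"0", "01", "00"}))  # False: "0" is a prefix of "01" and of "00"
print(is_prefix_code({"0", "10"}))        # True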

3 basic properties of prefix codes

P1. Every prefix code is instantaneous.

P2. Every instantaneous code is a prefix code.

P3. Every prefix code is ud (but there are ud codes that are not prefix codes).


Some Example Codes

Original ASCII

Lower and upper case alphabetic characters, digits, punctuation symbols, and so-called “CONTROL (CTRL)” characters (128 in total) are mapped to 7-bit binary codes. For example

Symbol   ASCII     Symbol    ASCII
A        1000001   0         0110000
a        1100001   9         0111001
!        0100001   %         0100101
ESC      0011011   (space)   0100000

The set of all 128 binary sequences containing 7 bits defining ASCII is a prefix code.
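These fixed-length 7-bit patterns are easy to check directly, eg in Python:

# Every codeword has the same length, so no codeword can be a proper prefix of another.
for ch in "Aa!09% ":
    print(repr(ch), format(ord(ch), "07b"))
# 'A' 1000001, 'a' 1100001, '!' 0100001, '0' 0110000, '9' 0111001, '%' 0100101, ' ' 0100000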


The Source Coding Theorem Extended

L-extensions of an alphabet, S

The L-extension of S is simply the set of all source words that can be built using S and having exactly L symbols.

Suppose that C is a (binary) prefix code for the L-extension of S. Using n(c) to denote the number of bits in the codeword c of C, define n(L) as

$$n(L) = \frac{\sum_{c_i \in C} p_i\, n(c_i)}{L}$$

Here p_i is the probability of the source word in the L-extension of S encoded by c_i being seen. The quantity n(L) is the “average length (number of bits)” of the codewords in C per source symbol (taking into account the likelihood of the source word encoded actually being seen). The more a code can decrease n(L), the more efficient (increased data compression) that code is.
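As a small illustration (this alphabet, distribution and code are my own example, not taken from the slides): take S = {A, B} with P_A = 0.9 and P_B = 0.1. Any prefix code for single symbols needs at least 1 bit per symbol, but a prefix code for the 2-extension does better:

from math import log2

# Probabilities of the 2-extension words, assuming independent symbols
probs = {"AA": 0.81, "AB": 0.09, "BA": 0.09, "BB": 0.01}
# An illustrative prefix code: shortest codeword for the likeliest word
code  = {"AA": "0", "AB": "10", "BA": "110", "BB": "111"}

L = 2
n_L = sum(p * len(code[w]) for w, p in probs.items()) / L
H   = -(0.9 * log2(0.9) + 0.1 * log2(0.1))  # entropy of the single-symbol source

print(n_L)  # 0.645 bits per source symbol: below 1, though still above H
print(H)    # about 0.469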


The Source Coding Theorem - L-extensions & Prefix Codes

For any ud binary code which describes the L-extension of S:

$$n(L) \;\geq\; \frac{H_L(S)}{L} \qquad \text{where}\quad H_L(S) = -\sum_{w_i \in L\text{-extension of } S} P_{w_i} \log_2 P_{w_i}$$

There is a prefix code that encodes the L-extension of S and attains

$$n(L) \;<\; \frac{H_L(S) + 1}{L}$$

(For an independent source H_L(S) = L × H(S), so these bounds say that H(S) ≤ n(L) < H(S) + 1/L.)

In total:

SC1. Given any distribution on the probability of words in the L-extension of S, there is a (readily computable) lower bound on the average number of bits required in a uniquely decipherable code for this L-extension.

SC2. There is a prefix code for the L-extension of S that matches this lower bound (to within a “small” additive term of 1/L).


A minor(?) problem

We know such a prefix code exists but:

1. How “easy” is it to construct?

2. How “easy” is it for the receiver to apply and recover the source data?

There are (as it turns out) a number of efficient methods that, given the source alphabet S, the value of L (for forming L-extensions) and the relevant probabilities P^L_i determining the likelihood of w_i in the L-extension occurring, will construct a set of codewords achieving the optimal average number of bits. Furthermore there are straightforward methods that allow the coded text to be decoded. Unfortunately, these are a little bit too complicated to do justice to in full on this module but are discussed later in the programme on COMP309.
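One well-known such method is Huffman coding (named in the summary at the end of this part). The following is only a minimal sketch, applied to the single-symbol distribution from the earlier example rather than to a full L-extension; it is not the treatment given in COMP309.

import heapq

def huffman_code(probs):
    """Build a binary prefix code for {symbol: probability} by Huffman's greedy merging."""
    # Each heap entry: (total probability, tie-breaker, {symbol: partial codeword})
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    if len(heap) == 1:                      # degenerate one-symbol alphabet
        return {s: "0" for s in heap[0][2]}
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)     # the two least likely groups...
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}   # ...are merged, prepending a bit
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

print(huffman_code({"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}))
# A prefix code with average length 1.75 bits/symbol, eg A=0, B=10, C=110, D=111
# (the exact 0/1 labelling depends on tie-breaking)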


The next stage - Channel behaviour

The preceding slides have considered one aspect of Information Theory: the elements concerning what “happens” when the sender of a message “prepares” the message for transmission. That is to say, the processes of transforming symbols into binary, how much compression is possible without information being lost, etc. Although it is over-simplifying, the mechanisms involved when a message is received mirror the actions involved in sending: in essence decoding the binary sequence to recover the text.

In the final section of this Part we look at what affects the “middle” of this activity: messages are sent through a “channel” (wire, ether, frequency etc).

Q1 How can the sender be “sure” the message has arrived “uncorrupted”?

Q2 How can the receiver be confident that the message received is what was intended to be sent?


The effect of “noisy” channels

Some examples of garbled messages

1. “Send three-and-fourpence we’re going to a dance”
Field-telephone message as relayed to British army HQ (WW1).

2. Coach-hire company: “How many people do you need buses for?”
Reply: “aboot six tae seven”
Need to read this with the right accent(!)

The “real” messages

1. “Send reinforcements we’re going to advance”

2. “about six to seven” (ie 6–7 not 67)

Interference and noise affecting messages & communication are common problems: sources may be over-loud mobile phone usage in confined spaces (eg “I’m on the TRAIN!”, “I’m in the LIFT!” etc); natural phenomena (eg electrical storms); or a side-effect of electronic appliances (eg washing machines, vacuum cleaners).


An abstract model of noisy channel behaviour

[Diagram: SOURCE emits U (m bits); enc(U) = X (n bits) enters the “noisy” channel; Y (n bits) emerges; dec(Y) = Z (m bits) is delivered to the DESTINATION.]

Description

After encoding a message U with m bits, some binary sequence (enc(U) = X) with n bits is passed to the “channel”. If “noise” is present what will emerge is some binary sequence (Y, also n bits) which may differ from X. The word Y is decoded (dec(Y) = Z) and Z (m bits long) is delivered. We wish to “minimise” the potential for errors by encoding (in n bits) messages that use m bits. Ideally, n should not be much larger than m: we don’t want to “pad out” the message with “too much” “redundancy”.


Noise as the “probability of errors”

The abstraction on the previous slide, in effect, describes the action of the channel as a function mapping encodings (X) of input binary words (U) to transmitted words (Y) that decode to final texts (Z). Given the nature of “noise” we cannot predict with certainty what will appear as Y when X is given as input. (eg we cannot tell exactly how long it will take for the person braying into their mobile phone to scream “I’m on the TRAIN”.)


Binary Channels

We focus on a particular class of “noisy” channel where the source symbols U are drawn from {0, 1}. A sequence of symbols, U, is translated as X = enc(U) for transmission through a channel which outputs Y. This is then used to give Z = dec(Y) as the message received. Suppose now we have values p (0 ≤ p ≤ 1) and q (0 ≤ q < 0.5) for which:

U. P[U = 0] = p; P[U = 1] = 1 − p. That is, each input bit is a 0 with probability p and a 1 with probability 1 − p.

Y. Given an input bit X, the channel leaves this unchanged with probability 1 − q and “flips” (ie negates) this bit with probability q.
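A minimal sketch of this channel behaviour (the function name and the use of Python’s random module are my own choices):

import random

def binary_symmetric_channel(bits, q):
    """Pass each bit through unchanged with probability 1-q; flip it with probability q."""
    return [b ^ 1 if random.random() < q else b for b in bits]

# Example: send twenty 0-bits through a channel with flip probability q = 0.2
print(binary_symmetric_channel([0] * 20, 0.2))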


Problems when sending m bits

With the scenario of the previous slide, if we send U without making any changes (ie enc(U) = U) basic probability shows us that ∼ qm bits will be corrupted by noise in the channel. If q is “large” (eg very close to 0.5) the message received (Z) is unlikely to resemble the message sent (U). Thus P_e, the probability of error, ie that the decoded message Z differs from U, will be “quite high”.

Avoiding the problem: building redundancy

Now consider an alternative:

E1. Given U, X = enc(U) just “repeats” each bit of U, say 3 times, eg if U = 0, enc(U) = 000; if U = 011, enc(U) = 000111111.

D1. Z is found by dec(Y), where dec(Y) reports the ith bit of Z (0 ≤ i < m) as that occurring in the majority of the positions y_{3i} y_{3i+1} y_{3i+2}, eg if X = enc(U) = 000111111 and Y = 010101110 then Z = dec(Y) = 011. (A small sketch of this scheme follows below.)
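A self-contained sketch of this 3-repetition scheme (enc, dec and the channel simulation here are illustrative, not code from the slides):

import random

def enc(u):
    """Repeat each bit of the message three times."""
    return [b for b in u for _ in range(3)]

def dec(y):
    """Majority vote over each block of three received bits."""
    return [1 if sum(y[3 * i : 3 * i + 3]) >= 2 else 0 for i in range(len(y) // 3)]

def channel(x, q):
    """Flip each transmitted bit independently with probability q."""
    return [b ^ 1 if random.random() < q else b for b in x]

u = [0, 1, 1]
x = enc(u)                  # [0,0,0, 1,1,1, 1,1,1], matching the slide's 000111111
z = dec(channel(x, q=0.1))
print(x, z)                 # with q = 0.1, z equals u most of the time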


Error probabilities and “repetition coding”

In the approach described we reduce the chance of a transmitted text being corrupted by repeating its content and applying a “majority” vote scheme to the result. Before (making no change to U), with “one-bit-at-a-time” transmission through the channel, we have P_e = q. When we repeat each bit 3 times,

$$P_e = P[\text{at least 2 of the 3 bits are corrupted}] = 3(1-q)q^2 + q^3 < q$$

We reduce the error probability BUT at the cost of reducing the bit rate:
BEFORE: bit rate with no change to U: n/n = 1; P_e = q;
AFTER: bit rate with 3-repeats: n/3n = 1/3; P_e < q.

The transmission process depends on the encode-decode convention. Using e for enc() and d for dec(), the pair (e, d) is called a scheme. The measure of “transmission speed” is called the scheme rate, denoted R, and describes the number of message bits sent each time the channel is used. Messages with m bits are “padded” to n bits: the rate of a scheme is m/n.
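For a concrete (illustrative) value: with q = 0.1 the per-bit error probability falls from 0.1 to

$$P_e = 3(1-q)q^2 + q^3 = 3(0.9)(0.01) + 0.001 = 0.028$$

at the cost of sending three channel bits for every message bit (rate 1/3).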


Channel Capacity

For the basic “noisy” channel model just described, and the approach (by repeating bits) used to reduce errors, a natural question now arises: can a positive bit rate be achieved with error probability approaching 0?

Another key result of Shannon’s

There is a value C > 0 (called the “channel capacity”) for which any bit rate R < C is “achievable”.

What is meant by “achievable”?

A bit rate, R, is achievable if we can find a sequence of schemes {(e_n, d_n)} in which e_n encodes m-bit messages using n bits with m/n > R, and P_e^(n) (the probability of errors in the n’th scheme) approaches 0.


Channel Capacity & Shannon’s Channel Coding Theorem I

Mutual Information

The key concept underpinning the question of whether a given “transmission rate” is “achievable” is that of “mutual information”: I(X;Y). We will avoid a rigorous formal development of this concept; however, I(X;Y) is given by

$$I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)$$

Here H(Z) is the entropy of the source Z and H(V|W) the entropy of the conditional distribution V|W. This quantity depends not only on the probability of symbols being corrupted within the channel itself but also on the distribution governing the appearance of X as an input, ie the P_X from earlier.


Channel Capacity & Shannon’s Channel Coding Theorem II

Channel Capacity

The capacity, C, of a channel is the maximum of the mutual information taken over the possible input source distributions P_X:

$$C = \max_{P_X} I(X;Y)$$
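This maximum is not easy to evaluate in general, but for the binary channel of the earlier slides (each bit flipped independently with probability q) it is a standard result, not derived in these slides, that the maximising input distribution is uniform and C = 1 − H_b(q), where H_b is the binary entropy function. A small sketch (the function names are mine):

from math import log2

def binary_entropy(q: float) -> float:
    """H_b(q) = -q log2 q - (1-q) log2 (1-q), with H_b(0) = H_b(1) = 0."""
    if q in (0.0, 1.0):
        return 0.0
    return -q * log2(q) - (1 - q) * log2(1 - q)

def bsc_capacity(q: float) -> float:
    """Capacity, in bits per channel use, of a binary symmetric channel with flip probability q."""
    return 1.0 - binary_entropy(q)

print(bsc_capacity(0.1))  # about 0.531: rates below roughly 0.53 bits per use are achievable
print(bsc_capacity(0.5))  # 0.0: a channel that flips half its bits carries no information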

Shannon’s Channel Coding Theorem

If C is the channel capacity

CT1. Whenever R < C the rate R is achievable for C .

CT2. Whenever R > C the rate R is not achievable for C .


Information Theory – Summary

1. The material in this part offers only an extremely introductory discussion of the key concerns of Information Theory:

a. Methods for “encoding” symbols as binary sequences so as to realise properties such as “lossless data compression” (Source Coding).

b. Methods for modelling noise and techniques for reducing its effect (Channel Coding).

2. Schemes in 1(a) try to reduce the number of bits sent in a message. Schemes in 1(b) try to minimise the number of bits added in order to send a message reliably.

3. Coding Schemes (of both types) are an important (if rather more advanced) area of CS. Some of these (eg Huffman Codes, Lempel-Ziv Codes) feature later in the programme on COMP309.
