
Introduction to information theory and coding

Louis WEHENKEL

University of Liège - Department of Electrical and Computer Engineering, Institut Montefiore

• Course organization

• Course objectives

• Introduction to probabilistic reasoning

• Algebra of information measures

• Some exercises

IT 2012, slide 1


Course material :

• These slides : the slides tend to be self-explanatory; where necessary I have added some notes. The slides will be available from the WEB : “http://www.montefiore.ulg.ac.be/~lwh/Cours/Info/”

• Your personal notes

• Detailed course notes (in French; Centrale des Cours).

• For further reading, some reference books in English :

– J. Adamek, Foundations of coding, Wiley Interscience, 1991.

– T. M. Cover and J. A. Thomas, Elements of information theory, Wiley, 1991.

– R. Frey, Graphical models for machine learning and information theory, MIT Press, 1999.

– D. Hankerson, G. A. Harris, and P. D. Johnson Jr, Introduction to information theory and data compression, CRC Press, 1997.

– D. J. C. MacKay, Information theory, inference, and learning algorithms, Cambridge University Press, 2003.

– D. Welsh, Codes and cryptography, Oxford Science Publications, 1998.

IT 2012, note 1


Course objectives

1. Introduce information theory

• Probabilistic (stochastic) systems
• Reasoning under uncertainty
• Quantifying information
• State and discuss coding theorems

2. Give an overview of coding theory and practice

• Data compression
• Error-control coding
• Automatic learning and data mining

3. Illustrate ideas with a large range of practical applications

NB. It is not certain that everything in the slides will be covered during the oral course. You should read the slides and notes (especially those which I will skip) after each course and before the next course.

IT 2012, slide 2


The course aims at introducing information theory and the practical aspects of data compression and error-control coding. The theoretical concepts are illustrated using practical examples related to the effective storage and transmission of digital and analog data. Recent developments in the field of channel coding are also discussed (Turbo-codes).

More broadly, the goal of the course is to introduce the basic techniques for reasoning under uncertainty as well as the computational and graphical tools which are broadly used in this area. In particular, Bayesian networks and decision trees will be introduced, as well as elements of automatic learning and data mining.

The theoretical course is complemented by a series of computer laboratories, in which the students can simulate data sources and data transmission channels, and use various software tools for data compression, error correction, probabilistic reasoning and data mining.

The course is addressed to engineering students (last year), who have some background in computer science, general mathematics and elementary probability theory.

The following two slides aim at clarifying the distinction between deterministic and stochastic systems.

IT 2012, note 2


Classical system (deterministic view)

[Block diagram] Observable inputs x and hidden inputs z (perturbations) enter the system; observable outputs y come out. System : relation between inputs and outputs : y = f(x, z).

IT 2012, slide 3


Classical system theory views a system essentially as a function (or a mapping, in the mathematical sense) between inputs and outputs.

If the system is static, inputs and outputs are scalars (or vectors of scalars). If the system is dynamic, inputs and outputs are temporal signals (continuous or discrete time); a dynamic system is thus viewed as a mapping between input signals and output signals.

In classical system theory the issue of unobserved inputs and modeling imperfection is handled through stability, sensitivity and robustness theories. In this context uncertainty is essentially modeled by subsets of possible perturbations.

IT 2012, note 3


Stochastic system (probabilistic view)

[Block diagram] Observable inputs (distribution P(x)) and hidden inputs (perturbations, distribution P(z)) enter the system; observable outputs follow P(y). System : probabilistic relation between inputs and outputs : P(y|x, z).

IT 2012, slide 4


Here we use probability theory as a tool (a kind of calculus) in order to model and quantify uncertainty. Note that there are other possible choices (e.g. fuzzy set theory, evidence theory...) to model uncertainty, but probability theory is the most mature and most widely accepted approach. Still, there are philosophical arguments and controversies around the interpretation of probability in real life : e.g. the classical (objective) notion of probability vs the Bayesian (subjective) notion of probability.

The theory of stochastic systems is more complex and more general than deterministic system theory. Nevertheless, the present trend in many fields is to use probability theory and statistics more systematically in order to build and use stochastic system models of reality. This is due to the fact that in many real-life systems, uncertainty plays a major role. Within this context, there are two scientific disciplines which become of growing importance for engineers :

1. Data mining and machine learning : how to build models for stochastic systems from observations.

2. Information theory and probabilistic inference : how to use stochastic systems in an optimal manner.

Examples of applications of stochastic system theory :

• Complex systems, where detailed models are intractable (biology, sociology, computer networks...)

• How to take decisions involving an uncertain future (justification of investments, portfolio management) ?

• How to take decisions under partial/imperfect knowledge (medical diagnosis) ?

• Forecasting the future... (weather, ecosystems, stock exchange...)

• Modeling human behavior (economics, telecommunications, computer networks, road traffic...)

• Efficient storage and transmission of digital (and analog) data

IT 2012, note 4


Information and coding theory will be the main focus of the course

1. What is it all about ?

• 2 complementary aspects
⇒ Information theory : general theoretical basis
⇒ Coding : compress, fight against noise, encrypt data

• Information theory
⇒ Notions of data source and data transmission channel
⇒ Quantitative measures of information content (or uncertainty)
⇒ Properties : 2 theorems (Shannon theorems) about feasibility limits
⇒ Discrete vs continuous signals

• Applications to coding
⇒ How to reach feasibility limits
⇒ Practical implementation aspects (gzip, Turbo-codes...)

IT 2012, slide 5


Relations between information theory and other disciplines

[Diagram] Information theory at the center, connected to : theoretical computer science (Kolmogorov complexity), physics (thermodynamics), telecommunications (ultimate limits of communication), mathematics (inequalities), probability theory (limit theorems, rare events), statistics (hypothesis tests, Fisher information), economics (game theory, portfolio theory).

IT 2012, slide 6


Shannon paradigm

[Diagram] SOURCE (man, computer...) → message → CHANNEL (optic fibre, magnetic tape, acoustic medium...) → RECEIVER (computer, man...), the channel being subject to perturbations (thermal noise, acoustic noise, read or write errors).

Message :

- sequence of symbols, analog signal (sound, image, smell...)

- messages are chosen at random

- channel perturbations are random

IT 2012, slide 7


The foundations of information theory were laid down by Claude E. Shannon, shortly after the end of the Second World War, in a seminal paper entitled A mathematical theory of communication (1948). In this paper, all the main theoretical ingredients of modern information theory were already present. In particular, as we will see later, Shannon formulated and provided proofs of the two main coding theorems.

Shannon's theory of communication is based on the so-called Shannon paradigm, illustrated on this slide : a data source produces a message which is sent to a receiver through an imperfect communication channel. The possible source messages can generally be modeled by a sequence of symbols, chosen in some way by the source which appears unpredictable to the receiver. In other words, before the message has been sent, the receiver has some uncertainty about what the next message will be. It is precisely the existence of this uncertainty which makes communication necessary (or useful) : after the message has been received, the corresponding uncertainty has been removed. We will see later that information will be measured by this reduction in uncertainty.

Most real-life physical channels are imperfect due to the existence of some form of noise. This means that the message sent out will arrive at the receiver in a corrupted version (some received symbols are different from those emitted), and again the corruption is unpredictable for the receiver and for the source of the message.

The two main questions posed by Shannon in his early paper are as follows :

• Suppose the channel is perfect (no corruption), and suppose we have a probabilistic description (model) of the source; what is the maximum rate of communication (source symbols per channel usage), provided that we use an appropriate source code ? This problem is presently termed the source coding problem or the reversible data compression problem. We will see that the answer to this question is given by the entropy of the source.

• Suppose now that the channel is noisy; what is then the maximum rate of error-free communication between any source and receiver using this channel ? This is the so-called channel coding problem or error-correction coding problem; we will see that the answer here is the capacity of the channel, which is the upper bound of the mutual information between input and output messages.

IT 2012, note 7


Use of source and channel coding

[Diagram] Source → CS (source coder) → CC (channel coder) → Channel → DC (channel decoder) → DS (source decoder) → Receiver. The source coder/decoder pair defines a normalized source and a normalized receiver; the channel coder, the channel and the channel decoder together form a normalized channel.

Source coding : remove redundancy (make the message as short as possible)

Channel coding : make data transmission reliable (fight against noise)

Nota bene :

1. Source redundancy may be useful to fight against noise, but is not necessarily adapted to the channel characteristics.

2. Once redundancy has been removed from the source, all sources have the same behavior (completely unpredictable behavior).

3. Channel coding : fight against channel noise without spoiling resources.

4. Coding includes conversion of alphabets.

IT 2012, slide 8


The two main results of information theory are thus the characterization of upper bounds in terms of data compression on the one hand, and error-less communication on the other.

A further result of practical importance is that (in most, but not all situations) source and channel coding problems can be decoupled. In other words, data compression algorithms can be designed independently from the type of data communication channel that will be used to transmit (or store) the data. Conversely, channel coding can be carried out irrespective of the type of data sources that will be used to transmit information over the channel. This result has led to the partition of coding theory into its two main subparts.

Source coding aims at removing redundancy in the source messages, so as to make them appear shorter and purely random. On the other hand, channel coding aims at introducing redundancy into the message, so as to make it possible to decode the message in spite of the uncertainty introduced by the channel noise.

Because of the decomposition property, these problems are generally solved separately. However, there are examples of situations where the decomposition breaks down (like some multi-user channels) and also situations where from the engineering point of view it is much easier to solve the two problems simultaneously than separately. This latter situation appears when the source redundancy is particularly well adapted to the channel noise (e.g. spoken natural language redundancy is adapted to acoustic noise).

Examples of digital sources are : scanned images, computer files of natural language text, computer programs, binary executable files... From your experience, you already know that compression rates of such different sources may be quite different.

Examples of channels are : AM or FM modulated radio channels, Ethernet cable, magnetic storage (tape or hard disk), computer RAM, CD-ROM... Again, the characteristics of these channels may be quite different and we will see that different coding techniques are also required.

IT 2012, note 8


Quantitative notion of information content

Information provided by a message : vague but widely used term

Aspects :

- unpredictable character of a message

- interpretation : truth, value, beauty...

Interpretation depends on the context, on the observer : too complex and probably not appropriate as a measure...

The unpredictable character may be measured ⇒ quantitative notion of information

⇒ A message carries more information if it is more unpredictable

⇒ Information quantity : decreasing function of the probability of occurrence

Nota bene. Probability theory (not statistics) provides the main mathematical tool of information theory.

Notations : (Ω, E, P(·)) = probability space

IT 2012, slide 9


Information and coding.

Think of a simple coin flipping experiment (the coin is fair). How much information is gained when you learn (i) the state of a flipped coin; (ii) the states of two flipped coins; (iii) the outcome when a four-sided die is rolled ? How much memory do you need to store this information on a binary computer ?

Consider now the double coin flipping experiment, where the two coins are thrown together and are indistinguishable once they have been thrown. Both coins are fair. What are the possible outcomes of this experiment ? What are the probabilities of these outcomes ? How much information is gained when you observe any one of these outcomes ? How much is gained on average per experiment (supposing that you repeat it indefinitely) ? Supposing that you have to communicate the result to a friend through a binary channel, how could you code the outcome so that, on average, you minimize channel use ?
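A quick way to check your answers is to enumerate the experiment. The sketch below is not part of the original notes; the outcome labels and the code words are illustrative choices only.

```python
import math

# Double coin flip with indistinguishable fair coins:
# outcomes are "two heads", "two tails", "one of each".
outcomes = {"HH": 0.25, "TT": 0.25, "mixed": 0.50}

# Self-information h = -log2 P and average information per experiment.
h = {o: -math.log2(p) for o, p in outcomes.items()}
H = sum(p * h[o] for o, p in outcomes.items())
print(h)        # {'HH': 2.0, 'TT': 2.0, 'mixed': 1.0}
print(H)        # 1.5 Shannon per experiment

# One possible binary code whose word lengths match the self-informations.
code = {"mixed": "0", "HH": "10", "TT": "11"}
avg_len = sum(outcomes[o] * len(w) for o, w in code.items())
print(avg_len)  # 1.5 bits per experiment on average
```

The average code length equals the average information of 1.5 Shannon per experiment, which anticipates the source coding discussion later in the course.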

IT 2012, note 9


Probability theory : definitions and notations

Probability space : triplet (Ω, E, P(·))

Ω : universe of all possible outcomes of the random experiment (sample space)

ω, ω′, ωi... : elements of Ω (outcomes)

E : denotes a set of subsets of Ω (called the event space)

A, B, C... : elements of E, i.e. events.

The event space models all possible observations.

An event corresponds to a logical statement about ω

Elementary events : singletons {ωi}

P(·) : probability measure (distribution)

P(·) : for each A ∈ E provides a number ∈ [0, 1].

IT 2012, slide 10


Probability space : requirements

Event space E : must satisfy the following properties

- Ω ∈ E

- A ∈ E ⇒ ¬A ∈ E

- ∀A1, A2, ... ∈ E (finite or countable number) : ⋃_i Ai ∈ E

⇒ one says that E is a σ-algebra

Measure P(·) : must satisfy the following properties (Kolmogorov axioms)

- P(A) ∈ [0, 1], ∀A ∈ E

- P(Ω) = 1

- If A1, A2, ... ∈ E and Ai ∩ Aj = ∅ for i ≠ j : P(⋃_i Ai) = ∑_i P(Ai)

⇒ one says that P(·) is a probability measure

IT 2012, slide 11


Finite sample spaces

We will restrict ourselves to finite sample spaces : Ω is finite

We will use the maximal σ-algebra : E = 2^Ω, which contains all subsets of Ω.

Conditional probability

Definition : conditional probability measure : P(A|B) = P(A ∩ B) / P(B)

Careful : P(A|B) is defined only if P(B) > 0.

Check that it is indeed a probability measure on (Ω, E) (Kolmogorov axioms)

Note that : P(A|B|C) = P(A|C|B) ⇒ Notation P(A|B, C) = P(A|B ∩ C).

Independent events (A ⊥ B)

Definition : A ⊥ B if P(A|B) = P(A)

If P(A) and P(B) are positive : A ⊥ B ⇔ B ⊥ A.
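These definitions can be played with directly on a finite sample space. The following sketch is not from the course material; the two-fair-coin sample space is just an illustrative assumption, with events represented as Python sets.

```python
from fractions import Fraction

# Finite probability space: Omega = outcomes of two fair coin flips,
# maximal sigma-algebra, uniform measure.
Omega = {"HH", "HT", "TH", "TT"}

def P(A):
    """Probability of an event A, a subset of Omega, under the uniform measure."""
    return Fraction(len(A), len(Omega))

def P_cond(A, B):
    """Conditional probability P(A|B), defined only if P(B) > 0."""
    assert P(B) > 0
    return P(A & B) / P(B)

A = {"HH", "HT"}          # first coin falls on heads
B = {"HH", "TH"}          # second coin falls on heads
print(P_cond(A, B))       # 1/2, equal to P(A)
print(P_cond(A, B) == P(A) and P_cond(B, A) == P(B))  # True: A and B are independent
```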

IT 2012, slide 12


Random variables

From a physical viewpoint, a random variable is an elementary way of perceiving (observing) outcomes.

From a mathematical viewpoint, a random variable is a function defined on Ω (the values of this function may be observed).

Because we restrict ourselves to finite sample spaces, all the random variables are necessarily discrete and finite : they have a finite number of possible values.

Let us denote by X(·) a function defined on Ω and by X = {X1, ..., Xn} its range (set of possible values).

We will not distinguish between

a value (say Xi) of X(·) and the subset {ω ∈ Ω | X(ω) = Xi}

Thus a random variable can also be viewed as a partition {X1, ..., Xn} of Ω.

Theoretical requirement : Xi ∈ E (always true if Ω is finite and E maximal).

IT 2012, slide 13


The random variable X(·) induces a probability measure on X = {X1, ..., Xn} :

P_X(X = Xi) = P_Ω(Xi), which we will simply denote by P(Xi).

We will denote by P(X) the measure P_X(·).

[Diagram] The function X(·) maps the outcomes in Ω onto the values X1, X2, X3 of X; the measure P_X on X is induced from the measure on Ω.

A random variable provides condensed information about the experiment.

IT 2012, slide 14


Some more notations...

X and Y two discrete r.v. on (Ω, E, P(·)).

Notation : X = {X1, ..., Xn} and Y = {Y1, ..., Ym} (n and m finite).

Xi (resp. Yj) value of X (resp. Y) ≡ subset of Ω.

⇒ We identify a r.v. with the partition it induces on Ω.

Contingency table : rows indexed by X1, ..., Xn, columns by Y1, ..., Ym; the cell (i, j) contains the joint probability pi,j, the right margin contains the row sums pi,· and the bottom margin the column sums p·,j, with

pi,· ≡ P(Xi) ≡ P(X = Xi)

p·,j ≡ P(Yj) ≡ P(Y = Yj)

pi,j ≡ P(Xi ∩ Yj) ≡ P(Xi, Yj) ≡ P([X = Xi] ∧ [Y = Yj])

IT 2012, slide 15


Complete system of events

Reminder : event ≡ subset of Ω

(E ≡ set of all events ⇒ σ-algebra)

Definition : A1, ..., An form a complete system of events if
- ∀i ≠ j : Ai ∩ Aj = ∅ (they are pairwise incompatible) and if
- ⋃_{i=1}^n Ai = Ω (they cover Ω).

Conclusion : a discrete r.v. ≡ complete system of events.

Remark : we suppose that Ai ≠ ∅

But, this does not imply that P(Ai) > 0 !!!

Some authors give a slightly different definition, where the second condition is replaced by : P(⋃_{i=1}^n Ai) = 1.

If then ⋃_{i=1}^n Ai ≠ Ω, one may complete such a system by adjoining one more event An+1 (non-empty but of zero probability).

IT 2012, slide 16


Calculus of random variables

On a given Ω we may define an arbitrary number of r.v. In reality, random variables are the only practical way to observe outcomes of a random experiment.

Thus, a random experiment is often defined by the properties of a collection of random variables.

Composition of r.v. : if X(·) is a r.v. defined on Ω and Y(·) is a random variable defined on X, then Y(X(·)) is also a r.v. defined on Ω.

Concatenation Z of X = {X1, ..., Xn} and Y = {Y1, ..., Ym} defined on Ω : Z = (X, Y) is defined on Ω by Z(ω) = (X(ω), Y(ω)) ⇒ P(Z) = P(X, Y).

Independence of X = {X1, ..., Xn} and Y = {Y1, ..., Ym}

- Iff ∀i ≤ n, j ≤ m : Xi ⊥ Yj.

- Equivalent to factorisation of the probability measure : P(X, Y) = P(X)P(Y)

- Otherwise P(X, Y) = P(X)P(Y|X) = P(Y)P(X|Y)

IT 2012, slide 17


Example 1 : coin flipping

Experiment : throwing two coins at the same time.

Random variables :

- H1 ∈ {T, F} true if first coin falls on heads
- H2 ∈ {T, F} true if second coin falls on heads
- S ∈ {T, F} true if both coins fall on the same face

Thus S = (H1 ∧ H2) ∨ (¬H1 ∧ ¬H2).

Coins are independent : H1 ⊥ H2, and P(S, H1, H2) = P(S|H1, H2)P(H1)P(H2).

Graphically : H1 → S ← H2 (a first example of a Bayesian network)

Suppose the coins are both fair, and compute P(S).
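A minimal enumeration (a sketch, not part of the slides) confirming the requested value of P(S) for two fair coins:

```python
from itertools import product

# The four equally likely joint outcomes of (H1, H2); True = heads.
outcomes = list(product([True, False], repeat=2))

# S is true when both coins fall on the same face.
p_S = sum(1 for h1, h2 in outcomes if h1 == h2) / len(outcomes)
print(p_S)   # 0.5
```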

IT 2012, slide 18


This is a very simple (and classical) example of a random experiment.

The first structural information we have is that the two coins behave independently (this is a very realistic, but not perfectly true assumption). The second structural information we have is that the third random variable is a (deterministic) function of the other two; in other words its value is a causal consequence of the values of the first two random variables.

Using only the structural information one can depict graphically the relationship between the three variables, as shown on the slide. We will see the precise definition of a Bayesian belief network tomorrow, but this is actually a very simple example of this very rich concept. Note that the absence of any link between the two first variables indicates independence graphically.

To yield a full probabilistic description, we need to specify the following three probability measures : P(H1), P(H2) and P(S|H1, H2), i.e. essentially 2 real numbers (since P(S|H1, H2) is already given by the functional description of S).

If the coins are identical, then we have to specify only one number, e.g. P(H1 = T).

If the coins are fair, then we know everything, i.e. P(H1 = T) = 0.5.

Show that if the coins are fair, we have H1 ⊥ S ⊥ H2. Still, we don’t have H1 ⊥ H2|S. Explain intuitively.

In general (i.e. for unfair coins) however, we don’t have H1 ⊥ S. For example, suppose that both coins are biased towards heads.

You can use the “javabayes” application on the computer to play around with this example. More on this later...

IT 2012, note 18


Example 2 : binary source with independent symbols ωi ∈ {0, 1}

P(1) : probability that the next symbol is 1.

P(0) = 1 − P(1) : probability that the next symbol is 0.

Let us denote by h(ω) the information provided by one symbol ω.

Then : - h(ω) = f(1/P(ω)) where f(·) is increasing and
- lim_{x→1} f(x) = 0 (zero information if the event is certain)

On the other hand (symbols are independent) :

For two successive symbols ω1, ω2 we should have h(ω1, ω2) = h(ω1) + h(ω2).

But : h(ω1, ω2) = f(1/P(ω1, ω2)) = f(1/(P(ω1)P(ω2)))

⇒ f(xy) = f(x) + f(y) ⇒ f(·) ∝ log(·)

Definition : the self-information provided by the observation of an event A ∈ E is given by : h(A) = − log2 P(A) Shannon

Note : h(A) ≥ 0. When P(A) → 0 : h(A) → +∞.
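The definition translates into a one-line helper; the sketch below is illustrative and not from the notes, and the evaluated probabilities simply revisit the coin and die questions of note 9.

```python
import math

def self_information(p):
    """Self-information h(A) = -log2 P(A), in Shannon."""
    if p == 0:
        return float("inf")   # h(A) -> +infinity when P(A) -> 0
    return -math.log2(p)

print(self_information(1/2))        # 1.0 Shannon : state of one fair coin
print(self_information(1/4))        # 2.0 Shannon : two fair coins, or one fair four-sided die
print(self_information(1/1000000))  # about 19.9 Shannon : a very unlikely event
```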

IT 2012, slide 19


Comments.

The theory which will be developed does not really depend on the base used to compute logarithms. Base 2 will be the default, and fits well with binary codes as we will see.

You should convince yourself that the definition of the self-information of an event fits with the intuitive requirements of an information measure.

It is possible to show from some explicit hypotheses that there is no other possible choice for the definition of a measure of information (remember the way the notion of thermodynamic entropy is justified).

Nevertheless, some alternative measures of information have been proposed, based on relaxing some of the requirements and imposing some others.

To be really convinced that this measure is the right one, it is necessary to wait for the subsequent lectures, so as to see what kind of implications this definition has.

Example : questionnaire, weighing strategies.

IT 2012, note 19


Conditional information

Let us consider an event C = A ∩ B :

We have : h(C) = h(A ∩ B) = − log P(A ∩ B) = − log P(A)P(B|A) = − log P(A) − log P(B|A)

One defines the conditional self-information of the event B given that (or supposing that) the event A is true : h(B|A) = − log P(B|A)

Thus, once we know that ω ∈ A, the information provided by the observation that ω ∈ B becomes − log P(B|A).

Note that : h(B|A) ≥ 0

One can write : h(A ∩ B) = h(A) + h(B|A) = h(B) + h(A|B)

In particular : A ⊥ B : h(A ∩ B) = h(A) + h(B)

Thus : h(A ∩ B) ≥ max{h(A), h(B)} ⇒ monotonicity of self-information

IT 2012, slide 20


Illustration : transmission of information

- Ω = Ωi × Ωo : all possible input/output pairs of a channel-source combination.

- A : denotes the observation of an input message; B an output message.

- Linked by the transition probability P(B|A) (stochastic channel model).

- P(A) : what a receiver can guess about the sent message before it is sent (knowing only the model of the source).

- P(A|B) : what a receiver can guess about the sent message after communication has happened and the output message B has been received.

- P(B|A) represents what we can predict about the output, once we know which message will be sent.

- Channel without noise (or deterministic) : P(B|A) associates one single possible output to each input.

- For example if inputs and outputs are binary : P(ωi|ωo) = δi,o

IT 2012, slide 21


Mutual information of two events

Definition : i(A; B) = h(A)− h(A|B).

Thus : i(A; B) = log [P(A|B)/P(A)] = log [P(A ∩ B)/(P(A)P(B))] = i(B; A)

⇒ mutual information is by definition symmetric.

Discussion :

h(A|B) may be larger or smaller than h(A)

⇒ mutual information may be positive, zero or negative.

It is equal to zero iff the events are independent.

Particular cases :

If A ⊃ B then P(A|B) = 1 ⇒ h(A|B) = 0 ⇒ i(A; B) = h(A).

If A ⊃ B then h(B) ≥ h(A).

But the converse is not true...
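A small numeric illustration (a sketch; the joint probabilities below are assumed values chosen only for the example) showing that i(A;B) can indeed be positive, zero or negative:

```python
import math

def i(p_a, p_b, p_ab):
    """Mutual information i(A;B) = log2 [P(A ∩ B) / (P(A)P(B))]."""
    return math.log2(p_ab / (p_a * p_b))

# P(A) = P(B) = 1/2 in all three cases; only the overlap P(A ∩ B) changes.
print(i(0.5, 0.5, 0.375))   # about +0.585 : h(A|B) < h(A)
print(i(0.5, 0.5, 0.25))    # 0.0          : independent events
print(i(0.5, 0.5, 0.125))   # -1.0         : h(A|B) > h(A)
```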

IT 2012, slide 22


Exercise.

Let us consider the coin throwing experiment.

Suppose that the two coins are identical (not necessarily fair), and say that p ∈ [0, 1] denotes the probability of getting heads for either coin.

Compute¹ the following quantities under the two assumptions p = 1/2 (fair coins) and p = 1.0 (totally biased coins).

h(H1 = T), h(H1 = F), h(H2 = T), h(H2 = F).

h([H1 = T] ∧ [H2 = T])

h([H1 = T] ∧ [H1 = F])

h([S = T]), h([S = F])

h([H1 = T] | [H2 = T])

h([H1 = T] | [S = T])

h([H1 = T] | [S = T], [H2 = T])

¹ Use base 2 for the logarithms.

IT 2012, note 22


Entropy of a memoryless time-invariant source

Source : at successive times t ∈ {t0, t1, ...} sends symbols s(t) chosen from a finite alphabet S = {s1, ..., sn}.

Assumption : successive symbols are independent and chosen according to the same probability measure (i.e. independent of t) ⇒ memoryless and time-invariant

Notation : pi = P(s(t) = si)

Definition : Source entropy : H(S) = E{h(s)} = − ∑_{i=1}^n pi log pi

Entropy : measures the average information provided by the symbols sent by the source : Shannon/symbol

If F denotes the frequency of operation of the source, then F · H(S) measures average information per time unit : Shannon/second.

Note that, because of the law of large numbers, the per-symbol information provided by any long message produced by the source converges (almost surely) towards H(S).
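The definition of H(S) is easy to code. The sketch below is not part of the course; the example probability vectors and the symbol frequency F are illustrative assumptions.

```python
import math

def source_entropy(probs):
    """H(S) = sum_i p_i log2 (1/p_i), in Shannon/symbol (0 log 0 = 0 by convention)."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

print(source_entropy([0.5, 0.5]))          # 1.0 : fair binary source
print(source_entropy([1.0]))               # 0.0 : constant source
print(source_entropy([0.25, 0.25, 0.5]))   # 1.5 : double coin flip of note 9

# Shannon per second, assuming for illustration a rate F of 1000 symbols/s.
F = 1000
print(F * source_entropy([0.5, 0.5]))      # 1000.0
```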

IT 2012, slide 23


Examples.

Compute the entropy (per symbol) of the following sources :

A source which always emits the same symbol.

A source which emits zeroes and ones according to two fair coin flipping processes.

How can you simulate a fair coin flipping process with a coin which is not necessarily fair ?

IT 2012, note 23


Generalization : entropy of a random variable

Let X be a discrete r.v. : X defines the partition {X1, ..., Xn} of Ω.

Entropy of X : H(X) = − ∑_{i=1}^n P(Xi) log P(Xi)

What if some probabilities are zero : lim_{x→0} x log x = 0 : the terms vanish by continuity.

Note : H(X) only depends on the values P(Xi)

Particular case : n = 2 (binary source, an event and its negation)

H(X) = −p log p − (1 − p) log(1 − p) = H2(p), where p denotes the probability of (any) one of the two values of X.

Properties of H2(p) :

H2(p) = H2(1 − p)

H2(0) = H2(1) = 0

H2(0.5) = 1 and H2(p) ≤ 1
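A short numeric check of these properties of H2(p) (an illustrative sketch, not part of the slides):

```python
import math

def H2(p):
    """Binary entropy function H2(p) = -p log2 p - (1-p) log2 (1-p)."""
    if p in (0.0, 1.0):
        return 0.0   # limit value, using 0 log 0 = 0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(H2(0.5))                                       # 1.0 (the maximum)
print(H2(0.0), H2(1.0))                              # 0.0 0.0
print(abs(H2(0.3) - H2(0.7)) < 1e-12)                # True: H2(p) = H2(1-p)
print(all(H2(k / 100) <= 1.0 for k in range(101)))   # True: H2(p) <= 1
```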

IT 2012, slide 24


[Figure] Graph of H2(p) for p ∈ [0, 1] : the chord joining the points (p1, H2(p1)) and (p2, H2(p2)) lies below the curve, i.e. H2(qp1 + (1 − q)p2) ≥ qH2(p1) + (1 − q)H2(p2).

Another remarkable property : concavity (consequences will appear later).

Means that ∀p1 ≠ p2 ∈ [0, 1], ∀q ∈ ]0, 1[ we have

H2(qp1 + (1 − q)p2) > qH2(p1) + (1 − q)H2(p2)

IT 2012, slide 25


More definitions

Suppose that X = {X1, ..., Xn} and Y = {Y1, ..., Ym} are two (discrete) r.v. defined on a sample space Ω.

Joint entropy of X and Y defined by

H(X, Y) = − ∑_{i=1}^n ∑_{j=1}^m P(Xi ∩ Yj) log P(Xi ∩ Yj). (1)

Conditional entropy of X given Y defined by

H(X|Y) = − ∑_{i=1}^n ∑_{j=1}^m P(Xi ∩ Yj) log P(Xi|Yj). (2)

Mutual information defined by

I(X; Y) = + ∑_{i=1}^n ∑_{j=1}^m P(Xi ∩ Yj) log [P(Xi ∩ Yj) / (P(Xi)P(Yj))]. (3)
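Definitions (1)-(3) translate directly into code. In the sketch below (not from the course material) the joint table is the fair double coin flip with X = H1 and Y = S, an illustrative choice for which the two variables turn out to be independent.

```python
import math

def entropies(joint):
    """joint[i][j] = P(Xi ∩ Yj); returns H(X,Y), H(X|Y), I(X;Y) in Shannon."""
    px = [sum(row) for row in joint]               # marginals P(Xi)
    py = [sum(col) for col in zip(*joint)]         # marginals P(Yj)
    H_xy = -sum(p * math.log2(p) for row in joint for p in row if p > 0)
    H_x_given_y = -sum(p * math.log2(p / py[j])
                       for row in joint
                       for j, p in enumerate(row) if p > 0)
    I_xy = sum(p * math.log2(p / (px[i] * py[j]))
               for i, row in enumerate(joint)
               for j, p in enumerate(row) if p > 0)
    return H_xy, H_x_given_y, I_xy

# X = H1, Y = S for two fair coins: all four joint probabilities equal 1/4.
joint = [[0.25, 0.25],
         [0.25, 0.25]]
print(entropies(joint))   # (2.0, 1.0, 0.0)
```

The result I(H1; S) = 0 is consistent with the independence H1 ⊥ S claimed for fair coins in note 18.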

IT 2012, slide 26


Note that the joint entropy is nothing novel : it is just the entropy of the random variable Z = (X, Y). For the time being consider conditional entropy and mutual information as purely mathematical definitions. The fact that these definitions really make sense will become clear from the study of the properties of these measures.

Exercise (computational) : Consider our double coin flipping experiment. Suppose the coins are both fair.

Compute H(H1), H(H2), H(S). Compute H(H1, H2), H(H2, H1), H(H1, S), H(H2, S) and H(H1, H2, S).

Compute H(H1|H2), H(H2|H1) and then H(S|H1, H2) and H(H1, H2|S).

Compute I(H1; H2), I(H2; H1) and then I(S; H1) and I(S; H2) and I(S; H1, H2).

Experiment to work out for tomorrow.

You are given 12 balls, all of which are equal in weight except for one which is either lighter or heavier. You are also given a two-pan balance to use. In each use of the balance you may put any number of the 12 balls on the left pan, and the same number (of the remaining) balls on the right pan. The result may be one of three outcomes : equal weights on both pans; left pan heavier; right pan heavier. Your task is to design a strategy to determine which is the odd ball and whether it is lighter or heavier in as few (expected) uses of the balance as possible.

While thinking about this problem, you should consider the following questions :

• How can you measure information ? What is the most information you can get from a single weighing ?
• How much information have you gained (on average) when you have identified the odd ball and whether it is lighter or heavier ?
• What is the smallest number of weighings that might conceivably be sufficient to always identify the odd ball and whether it is heavy or light ?
• As you design a strategy you can draw a tree showing for each of the three outcomes of a weighing what weighing to do next. What is the probability of each of the possible outcomes of the first weighing ?

IT 2012, note 26


Properties of the function Hn(···)

Notation : Hn(p1, p2, ..., pn) = − ∑_{i=1}^n pi log pi

(the pi form a discrete probability distribution, i.e. pi ∈ [0, 1] and ∑_{i=1}^n pi = 1)

Positivity : Hn(p1, p2, ..., pn) ≥ 0 (evident)

Annulation : Hn(p1, p2, ..., pn) = 0 ⇒ pi = δi,j for some j (also evident)

Maximal : ⇔ pi = 1/n, ∀i (proof follows later)

Concavity : (proof follows)

Let (p1, p2, ..., pn) and (q1, q2, ..., qn) be two discrete probability distributions and λ ∈ [0, 1]; then

Hn(λp1 + (1 − λ)q1, ..., λpn + (1 − λ)qn) ≥ λHn(p1, ..., pn) + (1 − λ)Hn(q1, ..., qn). (4)

IT 2012, slide 27


The very important properties of the information and entropy measures which have been introduced before are all consequences of the mathematical properties of the entropy function Hn defined and stated on the slide.

The following slides provide proofs respectively of the maximality and concavity properties, which are not trivial.

In addition, let us recall the fact that the function is invariant with respect to any permutation of its arguments.

Questions :

What is the entropy of a uniform distribution ?

Relate the entropy function intuitively to uncertainty and thermodynamic entropy.

Nota bene :

Entropy function : a function defined on a (convex) subset of IR^n.

Entropy measure : a function defined on the set of random variables defined on a given sample space.

⇒ strictly speaking these are two different notions, and that is the reason for using two different notations (Hn vs H).

IT 2012, note 27


Maximum of entropy function (proof)

Gibbs inequality : (Lemma, useful also for the sequel ⇒ keep it in mind)

Formulation : let (p1, p2, ..., pn) and (q1, q2, ..., qn) be two probability distributions. Then

∑_{i=1}^n pi log (qi/pi) ≤ 0, (5)

where equality holds if, and only if, ∀i : pi = qi.

Proof : (we use the fact that ln x ≤ x − 1, with equality ⇔ x = 1, see the slide below)

Let us prove that ∑_{i=1}^n pi ln (qi/pi) ≤ 0.

In ln x ≤ x − 1 replace x by qi/pi, multiply by pi and sum over the index i, which gives (when all pi are strictly positive)

∑_{i=1}^n pi ln (qi/pi) ≤ ∑_{i=1}^n pi (qi/pi − 1) = ∑_{i=1}^n qi − ∑_{i=1}^n pi = 1 − 1 = 0

Homework : convince yourself that this remains true even when some of the pi = 0.
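A numeric sanity check of Gibbs inequality (a sketch; the two example distributions are arbitrary choices):

```python
import math

def gibbs_lhs(p, q):
    """sum_i p_i log2 (q_i / p_i); <= 0 with equality iff p == q."""
    return sum(pi * math.log2(qi / pi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]
q = [0.2, 0.3, 0.5]
print(gibbs_lhs(p, q))   # negative (about -0.35)
print(gibbs_lhs(p, p))   # 0.0 : equality holds when the distributions coincide
```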

IT 2012, slide 28


[Figure] Plot of y = ln x and y = x − 1 for x ∈ [0, 3] : the line y = x − 1 lies above the curve y = ln x and touches it only at x = 1.

IT 2012, slide 29


Theorem :

Hn(p1, p2, ..., pn) ≤ log n, with equality ⇔ ∀i : pi = 1/n.

Proof

Let us apply Gibbs inequality with qi = 1/n.

We find

∑_{i=1}^n pi log (1/(n pi)) = ∑_{i=1}^n pi log (1/pi) − ∑_{i=1}^n pi log n ≤ 0 ⇒

Hn(p1, p2, ..., pn) = ∑_{i=1}^n pi log (1/pi) ≤ ∑_{i=1}^n pi log n = log n,

where equality holds if, and only if, all pi = qi = 1/n. □

IT 2012, slide 30


Concavity of entropy function (proof)

Let (p1, p2, ..., pn) and (q1, q2, ..., qn) be two probability distributions and λ ∈ [0, 1]; then

Hn(λp1 + (1 − λ)q1, ..., λpn + (1 − λ)qn) ≥ λHn(p1, ..., pn) + (1 − λ)Hn(q1, ..., qn). (6)

An (apparently) more general (but logically equivalent) formulation :

Mixture of an arbitrary number of probability distributions

Hn(∑_{j=1}^m λj p1j, ..., ∑_{j=1}^m λj pnj) ≥ ∑_{j=1}^m λj Hn(p1j, ..., pnj), (7)

where λj ∈ [0, 1], ∑_{j=1}^m λj = 1, and ∀j : ∑_{i=1}^n pij = 1.

IT 2012, slide 31


Graphically : mixing increases entropy (cf. thermodynamics)

[Figure] m columns of probability distributions (p1j, ..., pnj), j = 1, ..., m, weighted by λ1, ..., λm, are mixed into the distribution (∑_j λj p1j, ..., ∑_j λj pnj), and

∑_{j=1}^m λj Hn(p1j, ..., pnj) ≤ Hn(∑_{j=1}^m λj p1j, ..., ∑_{j=1}^m λj pnj)

Proof : f(x) = −x log x is concave on [0, 1] : f(∑_{j=1}^m λj xj) ≥ ∑_{j=1}^m λj f(xj).

Thus we have : Hn(∑_{j=1}^m λj p1j, ..., ∑_{j=1}^m λj pnj) = ∑_{i=1}^n f(∑_{j=1}^m λj pij)

≥ ∑_{i=1}^n [∑_{j=1}^m λj f(pij)] = ∑_{j=1}^m λj ∑_{i=1}^n f(pij)

= ∑_{j=1}^m λj Hn(p1j, ..., pnj)
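A numeric check of inequality (7) for m = 2, i.e. inequality (6); the distributions and the weight λ below are arbitrary illustrative choices:

```python
import math

def Hn(p):
    """Entropy function Hn(p1,...,pn) = -sum_i p_i log2 p_i."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

# Two distributions mixed with weights (lam, 1 - lam).
p, q, lam = [0.9, 0.1], [0.2, 0.8], 0.3
mix = [lam * pi + (1 - lam) * qi for pi, qi in zip(p, q)]

print(Hn(mix))                                        # entropy of the mixture
print(lam * Hn(p) + (1 - lam) * Hn(q))                # weighted average of the entropies
print(Hn(mix) >= lam * Hn(p) + (1 - lam) * Hn(q))     # True: mixing increases entropy
```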

IT 2012, slide 32


Another interpretation (thermodynamic)

[Figure] Two containers of blue and red molecules : isolated containers (before mixing, entropy is smaller) vs. containers that communicate, at equilibrium (after mixing, entropy is higher).

Suppose that you pick a molecule in one of the two setups and have to guess which kind of molecule you will obtain : in both cases you can take into account in which container you pick the molecule.

⇒ there is less uncertainty in the unmixed case (left) than in the mixed one (right). Compute the entropies relevant to this problem.

IT 2012, note 32


(Let us open a parenthesis : notions of convexity/concavity

Convex set : a set which contains all the line segments joining any two of its points.

C ⊂ IR^p is convex (by def.) if

x, y ∈ C, λ ∈ [0, 1] ⇒ λx + (1 − λ)y ∈ C.

Examples :

[Figure] Examples of convex and non-convex sets.

In IR : intervals, semi-intervals, IR itself.

IT 2012, slide 33


Examples : sets of probability distributions : n = 2 and n = 3

[Figure] For n = 2 the set of probability distributions (p1, p2) is a line segment; for n = 3 the set of distributions (p1, p2, p3) is a triangle (the simplex).

More generally :

Linear subspaces and half-planes are convex

Any intersection of convex sets is also convex (e.g. polyhedra)

Ellipsoids : {x | (x − xc)^T A^{-1} (x − xc) ≤ 1} (A positive definite)

IT 2012, slide 34


Convex functions

f(·) : IR^p → IR is convex on a convex subset C of IR^p if :

x, y ∈ C, λ ∈ [0, 1] ⇒ f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y).

[Figure] Graph of a convex function y = f(x) : the chord between x1 and x2 lies above the curve at λx1 + (1 − λ)x2.

epigraph : {(x, y) | x ∈ C, f(x) ≤ y} is a convex set ⇔ f(·) is convex

A sum of convex functions is also convex
If g(x, y) is convex in x then so is ∫ g(x, y) dy
fi convex ⇒ max_i fi convex
f(x) convex ⇒ f(Ax + b) convex

Similar properties hold for concave functions...

IT 2012, slide 35


Strictly convex function

If equality holds only for the trivial cases

λ ∈ {0, 1} and/or x = y

(Strictly) concave function

If −f(·) is (strictly) convex.

Important properties

- A convex (or concave) function is continuous in the interior of a convex set

- Every local minimum of a convex function is also a global minimum

Criteria :
- convex iff convex epigraph
- convex if the second derivative (resp. Hessian) is positive (resp. positive definite)

In practice : there exist powerful optimization algorithms...

IT 2012, slide 36


Notion of convex linear combination : these are the points which may be written as

∑_{j=1}^m λj xj, with λj ≥ 0 and ∑_{j=1}^m λj = 1,

i.e. a kind of weighted average of the starting points (non-negative weights).

Associate weights ⇒ center of gravity

[Figure] The convex hull of some points = the set of points which are convex linear combinations of these points.

A convex hull is a convex set.

In fact, it is the smallest convex set which contains the points xj.

(≡ Intersection of all convex sets containing these points.)

IT 2012, slide 37


Jensen’s inequality :

If f(·) : IR → IR is a convex function and X a real random variable, then

f(E{X}) ≤ E{f(X)}, where, if the function is strictly convex, equality holds iff X is constant almost surely.

Extension to vectors...

Concave functions : ≤ becomes ≥ ...

Particular case : convex linear combinations

The λj's act as a discrete probability measure on Ω = {1, ..., m},

and xj denotes the value of X at point ω = j.

Hence : f(∑_{j=1}^m λj xj) ≤ ∑_{j=1}^m λj f(xj)
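A numeric illustration of Jensen's inequality for a convex linear combination (a sketch; the convex function f(x) = x² and the weights are arbitrary choices):

```python
# Jensen's inequality for the convex function f(x) = x**2 and a convex
# linear combination (the lambdas act as a discrete probability measure).
f = lambda x: x ** 2
lambdas = [0.2, 0.5, 0.3]
xs = [1.0, 4.0, 10.0]

lhs = f(sum(l * x for l, x in zip(lambdas, xs)))   # f(E{X})
rhs = sum(l * f(x) for l, x in zip(lambdas, xs))   # E{f(X)}
print(lhs, "<=", rhs, ":", lhs <= rhs)             # approximately 27.04 <= 38.2 : True
```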

Let us close the parenthesis...)

IT 2012, slide 38


Let us return to the entropies.

Strictly concave functions f(x) = −x log x and g(x) = log x.

[Figure] Plot of the strictly concave function f(x) = −x log x for x ∈ [0, 1].

One deduces :

H(X), H(Y), H(X, Y) are maximal for uniform distributions.

Also, H(X|Y) is maximal if for each j P(X|Yj) is uniform, which is possible only if P(X) is uniform and X is independent of Y.

IT 2012, slide 39


The fact that f(x) = −x log x is strictly concave is clear from the picture. Clearly log x is also concave.

All inequalities related to the entropy function may be easily deduced from the concavity of these two functions and Jensen's inequality. For example, let us introduce here a new quantity called relative entropy or Kullback-Leibler distance. Let P and Q be two discrete probability distributions defined on a discrete Ω = {ω1, ..., ωn}. The Kullback-Leibler distance (or relative entropy) of P w.r.t. Q is defined by

D(P||Q) = ∑_{ω∈Ω} P(ω) log [P(ω)/Q(ω)] (8)

Jensen's inequality allows us to prove that² D(P||Q) ≥ 0 :

−D(P||Q) = − ∑_{ω∈Ω} P(ω) log [P(ω)/Q(ω)] = ∑_{ω∈Ω} P(ω) log [Q(ω)/P(ω)] (9)

≤ log (∑_{ω∈Ω} P(ω) Q(ω)/P(ω)) = log (∑_{ω∈Ω} Q(ω)) = log 1 = 0 (10)

where the inequality follows from Jensen's inequality applied to the concave function log x. Because the function is strictly concave, equality holds only if Q(ω)/P(ω) is constant over Ω (and hence equal to 1), which justifies the name of distance of the relative entropy.

² This is nothing else than Gibbs inequality, which we have already proven without using Jensen's inequality.
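The relative entropy of equation (8) in code form (a sketch; the example distributions are arbitrary choices):

```python
import math

def kl_divergence(P, Q):
    """D(P||Q) = sum_w P(w) log2 (P(w)/Q(w)); >= 0, with equality iff P == Q."""
    return sum(p * math.log2(p / q) for p, q in zip(P, Q) if p > 0)

P = [0.5, 0.25, 0.25]
Q = [1/3, 1/3, 1/3]
print(kl_divergence(P, Q))   # about 0.085 (non-negative)
print(kl_divergence(P, P))   # 0.0
print(kl_divergence(P, Q) == kl_divergence(Q, P))   # False: it is not symmetric
```

The lack of symmetry is one reason "distance" is only an informal name for this quantity.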

IT 2012, note 39