COMP 562: Introduction to Machine Learning
Lecture 9: Information Theory, Bayesian Networks
Mahmoud Mostapha
Department of Computer Science
University of North Carolina at Chapel Hill
September 26, 2018
COMP562 - Lecture 9
Plan for today:
- Basics of Information Theory
- Conditional Independence
- Bayesian Networks
Information Theory
The fundamental problem of communication is that of reproducing at one point either exactly or approximately a message selected at another point.
(Claude Shannon, 1948)
Information Theory tackles problems such as
- communication over noisy channels
- compression
Information Theory
Suppose we have a noisy channel that flips bits with probability f. Let x be the message sent over the noisy channel and y the received message.
Basics of Information Theory
Suppose you communicate to your friends using one of the four messages a, b, c, and d.

You might opt to encode the messages as¹

- Enc(a) = 00
- Enc(b) = 01
- Enc(c) = 10
- Enc(d) = 11

Note that the length of each message is 2 bits (notation: |Enc(a)| = 2).

Regardless of the message you want to communicate, we always have to transmit 2 bits.

¹ This table is called a codebook.
Expected/average message length
The expected length of a message is

E_p[|Enc(M)|] = ∑_{m ∈ {a,b,c,d}} p(M = m) |Enc(m)|

Since |Enc(m)| = 2 for all m, on average for each message you will use … 2 bits.
Basics of Information Theory
Suppose you knew something extra: the probability that a particular message will need to be transmitted.

p(M) = 1/2 if M = a; 1/4 if M = b; 1/8 if M = c; 1/8 if M = d

Could you then take advantage of this information?
Basics of Information Theory
Use short codewords for frequent messages, longer codewords for infrequent messages:

- p(M = a) = 1/2, Enc(a) = 0
- p(M = b) = 1/4, Enc(b) = 10
- p(M = c) = 1/8, Enc(c) = 110
- p(M = d) = 1/8, Enc(d) = 111

and the expected message length is

1/2 · 1 + 1/4 · 2 + 1/8 · 3 + 1/8 · 3 = 1.75.

Hence, on average we save 0.25 bits per message using the above codebook.
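The saving is easy to verify numerically. A minimal Python sketch, using the message probabilities and codeword lengths from the slide:

```python
# Expected message length under the variable-length codebook above.
# Probabilities and codeword lengths are taken from the slide.
p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
code_len = {"a": 1, "b": 2, "c": 3, "d": 3}  # |Enc(m)| for each message

expected_len = sum(p[m] * code_len[m] for m in p)
print(expected_len)  # 1.75, i.e. 0.25 bits shorter than the fixed 2-bit code
```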
Entropy
You can show that the codeword length assignment that minimizes the expected message length is −log₂ p(m).

Given a random variable M distributed according to p, the entropy is

H(M) = ∑_m p(M = m) [−log₂ p(M = m)]

The entropy can be interpreted as the number of bits spent in communication using the optimal codebook.

For simplicity, we may write H(p) when it is clear which random variable we are considering.
Conditional entropy
H(X) = ∑_x p(X = x) [−log₂ p(X = x)]

If we have access to extra information Y, then we can compute the conditional entropy

H(X|Y) = ∑_y ∑_x p(X = x, Y = y) [−log₂ p(X = x|Y = y)]
Entropy – characteristics
H(p) = ∑_m p(m) [−log₂ p(m)]

Entropy is always non-negative.

The more uniform the distribution, the higher the entropy (I need 2 bits for 4 messages with prob. 1/4 each).

The less uniform the distribution, the lower the entropy (I need 0 bits if I always send the same message).
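These characteristics can be checked with a short entropy function; the three distributions below are the examples from the slides:

```python
import math

def entropy(probs):
    # H(p) = sum_m p(m) * (-log2 p(m)); terms with p(m) = 0 contribute nothing
    return sum(-p * math.log2(p) for p in probs if p > 0)

print(entropy([0.25, 0.25, 0.25, 0.25]))   # 2.0 bits: uniform over 4 messages
print(entropy([0.5, 0.25, 0.125, 0.125]))  # 1.75 bits: the skewed codebook example
print(entropy([1.0, 0.0]))                 # 0.0 bits: always the same message
```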
Entropy
[Figure: entropy of the Bernoulli distribution [p, 1−p] as a function of p, in bits]

Maximized for the uniform distribution p = 1/2:

H([1/2, 1/2]) = (1/2)(−log₂ 1/2) + (1/2)(−log₂ 1/2) = (1/2) log₂ 2 + (1/2) log₂ 2 = 1
Entropy
[Figure: entropy of the distribution [p1, p2, 1−p1−p2] as a function of p1 and p2]

H([p1, p2, 1−p1−p2]) = −p1 log₂ p1 − p2 log₂ p2 − (1−p1−p2) log₂(1−p1−p2)

For which p1 and p2 is the entropy the largest? What is the value of the entropy for that distribution?
Entropy recap
- Entropy is evaluated on probability distributions

  H(p) = −∑_m p(M = m) log₂ p(M = m)

- Given a distribution of messages, entropy tells us the least number of bits we would need to encode messages to communicate efficiently.
- The length of a code for a message should be −log₂ p(m). Sometimes this cannot be achieved.²
- Entropy is maximized for uniform distributions:

  H([1/K, …, 1/K]) = log₂ K

² If p(m) ≠ 2^(−c) for an integer c, we get fractional code lengths.
Cross-entropy
Suppose we were given message probabilities q.

We build our codebook based on q, and the optimal average message length is H(q).

But it turns out that q is incorrect and the true message probabilities are p.

Cross-entropy is the average message length under these circumstances:

H(p, q) = −∑_m p(m) log q(m)
Cross-entropy
Cross-entropy

H(p, q) = −∑_m p(m) log q(m)

is always greater than or equal to the entropy

H(p) = −∑_m p(m) log p(m).

Using the wrong codebook always incurs a cost in communication, since codeword lengths no longer match message frequencies.
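A small sketch illustrating this inequality with the earlier four-message example; a mistaken uniform q plays the role of the wrong codebook, and base-2 logarithms make the numbers read as bits:

```python
import math

def cross_entropy(p, q):
    # H(p, q) = -sum_m p(m) log2 q(m): average length when codes are built from q
    return sum(-pm * math.log2(qm) for pm, qm in zip(p, q) if pm > 0)

p = [0.5, 0.25, 0.125, 0.125]   # true message probabilities
q = [0.25, 0.25, 0.25, 0.25]    # mistaken uniform assumption

print(cross_entropy(p, q))  # 2.0: every codeword is 2 bits under q
print(cross_entropy(p, p))  # 1.75: H(p, p) = H(p), the optimum
```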
Kullback-Leibler divergence
Had we known the true distribution p, our average message length would be

H(p) = −∑_m p(m) log p(m)

but we didn't, and we are now on average using longer messages:

H(p, q) = −∑_m p(m) log q(m).

How much longer?

H(p, q) − H(p) = −∑_m p(m) log q(m) + ∑_m p(m) log p(m)

This difference is called the Kullback-Leibler divergence:

KL(p||q) = H(p, q) − H(p) = −∑_m p(m) log q(m) + ∑_m p(m) log p(m)
KL divergence
For discrete distributions:

KL(p||q) = ∑_x p(x) log ( p(x) / q(x) )

For continuous pdfs:

KL(p||q) = ∫_x p(x) log ( p(x) / q(x) ) dx

A couple of observations about KL-divergence:

1. KL(p||q) ≥ 0
2. KL(p||q) = 0 if and only if p = q
3. it is not symmetric: KL(p||q) ≠ KL(q||p)
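These observations can be checked numerically. A sketch using base-2 logarithms; the first pair of distributions is the earlier four-message example, and the second pair is a hypothetical example chosen to expose the asymmetry:

```python
import math

def kl(p, q):
    # KL(p||q) = sum_x p(x) log2( p(x) / q(x) ), with 0 log 0 = 0
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.25, 0.125, 0.125]
q = [0.25, 0.25, 0.25, 0.25]

print(kl(p, p))  # 0.0: the divergence vanishes iff the arguments are equal
print(kl(p, q))  # 0.25: extra bits per message from using the uniform codebook

a, b = [0.9, 0.1], [0.5, 0.5]
print(kl(a, b), kl(b, a))  # the two values differ: KL is not symmetric
```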
Summary
- Entropy as a measurement of the number of bits needed to communicate efficiently.
- KL-divergence as the number of bits that could be saved by using the right distribution.
- KL-divergence as a distance between distributions.
Maximum likelihood and KL minimization
Delta function or indicator function:

[x] = 1 if x is true, 0 otherwise

Data/empirical distribution:

f(x) = (1/N) ∑_{i=1}^N [x_i = x]

Importantly:

∫_x f(x) g(x) dx = ∫_x ( (1/N) ∑_{i=1}^N [x_i = x] ) g(x) dx = (1/N) ∑_{i=1}^N g(x_i)
Maximizing likelihood is equivalent to minimizing KL
ALL(θ) = ∑_{i=1}^N (1/N) log p(x_i|θ)

KL divergence between the empirical distribution f(x) and p(x|θ):

KL(f||p) = ∫_x f(x) log ( f(x) / p(x|θ) ) dx
  = ∫_x ( (1/N) ∑_{i=1}^N [x_i = x] ) log ( f(x) / p(x|θ) ) dx
  = ∑_{i=1}^N (1/N) log ( f(x_i) / p(x_i|θ) )
  = −∑_{i=1}^N (1/N) log p(x_i|θ) + ∑_{i=1}^N (1/N) log f(x_i)

where the first term is −ALL(θ) and the second term is constant in θ.
Maximizing likelihood is equivalent to minimizing KL
KL(f||p) = −ALL(θ) + const.

and hence

argmin_θ KL(f||p) = argmax_θ ALL(θ) = argmax_θ LL(θ)
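A small numerical sketch of this equivalence for a Bernoulli model, using a hypothetical four-flip sample and a grid search over θ; both criteria pick out the same parameter:

```python
import math

# Hypothetical coin-flip data; empirical distribution f over {0, 1}
data = [1, 1, 1, 0]
f = [data.count(0) / len(data), data.count(1) / len(data)]  # [0.25, 0.75]

def avg_ll(theta):
    # ALL(theta) = (1/N) sum_i log p(x_i | theta) for Bernoulli(theta)
    return sum(math.log(theta if x == 1 else 1 - theta) for x in data) / len(data)

def kl_f_p(theta):
    # KL between empirical f and the model p = [1 - theta, theta]
    p = [1 - theta, theta]
    return sum(fx * math.log(fx / px) for fx, px in zip(f, p) if fx > 0)

grid = [i / 100 for i in range(1, 100)]
best_ll = max(grid, key=avg_ll)   # maximize average log-likelihood
best_kl = min(grid, key=kl_f_p)   # minimize KL(f || p)
print(best_ll, best_kl)  # both 0.75, the empirical frequency of heads
```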
Mutual information
Mutual information between random variables is defined as

I(X;Y) = KL( p(X,Y) || p(X)p(Y) ) = ∑_x ∑_y p(x,y) log ( p(x,y) / (p(x)p(y)) )
Mutual information can be expressed as a difference of entropy and conditional entropy:
I(X;Y) = H(X) − H(X|Y)

Note that mutual information is symmetric:

I(X;Y) = I(Y;X) = H(Y) − H(Y|X)
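A sketch computing mutual information directly from a joint probability table; the two joint distributions below are hypothetical examples at the extremes:

```python
import math

def mutual_information(joint):
    # I(X;Y) = sum_{x,y} p(x,y) log2( p(x,y) / (p(x) p(y)) )
    px = [sum(row) for row in joint]            # marginal p(x): row sums
    py = [sum(col) for col in zip(*joint)]      # marginal p(y): column sums
    return sum(joint[i][j] * math.log2(joint[i][j] / (px[i] * py[j]))
               for i in range(len(px)) for j in range(len(py))
               if joint[i][j] > 0)

indep = [[0.25, 0.25], [0.25, 0.25]]   # X and Y independent
coupled = [[0.5, 0.0], [0.0, 0.5]]     # X determines Y

print(mutual_information(indep))    # 0.0 bits
print(mutual_information(coupled))  # 1.0 bit: knowing X removes all uncertainty in Y
```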
We will use mutual information in your homework to learn Bayesian networks.
Information theory
Relevant concepts:
- Entropy
- Cross-entropy
- Kullback-Leibler divergence
- Mutual information
Graphical models
Suppose we observe multiple correlated variables, such as words in a document or pixels in an image.

- How can we compactly represent the joint distribution p(x|θ)?
- How can we use this distribution to infer one set of variables given another in a reasonable amount of computation time?
- How can we learn the parameters of this distribution with a reasonable amount of data?

Graphical models provide an intuitive way of representing and visualising the relationships between many variables.
Representing knowledge through graphical models
- Nodes correspond to random variables
- Edges represent statistical dependencies between the variables

A graphical model is a way to represent a joint distribution by making conditional independence assumptions.
Graphical models = statistics × graph theory × computer science
Conditional independence
Marginal independence, which you are familiar with:

p(X|Y) = p(X)
p(X,Y) = p(X)p(Y)

Conditional independence:

p(X|Y,Z) = p(X|Z)
p(X,Y|Z) = p(X|Z)p(Y|Z)

And the shorthand for this relationship is

X ⊥ Y | Z
Note that the relationship is symmetric.
Examples of conditional independence
Shoe size ⊥ Gray hair | Age
Temperature inside ⊥ Temperature outside | Air conditioning is working
Pepsi or Coke ⊥ Sweetness | Restaurant chain
Federal funds rate ⊥ State of economy | Federal Reserve meeting notes
Dice roll outcome ⊥ Previous dice rolls | Dice is not loaded
Benefits of capturing conditional independence
The obvious benefit is in compact representation.
Suppose we have a probability distribution p(X,Y,Z) and the random variables X, Y, Z can each assume 5 states.

To represent the distribution we need to store 5³ probabilities.³

If we know that

X ⊥ Z | Y

then we can write

p(X,Y,Z) = p(X|Y,Z)p(Y|Z)p(Z) = p(X|Y)p(Y|Z)p(Z)

and instead of storing one large table we store 3 significantly smaller tables.⁴

³ OK, 5³ − 1, because the probabilities need to sum to 1.
⁴ 124 vs. 44 free entries; alternatively, 125 vs. 55 total entries.
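The footnote's counts can be reproduced directly:

```python
# Parameter counting from the slide: three variables, K = 5 states each.
K = 5

full_joint = K**3 - 1  # free entries in the full joint table (they sum to 1)
# With X ⊥ Z | Y: p(X|Y) has K*(K-1) free entries, p(Y|Z) has K*(K-1), p(Z) has K-1
factored = K * (K - 1) + K * (K - 1) + (K - 1)

print(full_joint, factored)  # 124 vs 44, matching the slide's footnote
```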
Benefits of capturing conditional independence
We can also capture such independencies in a graphical form.
X ← Y ← Z
p(X ,Y ,Z ) = p(X |Y )p(Y |Z )p(Z )
Bayesian networks
A directed graphical model is a graphical model whose graph is a directed acyclic graph (DAG). Also known as a Bayesian network, belief network, or causal network.

In DAGs the nodes can be ordered such that parents come before children. This is called a topological ordering.
Specifying a graphical model
Each of these nodes corresponds to a random variable.
[Figure: a DAG over nodes X1, …, X7 with edges X1→X2, X2→X3, X2→X4, X3→X5, X4→X6, X5→X7, X6→X7]

and each node has a conditional probability associated with it,

p(Xi | X_pa(i))

where pa(i) is the list of parent nodes of node i, e.g.

p(X7 | X5, X6)
Specifying a graphical model
[Figure: a DAG over nodes X1, …, X7 with edges X1→X2, X2→X3, X2→X4, X3→X5, X4→X6, X5→X7, X6→X7]

This graphical model specifies the joint distribution

p(X1, X2, X3, X4, X5, X6, X7) = ∏_i p(Xi | X_pa(i))
  = p(X1) p(X2|X1) p(X3|X2) p(X4|X2) p(X5|X3) p(X6|X4) p(X7|X5,X6)
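As a sanity check, the factorization can be evaluated numerically. A sketch with placeholder uniform CPTs over binary variables (the parent sets match the factorization above; the CPT values are made up):

```python
import itertools

# Parent sets pa(i) for the DAG on the slide
parents = {1: (), 2: (1,), 3: (2,), 4: (2,), 5: (3,), 6: (4,), 7: (5, 6)}

def cpt(i, pa_vals, xi):
    # Hypothetical CPT: p(X_i = xi | parents = pa_vals) = 1/2 for binary X_i
    return 0.5

def joint(x):
    # p(x) = prod_i p(x_i | x_pa(i)); x maps node index -> value
    prob = 1.0
    for i, pa in parents.items():
        prob *= cpt(i, tuple(x[j] for j in pa), x[i])
    return prob

# Summing over all 2^7 assignments shows the factorization is a valid joint
total = sum(joint(dict(zip(parents, vals)))
            for vals in itertools.product([0, 1], repeat=7))
print(total)  # 1.0
```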
Another example of a graphical model
[Figure: a tree-structured DAG over nodes X1, …, X7 with edges X1→X2, X1→X3, X2→X4, X2→X5, X3→X6, X3→X7]

p(X1, X2, X3, X4, X5, X6, X7) = ∏_i p(Xi | X_pa(i))
  = p(X1) p(X2|X1) p(X3|X1) p(X4|X2) p(X5|X2) p(X6|X3) p(X7|X3)
Example: naive Bayes classifiers as Bayesian networks
[Figure, left: a naive Bayes classifier represented as a directed graphical model. We assume there are D = 4 features, for simplicity. Shaded nodes are observed, unshaded nodes are hidden.]

[Figure, right: a tree-augmented naive Bayes (TAN) classifier for D = 4 features. In general, the tree topology can change depending on the value of y.]

p(y, x) = p(y) ∏_{j=1}^D p(x_j | y)
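A minimal sketch of this naive Bayes factorization with D = 4 binary features. All CPT numbers below are hypothetical, not from the lecture:

```python
# Naive Bayes: p(y, x) = p(y) * prod_j p(x_j | y), with made-up CPTs
p_y = {0: 0.6, 1: 0.4}            # class prior p(y)
p_xj_given_y = {                  # p(x_j = 1 | y) for j = 0..3
    0: [0.1, 0.2, 0.7, 0.4],
    1: [0.8, 0.6, 0.3, 0.5],
}

def joint(y, x):
    # p(y, x) = p(y) * prod_j p(x_j | y)
    prob = p_y[y]
    for j, xj in enumerate(x):
        pj = p_xj_given_y[y][j]
        prob *= pj if xj == 1 else 1 - pj
    return prob

def predict(x):
    # Argmax over the joint, equivalent to the posterior argmax over y
    return max(p_y, key=lambda y: joint(y, x))

print(predict([1, 1, 0, 1]))  # 1: this feature pattern favors class 1
```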
Example: Bayesian network for medical diagnosis
Asia Bayesian Network (Interactive Demo)
Example: Bayesian network on Instagram hashtags
We downloaded images and captions from Instagram that were tagged #buterfly.

These captions were converted into binary vectors of hashtag presence/absence.

A Bayesian network was constructed from these data.

For HW2, you will do that and much more on a different Instagram dataset.
Today
- Information Theory
- Bayesian Networks