Page 1

COMP 562: Introduction to Machine Learning

Lecture 9 : Information Theory, Bayesian Networks

Mahmoud Mostapha

Department of Computer Science

University of North Carolina at Chapel Hill

[email protected]

September 26, 2018

Page 2

COMP562 - Lecture 9

Plan for today:

- Basics of Information Theory

- Conditional Independence

- Bayesian Networks

Page 3

Information Theory

The fundamental problem of communication is that of reproducing at one point either exactly or approximately a message selected at another point.

(Claude Shannon, 1948)

Information Theory tackles problems such as

- communication over noisy channels

- compression

Page 4

Information Theory

Suppose we have a noisy channel that flips bits with probability f. Let x be the message sent over the noisy channel and y the received message.
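To make this setup concrete, here is a minimal Python sketch (my own illustration, not from the slides; the message bits and the flip probability f = 0.1 are arbitrary) of such a binary symmetric channel:

```python
import random

def noisy_channel(bits, f, seed=0):
    """Transmit a list of bits over a channel that flips each bit with probability f."""
    rng = random.Random(seed)
    return [b ^ 1 if rng.random() < f else b for b in bits]

x = [0, 1, 1, 0, 1, 0, 0, 1]   # message sent
y = noisy_channel(x, f=0.1)    # message received (some bits may be flipped)
print(x)
print(y)
```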

Page 5

Basics of Information Theory

Suppose you communicate to your friends using one of the four messages a, b, c, and d.

You might opt to encode the messages as follows:¹

- Enc(a) = 00

- Enc(b) = 01

- Enc(c) = 10

- Enc(d) = 11

Note that the length of each message is 2 bits (notation: |Enc(a)| = 2).

Regardless of the message you want to communicate, we always have to transmit 2 bits.

¹ This table is called a codebook.

Page 6

Expected/average message length

The expected length of a message is

E_p[|Enc(M)|] = ∑_{m ∈ {a,b,c,d}} p(M = m) |Enc(m)|

Since |Enc(m)| = 2 for all m, on average for each message you will use ... 2 bits.

Page 7

Basics of Information Theory

Suppose you knew something extra: the probability that a particular message will need to be transmitted.

p(M) = 1/2 if M = a, 1/4 if M = b, 1/8 if M = c, 1/8 if M = d

Could you then take advantage of this information?

Page 8

Basics of Information Theory

Use short codewords for frequent messages and longer codewords for infrequent messages:

- p(M = a) = 1/2, Enc(a) = 0

- p(M = b) = 1/4, Enc(b) = 10

- p(M = c) = 1/8, Enc(c) = 110

- p(M = d) = 1/8, Enc(d) = 111

and the expected message length is

1/2 · 1 + 1/4 · 2 + 1/8 · 3 + 1/8 · 3 = 1.75.

Hence, on average we save 0.25 bits per message using the above codebook.
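As a quick sanity check of this arithmetic, here is a minimal Python sketch (my own illustration, not part of the slides) computing E_p[|Enc(M)|] for both codebooks:

```python
p = {"a": 1/2, "b": 1/4, "c": 1/8, "d": 1/8}

fixed    = {"a": "00", "b": "01", "c": "10",  "d": "11"}   # 2 bits for every message
variable = {"a": "0",  "b": "10", "c": "110", "d": "111"}  # short codes for frequent messages

def expected_length(p, codebook):
    # E_p[|Enc(M)|] = sum_m p(M = m) * |Enc(m)|
    return sum(p[m] * len(codebook[m]) for m in p)

print(expected_length(p, fixed))     # 2.0
print(expected_length(p, variable))  # 1.75
```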

Page 9

Entropy

You can show that the codeword length assignment that minimizes the expected message length is −log2 p(m).

Given a random variable M distributed according to p, the entropy is

H(M) = ∑_m p(M = m) [−log2 p(M = m)]

The entropy can be interpreted as the number of bits spent in communication using the optimal codebook.

For simplicity, we may write H(p) when it is clear which random variable we are considering.
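As an illustration (my own sketch, not part of the slides), entropy can be computed directly from this definition; the distribution below is the one from the codebook example:

```python
import math

def entropy(p):
    """H(M) = sum_m p(M = m) * (-log2 p(M = m)), skipping zero-probability messages."""
    return sum(pm * -math.log2(pm) for pm in p.values() if pm > 0)

p = {"a": 1/2, "b": 1/4, "c": 1/8, "d": 1/8}
print(entropy(p))                          # 1.75 bits, matching the codebook above
print(entropy({m: 1/4 for m in "abcd"}))   # 2.0 bits for the uniform distribution
```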

Page 10

Conditional entropy

H(X) = ∑_x p(X = x) [−log2 p(X = x)]

If we have access to extra information Y, then we can compute the conditional entropy

H(X|Y) = ∑_y p(Y = y) ∑_x p(X = x|Y = y) [−log2 p(X = x|Y = y)]
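Here is a small sketch (my own, with a made-up joint table) that computes H(X|Y) by weighting the per-y entropies by p(Y = y), as in the formula above:

```python
import math

# A made-up joint distribution p(X = x, Y = y) for illustration.
p_xy = {("x1", "y1"): 0.30, ("x2", "y1"): 0.10,
        ("x1", "y2"): 0.15, ("x2", "y2"): 0.45}

p_y = {}
for (x, y), p in p_xy.items():
    p_y[y] = p_y.get(y, 0.0) + p          # marginal p(Y = y)

def conditional_entropy(p_xy, p_y):
    # H(X|Y) = sum_y p(y) sum_x p(x|y) * (-log2 p(x|y))
    h = 0.0
    for (x, y), p in p_xy.items():
        p_x_given_y = p / p_y[y]
        h += p_y[y] * p_x_given_y * -math.log2(p_x_given_y)
    return h

print(conditional_entropy(p_xy, p_y))
```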

Page 11

Entropy – characteristics

H(p) = ∑_m p(m) [−log2 p(m)]

Entropy is always non-negative.

The more uniform the distribution, the higher the entropy (I need 2 bits for 4 messages with prob. 1/4 each).

The less uniform the distribution, the lower the entropy (I need 0 bits if I always send the same message).

Page 12

Entropy

[Figure: entropy (in bits) of the Bernoulli distribution [p, 1 − p], plotted as a function of p(x) over [0, 1]; it rises from 0 to 1 bit at p = 1/2 and falls back to 0.]

Maximized for the uniform distribution p = 1/2:

H([1/2, 1/2]) = 1/2 (−log2 1/2) + 1/2 (−log2 1/2) = 1/2 log2 2 + 1/2 log2 2 = 1

Page 13

Entropy

[Figure: entropy of the distribution [p1, p2, 1 − p1 − p2], shown as a function of p1 and p2.]

H([p1, p2, 1 − p1 − p2]) = −p1 log2 p1 − p2 log2 p2 − (1 − p1 − p2) log2(1 − p1 − p2)

For which p1 and p2 is the entropy the largest? What is the value of entropy for that distribution?
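To answer this numerically, the sketch below (my own illustration) searches a grid over (p1, p2); the maximum is attained near p1 = p2 = 1/3 with value log2 3 ≈ 1.585 bits, consistent with the uniform-distribution fact on the next slide:

```python
import math

def h3(p1, p2):
    """Entropy of [p1, p2, 1 - p1 - p2], skipping zero-probability entries."""
    p3 = 1.0 - p1 - p2
    return sum(-p * math.log2(p) for p in (p1, p2, p3) if p > 0)

best = max(((h3(i / 200, j / 200), i / 200, j / 200)
            for i in range(1, 200) for j in range(1, 200 - i)),
           key=lambda t: t[0])
print(best)            # close to (1.585, 1/3, 1/3)
print(math.log2(3))    # 1.5849625...
```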

Page 14

Entropy recap

- Entropy is evaluated on probability distributions:

  H(p) = −∑_m p(M = m) log2 p(M = m)

- Given a distribution of messages, entropy tells us the least number of bits we would need to encode messages to communicate efficiently.

- The length of a code for a message should be −log2 p(m). Sometimes this cannot be achieved.²

- Entropy is maximized for uniform distributions:

  H([1/K, ..., 1/K]) = log2 K

² If p(m) ≠ 2^(−c) for some integer c, we get fractional code lengths.

Page 15

Cross-entropy

Suppose we were given message probabilities q.

We build our codebook based on q, and the optimal average message length is H(q).

But it turns out that q is incorrect and the true message probabilities are p.

Cross-entropy is the average message length under these circumstances:

H(p, q) = −∑_m p(m) log q(m)

Page 16

Cross-entropy

Cross-entropy

H(p, q) = −∑_m p(m) log q(m)

is always greater than or equal to the entropy

H(p) = −∑_m p(m) log p(m).

Using the wrong codebook always incurs a cost in communication: we end up assigning long codewords to messages that are in fact frequent.
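A small sketch (illustration only; p is the distribution from the codebook example and q is a uniform guess) comparing H(p) with H(p, q):

```python
import math

def cross_entropy(p, q):
    # H(p, q) = -sum_m p(m) * log2 q(m)
    return sum(-pm * math.log2(q[m]) for m, pm in p.items() if pm > 0)

p = {"a": 1/2, "b": 1/4, "c": 1/8, "d": 1/8}   # true message probabilities
q = {"a": 1/4, "b": 1/4, "c": 1/4, "d": 1/4}   # assumed (wrong) probabilities

print(cross_entropy(p, p))  # 1.75 = H(p), the optimal average length
print(cross_entropy(p, q))  # 2.00 >= H(p): the wrong codebook costs extra bits
```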

Page 17

Kullback-Leibler divergence

Had we known the true distribution p, our average message length would be

H(p) = −∑_m p(m) log p(m)

We didn't, so on average we are now using longer messages:

H(p, q) = −∑_m p(m) log q(m).

How much longer?

H(p, q) − H(p) = −∑_m p(m) log q(m) + ∑_m p(m) log p(m)

This difference is called the Kullback-Leibler divergence

KL(p||q) = H(p, q) − H(p) = −∑_m p(m) log q(m) + ∑_m p(m) log p(m)

Page 18

KL divergence

For discrete distributions:

KL(p||q) = ∑_x p(x) log( p(x) / q(x) )

For continuous pdfs:

KL(p||q) = ∫_x p(x) log( p(x) / q(x) ) dx

Page 19

A couple of observations about KL-divergence

1. KL(p||q) ≥ 0

2. KL(p||q) = 0 if and only if p = q

3. It is not symmetric: KL(p||q) ≠ KL(q||p)
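The sketch below (my own, with two made-up coin distributions) checks these three properties numerically in the discrete case:

```python
import math

def kl(p, q):
    # KL(p||q) = sum_x p(x) * log2( p(x) / q(x) )
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

p = {"heads": 0.9, "tails": 0.1}
q = {"heads": 0.5, "tails": 0.5}

print(kl(p, q))   # ~0.531, which is >= 0
print(kl(p, p))   # 0.0: KL is zero exactly when the two distributions match
print(kl(q, p))   # ~0.737, not equal to KL(p||q): KL is not symmetric
```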

Page 20

Summary

- Entropy as a measurement of the number of bits needed to communicate efficiently.

- KL-divergence as the number of bits that could be saved by using the right distribution.

- KL-divergence as a distance between distributions.

Page 21

Maximum likelihood and KL minimization

Delta function or indicator function:

[x] = 1 if x is true, 0 otherwise

Data/empirical distribution:

f(x) = (1/N) ∑_{i=1}^N [x_i = x]

Importantly:

∫_x f(x) g(x) dx = ∫_x ( (1/N) ∑_{i=1}^N [x_i = x] ) g(x) dx = (1/N) ∑_{i=1}^N g(x_i)

Page 22

Maximizing likelihood is equivalent to minimizing KL

ALL(θ) = ∑_{i=1}^N (1/N) log p(x_i|θ)

KL divergence between the empirical distribution f(x) and p(x|θ):

KL(f||p) = ∫_x f(x) log( f(x) / p(x) ) dx

= ∫_x ( (1/N) ∑_{i=1}^N [x_i = x] ) log( f(x) / p(x) ) dx

= ∑_{i=1}^N (1/N) log( f(x_i) / p(x_i) )

= −∑_{i=1}^N (1/N) log p(x_i|θ) + ∑_{i=1}^N (1/N) log f(x_i)

where the first term is −ALL(θ) and the second term (which equals log(1/N) when all data points are distinct) is a constant that does not depend on θ.

Page 23

Maximizing likelihood is equivalent to minimizing KL

KL(f||p) = −ALL(θ) + const.

and hence

argmin_θ KL(f||p) = argmax_θ ALL(θ) = argmax_θ LL(θ)
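As a sanity check (my own sketch, using made-up coin flips and a Bernoulli model p(x|θ) as an arbitrary example), the θ that maximizes the average log-likelihood is also the θ that minimizes KL(f||p):

```python
import math

data = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]                                  # made-up coin flips
f = {1: data.count(1) / len(data), 0: data.count(0) / len(data)}       # empirical distribution

def avg_log_lik(theta):
    # ALL(theta) = (1/N) sum_i log p(x_i | theta) for a Bernoulli(theta) model
    return sum(math.log(theta if x == 1 else 1 - theta) for x in data) / len(data)

def kl_f_p(theta):
    # KL(f || p_theta) for the two-outcome case
    p = {1: theta, 0: 1 - theta}
    return sum(fx * math.log(fx / p[x]) for x, fx in f.items() if fx > 0)

grid = [t / 100 for t in range(1, 100)]
print(max(grid, key=avg_log_lik))   # 0.7, the empirical frequency of heads
print(min(grid, key=kl_f_p))        # 0.7 as well: the same optimizer
```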

Page 24

Mutual information

Mutual information between random variables is defined as

I(X;Y) = KL( p(X,Y) || p(X) p(Y) ) = ∑_x ∑_y p(x, y) log( p(x, y) / (p(x) p(y)) )

Mutual information can be expressed as a difference of entropy and conditional entropy:

I(X;Y) = H(X) − H(X|Y)

Note that mutual information is symmetric:

I(X;Y) = I(Y;X) = H(Y) − H(Y|X)

We will use mutual information in your homework to learn Bayesian networks.
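A small sketch (my own, with a made-up joint table) computing I(X;Y) both from the definition and as H(X) − H(X|Y):

```python
import math

# Made-up joint distribution p(X = x, Y = y) for illustration.
p_xy = {("x1", "y1"): 0.30, ("x2", "y1"): 0.10,
        ("x1", "y2"): 0.15, ("x2", "y2"): 0.45}

p_x, p_y = {}, {}
for (x, y), p in p_xy.items():
    p_x[x] = p_x.get(x, 0.0) + p
    p_y[y] = p_y.get(y, 0.0) + p

# I(X;Y) = sum_{x,y} p(x,y) * log2( p(x,y) / (p(x) p(y)) )
mi = sum(p * math.log2(p / (p_x[x] * p_y[y])) for (x, y), p in p_xy.items())

# H(X) - H(X|Y), using the definitions from the earlier slides
h_x = sum(-p * math.log2(p) for p in p_x.values())
h_x_given_y = sum(p * -math.log2(p / p_y[y]) for (x, y), p in p_xy.items())
print(mi, h_x - h_x_given_y)   # the two values agree (up to floating point)
```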

Page 25

Information theory

Relevant concepts:

- Entropy

- Cross-entropy

- Kullback-Leibler divergence

- Mutual information

Page 26

Graphical models

Suppose we observe multiple correlated variables, such as words in a document or pixels in an image.

- How can we compactly represent the joint distribution p(x|θ)?

- How can we use this distribution to infer one set of variables given another in a reasonable amount of computation time?

- How can we learn the parameters of this distribution with a reasonable amount of data?

Graphical models provide an intuitive way of representing and visualising the relationships between many variables.

Page 27

Representing knowledge through graphical models

- Nodes correspond to random variables

- Edges represent statistical dependencies between the variables

A graphical model is a way to represent a joint distribution by making conditional independence assumptions.

Graphical models = statistics × graph theory × computer science

Page 28

Conditional independence

Marginal independence, which you are already familiar with:

p(X|Y, Z) = p(X)

p(X, Y) = p(X) p(Y)

Conditional independence:

p(X|Y, Z) = p(X|Z)

p(X, Y|Z) = p(X|Z) p(Y|Z)

And the shorthand for this relationship is

X ⊥ Y | Z

Note that the relationship is symmetric.
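A brief numerical sketch (my own, with a made-up distribution that satisfies X ⊥ Y | Z by construction) of what the definition means in practice:

```python
from itertools import product

# Build a made-up joint p(X, Y, Z) that satisfies X ⊥ Y | Z by construction.
p_z = {"z1": 0.6, "z2": 0.4}
p_x_given_z = {"z1": {"x1": 0.8, "x2": 0.2}, "z2": {"x1": 0.3, "x2": 0.7}}
p_y_given_z = {"z1": {"y1": 0.5, "y2": 0.5}, "z2": {"y1": 0.9, "y2": 0.1}}

p = {(x, y, z): p_z[z] * p_x_given_z[z][x] * p_y_given_z[z][y]
     for x, y, z in product(("x1", "x2"), ("y1", "y2"), ("z1", "z2"))}

# Check p(X = x, Y = y | Z = z) == p(X = x | Z = z) * p(Y = y | Z = z) for all x, y, z.
for x, y, z in p:
    p_xy_given_z = p[(x, y, z)] / p_z[z]
    p_x_z = sum(p[(x, yy, z)] for yy in ("y1", "y2")) / p_z[z]
    p_y_z = sum(p[(xx, y, z)] for xx in ("x1", "x2")) / p_z[z]
    assert abs(p_xy_given_z - p_x_z * p_y_z) < 1e-12
print("X ⊥ Y | Z holds for this distribution")
```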

Page 29

Examples of conditional independence

Shoe size ⊥ Gray hair | Age

Temperature inside ⊥ Temperature outside | Air conditioning is working

Pepsi or Coke ⊥ Sweetness | Restaurant chain

Federal funds rate ⊥ State of economy | Federal Reserve meeting notes

Dice roll outcome ⊥ Previous dice rolls | Dice is not loaded

Page 30

Benefits of capturing conditional independence

The obvious benefit is in compact representation.

Suppose we have a probability distribution p(X, Y, Z) and the random variables X, Y, Z can each assume 5 states.

To represent the distribution we need to store 5^3 probabilities.³

If we know that X ⊥ Z | Y, then we can write

p(X, Y, Z) = p(X|Y, Z) p(Y|Z) p(Z) = p(X|Y) p(Y|Z) p(Z)

and instead of storing one large table we store 3 significantly smaller tables.⁴

³ OK, 5^3 − 1, because the probabilities need to sum to 1.

⁴ 124 vs. 44 free entries; alternatively, 125 vs. 55 table entries.

Page 31

Benefits of capturing conditional independence

We can also capture such independencies in a graphical form.

[Graph: chain X ← Y ← Z]

p(X, Y, Z) = p(X|Y) p(Y|Z) p(Z)
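A minimal sketch (my own illustration with random tables) of the storage saving: the joint over three 5-state variables is reconstructed from the three small tables p(X|Y), p(Y|Z), p(Z):

```python
import random

K = 5                              # each of X, Y, Z takes K = 5 states
rng = random.Random(0)

def random_dist(n):
    w = [rng.random() for _ in range(n)]
    s = sum(w)
    return [x / s for x in w]

p_z = random_dist(K)                              # K entries
p_y_given_z = [random_dist(K) for _ in range(K)]  # K*K entries
p_x_given_y = [random_dist(K) for _ in range(K)]  # K*K entries

# Joint reconstructed from the three small tables:
# p(X = x, Y = y, Z = z) = p(x|y) p(y|z) p(z)
def p_xyz(x, y, z):
    return p_x_given_y[y][x] * p_y_given_z[z][y] * p_z[z]

print(K ** 3, K + K * K + K * K)   # 125 vs 55 entries, as in the footnote
print(sum(p_xyz(x, y, z) for x in range(K) for y in range(K) for z in range(K)))  # ~1.0
```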

Page 32

Bayesian networks

A directed graphical model is a graphical model whose graph is a directed acyclic graph (DAG). Also known as a Bayesian network, belief network, or causal network.

In DAGs the nodes can be ordered such that parents come before children. This is called a topological ordering.

Page 33: Mahmoud Mostapha - comp562fall18.web.unc.educomp562fall18.web.unc.edu/files/2018/09/Comp562_Lec9.pdfMahmoud Mostapha Department of Computer Science University of North Carolina at

Specifying a graphical model

Each of these nodes corresponds to a random variable.

[Graph: a DAG over nodes X1, X2, X3, X4, X5, X6, X7]

and each node has a conditional probability associated with it,

p(Xi | X_pa(i))

where pa(i) is the list of parent nodes of node i, e.g.

p(X7 | X5, X6)

Page 34: Mahmoud Mostapha - comp562fall18.web.unc.educomp562fall18.web.unc.edu/files/2018/09/Comp562_Lec9.pdfMahmoud Mostapha Department of Computer Science University of North Carolina at

Specifying a graphical model

[Graph: the same DAG over X1, ..., X7, with edges X1 → X2, X2 → X3, X2 → X4, X3 → X5, X4 → X6, X5 → X7, and X6 → X7]

This graphical model specifies a joint distribution

p(X1, X2, X3, X4, X5, X6, X7) = ∏_i p(Xi | X_pa(i))
= p(X1) p(X2|X1) p(X3|X2) p(X4|X2) p(X5|X3) p(X6|X4) p(X7|X5, X6)
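A minimal sketch (my own, with hypothetical binary variables and made-up CPT numbers) of how this factorization might be represented and evaluated in code:

```python
# Each node: (list of parents, CPT mapping parent values -> p(node = 1 | parents)).
# The structure mirrors the DAG above; the probability numbers are made up.
network = {
    "X1": ([],           {(): 0.6}),
    "X2": (["X1"],       {(0,): 0.2, (1,): 0.7}),
    "X3": (["X2"],       {(0,): 0.5, (1,): 0.1}),
    "X4": (["X2"],       {(0,): 0.4, (1,): 0.9}),
    "X5": (["X3"],       {(0,): 0.3, (1,): 0.8}),
    "X6": (["X4"],       {(0,): 0.6, (1,): 0.2}),
    "X7": (["X5", "X6"], {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.95}),
}

def joint_prob(assignment):
    """p(X1, ..., X7) = prod_i p(Xi | X_pa(i)) for one full 0/1 assignment."""
    prob = 1.0
    for node, (parents, cpt) in network.items():
        p1 = cpt[tuple(assignment[p] for p in parents)]
        prob *= p1 if assignment[node] == 1 else 1 - p1
    return prob

print(joint_prob({"X1": 1, "X2": 1, "X3": 0, "X4": 1, "X5": 1, "X6": 0, "X7": 1}))
```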

Page 35: Mahmoud Mostapha - comp562fall18.web.unc.educomp562fall18.web.unc.edu/files/2018/09/Comp562_Lec9.pdfMahmoud Mostapha Department of Computer Science University of North Carolina at

Another example of a graphical model

[Graph: a tree-structured DAG with root X1, children X2 and X3, and leaves X4, X5 (children of X2) and X6, X7 (children of X3)]

p(X1, X2, X3, X4, X5, X6, X7) = ∏_i p(Xi | X_pa(i))
= p(X1) p(X2|X1) p(X3|X1) p(X4|X2) p(X5|X2) p(X6|X3) p(X7|X3)

Page 36: Mahmoud Mostapha - comp562fall18.web.unc.educomp562fall18.web.unc.edu/files/2018/09/Comp562_Lec9.pdfMahmoud Mostapha Department of Computer Science University of North Carolina at

Example: naive Bayes classifiers as Bayesian networks

[Figure, left: a naive Bayes classifier represented as a directed graphical model. We assume there are D = 4 features, for simplicity. Shaded nodes are observed, unshaded nodes are hidden.]

[Figure, right: a tree-augmented naive Bayes (TAN) classifier for D = 4 features. In general, the tree topology can change depending on the value of y.]

p(y, x) = p(y) ∏_{j=1}^D p(x_j | y)
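A brief sketch (illustrative only, with made-up parameters) of the naive Bayes factorization for D = 4 binary features, and the class posterior it induces:

```python
# Made-up parameters for a two-class naive Bayes model with D = 4 binary features.
p_y = {0: 0.5, 1: 0.5}
p_xj_given_y = {               # p(x_j = 1 | y) for each feature j
    0: [0.1, 0.7, 0.4, 0.2],
    1: [0.8, 0.3, 0.5, 0.9],
}

def joint(y, x):
    # p(y, x) = p(y) * prod_j p(x_j | y)
    prob = p_y[y]
    for j, xj in enumerate(x):
        pj = p_xj_given_y[y][j]
        prob *= pj if xj == 1 else 1 - pj
    return prob

x = [1, 0, 1, 1]
posterior_1 = joint(1, x) / (joint(0, x) + joint(1, x))
print(posterior_1)   # p(y = 1 | x), the quantity a naive Bayes classifier thresholds
```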

Page 37: Mahmoud Mostapha - comp562fall18.web.unc.educomp562fall18.web.unc.edu/files/2018/09/Comp562_Lec9.pdfMahmoud Mostapha Department of Computer Science University of North Carolina at

Example: Bayesian network for medical diagnosis

Asia Bayesian Network (Interactive Demo)

Page 38: Mahmoud Mostapha - comp562fall18.web.unc.educomp562fall18.web.unc.edu/files/2018/09/Comp562_Lec9.pdfMahmoud Mostapha Department of Computer Science University of North Carolina at

Example: Bayesian network on Instagram hashtags

We downloaded images and captions from Instagram that were tagged #buterfly.

These captions were converted into binary vectors of hashtag presence/absence.

A Bayesian network was constructed on these data.

For HW2, you will do that and much more on a different Instagram dataset.

Page 39: Mahmoud Mostapha - comp562fall18.web.unc.educomp562fall18.web.unc.edu/files/2018/09/Comp562_Lec9.pdfMahmoud Mostapha Department of Computer Science University of North Carolina at

Today

- Information Theory

- Bayesian Networks