COMP 562: Introduction to Machine Learning
Lecture 9: Information Theory, Bayesian Networks
Mahmoud Mostapha
Department of Computer Science
University of North Carolina at Chapel Hill
September 26, 2018
COMP562 - Lecture 9
Plan for today:
- Basics of Information Theory
- Conditional Independence
- Bayesian Networks
Information Theory
The fundamental problem of communication is that of reproducing at one point either exactly or approximately a message selected at another point.
(Claude Shannon, 1948)
Information Theory tackles problems such as
- communication over noisy channels
- compression
Information Theory
Suppose we have a noisy channel that flips bits with probability f. Let x be the message sent over the noisy channel and y the received message.
Basics of Information Theory
Suppose you communicate to your friends using one of the four messages a, b, c, and d.

You might opt to encode the messages as¹

- Enc(a) = 00
- Enc(b) = 01
- Enc(c) = 10
- Enc(d) = 11

Note that the length of each message is 2 bits (notation: |Enc(a)| = 2).

Regardless of the message you want to communicate, we always have to transmit 2 bits.

¹ This table is called a codebook.
Expected/average message length
The expected length of a message is

E_p[|Enc(M)|] = ∑_{m ∈ {a,b,c,d}} p(M = m) |Enc(m)|

Since |Enc(m)| = 2 for all m, on average for each message you will use … 2 bits.
Basics of Information Theory
Suppose you knew something extra: the probability that a particular message will need to be transmitted.

p(M) = 1/2 if M = a; 1/4 if M = b; 1/8 if M = c; 1/8 if M = d

Could you then take advantage of this information?
Basics of Information Theory
Use short codewords for frequent messages, longer codewords for infrequent messages:

- p(M = a) = 1/2, Enc(a) = 0
- p(M = b) = 1/4, Enc(b) = 10
- p(M = c) = 1/8, Enc(c) = 110
- p(M = d) = 1/8, Enc(d) = 111

and the expected message length is

1/2 · 1 + 1/4 · 2 + 1/8 · 3 + 1/8 · 3 = 1.75.

Hence, on average we save 0.25 bits per message using the above codebook.
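The saving is easy to verify numerically. A minimal Python sketch, using the message probabilities and codeword lengths from the slide:

```python
# Expected message length under the variable-length codebook above.
# Probabilities and codeword lengths are taken from the slide.
p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
code_len = {"a": 1, "b": 2, "c": 3, "d": 3}  # |Enc(m)| for each message

expected_len = sum(p[m] * code_len[m] for m in p)
print(expected_len)  # 1.75, i.e. 0.25 bits shorter than the fixed 2-bit code
```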
Entropy
You can show that the codeword length assignment that minimizes the expected message length is −log₂ p(m).

Given a random variable M distributed according to p, the entropy is

H(M) = ∑_m p(M = m) [−log₂ p(M = m)]

The entropy can be interpreted as the number of bits spent in communication using the optimal codebook.

For simplicity, we may write H(p) when it is clear which random variable we are considering.
Conditional entropy
H(X) = ∑_x p(X = x) [−log₂ p(X = x)]

If we have access to extra information Y, then we can compute the conditional entropy

H(X|Y) = ∑_y ∑_x p(X = x, Y = y) [−log₂ p(X = x|Y = y)]
Entropy – characteristics
H(p) = ∑_m p(m) [−log₂ p(m)]

Entropy is always non-negative.

The more uniform the distribution, the higher the entropy (I need 2 bits for 4 messages with prob. 1/4 each).

The less uniform the distribution, the lower the entropy (I need 0 bits if I always send the same message).
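These characteristics can be checked with a short entropy function; the three distributions below are the examples from the slides:

```python
import math

def entropy(probs):
    # H(p) = sum_m p(m) * (-log2 p(m)); terms with p(m) = 0 contribute nothing
    return sum(-p * math.log2(p) for p in probs if p > 0)

print(entropy([0.25, 0.25, 0.25, 0.25]))   # 2.0 bits: uniform over 4 messages
print(entropy([0.5, 0.25, 0.125, 0.125]))  # 1.75 bits: the skewed codebook example
print(entropy([1.0, 0.0]))                 # 0.0 bits: always the same message
```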
Entropy
[Figure: entropy of the Bernoulli distribution [p, 1−p] as a function of p, in bits]

Maximized for the uniform distribution p = 1/2:

H([1/2, 1/2]) = (1/2)(−log₂ 1/2) + (1/2)(−log₂ 1/2) = (1/2) log₂ 2 + (1/2) log₂ 2 = 1
Entropy
[Figure: entropy of the distribution [p1, p2, 1−p1−p2] as a function of p1 and p2]

H([p1, p2, 1−p1−p2]) = −p1 log₂ p1 − p2 log₂ p2 − (1−p1−p2) log₂(1−p1−p2)

For which p1 and p2 is the entropy the largest? What is the value of the entropy for that distribution?
Entropy recap
- Entropy is evaluated on probability distributions

  H(p) = −∑_m p(M = m) log₂ p(M = m)

- Given a distribution of messages, entropy tells us the least number of bits we would need to encode messages to communicate efficiently.
- The length of a code for a message should be −log₂ p(m). Sometimes this cannot be achieved.²
- Entropy is maximized for uniform distributions:

  H([1/K, …, 1/K]) = log₂ K

² If p(m) ≠ 2^(−c) for an integer c, we get fractional code lengths.
Cross-entropy
Suppose we were given message probabilities q.

We build our codebook based on q, and the optimal average message length is H(q).

But it turns out that q is incorrect and the true message probabilities are p.

Cross-entropy is the average message length under these circumstances:

H(p, q) = −∑_m p(m) log q(m)
Cross-entropy
Cross-entropy

H(p, q) = −∑_m p(m) log q(m)

is always greater than or equal to the entropy

H(p) = −∑_m p(m) log p(m).

Using the wrong codebook always incurs a cost in communication, since codeword lengths no longer match message frequencies.
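A small sketch illustrating this inequality with the earlier four-message example; a mistaken uniform q plays the role of the wrong codebook, and base-2 logarithms make the numbers read as bits:

```python
import math

def cross_entropy(p, q):
    # H(p, q) = -sum_m p(m) log2 q(m): average length when codes are built from q
    return sum(-pm * math.log2(qm) for pm, qm in zip(p, q) if pm > 0)

p = [0.5, 0.25, 0.125, 0.125]   # true message probabilities
q = [0.25, 0.25, 0.25, 0.25]    # mistaken uniform assumption

print(cross_entropy(p, q))  # 2.0: every codeword is 2 bits under q
print(cross_entropy(p, p))  # 1.75: H(p, p) = H(p), the optimum
```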
Kullback-Leibler divergence
Had we known the true distribution p, our average message length would be

H(p) = −∑_m p(m) log p(m)

but we didn't, and we are now on average using longer messages:

H(p, q) = −∑_m p(m) log q(m).

How much longer?

H(p, q) − H(p) = −∑_m p(m) log q(m) + ∑_m p(m) log p(m)

This difference is called the Kullback-Leibler divergence:

KL(p||q) = H(p, q) − H(p) = −∑_m p(m) log q(m) + ∑_m p(m) log p(m)
KL divergence
For discrete distributions:

KL(p||q) = ∑_x p(x) log ( p(x) / q(x) )

For continuous pdfs:

KL(p||q) = ∫_x p(x) log ( p(x) / q(x) ) dx

A couple of observations about KL-divergence:

1. KL(p||q) ≥ 0
2. KL(p||q) = 0 if and only if p = q
3. it is not symmetric: KL(p||q) ≠ KL(q||p)
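These observations can be checked numerically. A sketch using base-2 logarithms; the first pair of distributions is the earlier four-message example, and the second pair is a hypothetical example chosen to expose the asymmetry:

```python
import math

def kl(p, q):
    # KL(p||q) = sum_x p(x) log2( p(x) / q(x) ), with 0 log 0 = 0
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.25, 0.125, 0.125]
q = [0.25, 0.25, 0.25, 0.25]

print(kl(p, p))  # 0.0: the divergence vanishes iff the arguments are equal
print(kl(p, q))  # 0.25: extra bits per message from using the uniform codebook

a, b = [0.9, 0.1], [0.5, 0.5]
print(kl(a, b), kl(b, a))  # the two values differ: KL is not symmetric
```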
Summary
- Entropy as a measurement of the number of bits needed to communicate efficiently.
- KL-divergence as the number of bits that could be saved by using the right distribution.
- KL-divergence as a distance between distributions.
Maximum likelihood and KL minimization
Delta function or indicator function:

[x] = 1 if x is true, 0 otherwise

Data/empirical distribution:

f(x) = (1/N) ∑_{i=1}^N [x_i = x]

Importantly:

∫_x f(x) g(x) dx = ∫_x ( (1/N) ∑_{i=1}^N [x_i = x] ) g(x) dx = (1/N) ∑_{i=1}^N g(x_i)
Maximizing likelihood is equivalent to minimizing KL
ALL(θ) = ∑_{i=1}^N (1/N) log p(x_i|θ)

KL divergence between the empirical distribution f(x) and p(x|θ):

KL(f||p) = ∫_x f(x) log ( f(x) / p(x|θ) ) dx
  = ∫_x ( (1/N) ∑_{i=1}^N [x_i = x] ) log ( f(x) / p(x|θ) ) dx
  = ∑_{i=1}^N (1/N) log ( f(x_i) / p(x_i|θ) )
  = −∑_{i=1}^N (1/N) log p(x_i|θ) + ∑_{i=1}^N (1/N) log f(x_i)

where the first term is −ALL(θ) and the second term is constant in θ.
Maximizing likelihood is equivalent to minimizing KL
KL(f||p) = −ALL(θ) + const.

and hence

argmin_θ KL(f||p) = argmax_θ ALL(θ) = argmax_θ LL(θ)
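A small numerical sketch of this equivalence for a Bernoulli model, using a hypothetical four-flip sample and a grid search over θ; both criteria pick out the same parameter:

```python
import math

# Hypothetical coin-flip data; empirical distribution f over {0, 1}
data = [1, 1, 1, 0]
f = [data.count(0) / len(data), data.count(1) / len(data)]  # [0.25, 0.75]

def avg_ll(theta):
    # ALL(theta) = (1/N) sum_i log p(x_i | theta) for Bernoulli(theta)
    return sum(math.log(theta if x == 1 else 1 - theta) for x in data) / len(data)

def kl_f_p(theta):
    # KL between empirical f and the model p = [1 - theta, theta]
    p = [1 - theta, theta]
    return sum(fx * math.log(fx / px) for fx, px in zip(f, p) if fx > 0)

grid = [i / 100 for i in range(1, 100)]
best_ll = max(grid, key=avg_ll)   # maximize average log-likelihood
best_kl = min(grid, key=kl_f_p)   # minimize KL(f || p)
print(best_ll, best_kl)  # both 0.75, the empirical frequency of heads
```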
Mutual information
Mutual information between random variables is defined as

I(X;Y) = KL( p(X,Y) || p(X)p(Y) ) = ∑_x ∑_y p(x,y) log ( p(x,y) / (p(x)p(y)) )
Mutual information can be expressed as a difference of entropy and conditional entropy:
I(X;Y) = H(X) − H(X|Y)

Note that mutual information is symmetric:

I(X;Y) = I(Y;X) = H(Y) − H(Y|X)
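A sketch computing mutual information directly from a joint probability table; the two joint distributions below are hypothetical examples at the extremes:

```python
import math

def mutual_information(joint):
    # I(X;Y) = sum_{x,y} p(x,y) log2( p(x,y) / (p(x) p(y)) )
    px = [sum(row) for row in joint]            # marginal p(x): row sums
    py = [sum(col) for col in zip(*joint)]      # marginal p(y): column sums
    return sum(joint[i][j] * math.log2(joint[i][j] / (px[i] * py[j]))
               for i in range(len(px)) for j in range(len(py))
               if joint[i][j] > 0)

indep = [[0.25, 0.25], [0.25, 0.25]]   # X and Y independent
coupled = [[0.5, 0.0], [0.0, 0.5]]     # X determines Y

print(mutual_information(indep))    # 0.0 bits
print(mutual_information(coupled))  # 1.0 bit: knowing X removes all uncertainty in Y
```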
We will use mutual information in your homework to learn Bayesian networks.
Information theory
Relevant concepts:
- Entropy
- Cross-entropy
- Kullback-Leibler divergence
- Mutual information
Graphical models
Suppose we observe multiple correlated variables, such as words in a document or pixels in an image.

- How can we compactly represent the joint distribution p(x|θ)?
- How can we use this distribution to infer one set of variables given another in a reasonable amount of computation time?
- How can we learn the parameters of this distribution with a reasonable amount of data?

Graphical models provide an intuitive way of representing and visualising the relationships between many variables.
Representing knowledge through graphical models
- Nodes correspond to random variables
- Edges represent statistical dependencies between the variables

A graphical model is a way to represent a joint distribution by making conditional independence assumptions.
Graphical models = statistics × graph theory × computer science
Conditional independence
Marginal independence, which you are familiar with:

p(X|Y) = p(X)
p(X,Y) = p(X)p(Y)

Conditional independence:

p(X|Y,Z) = p(X|Z)
p(X,Y|Z) = p(X|Z)p(Y|Z)

And the shorthand for this relationship is

X ⊥ Y | Z
Note that the relationship is symmetric.
Examples of conditional independence
Shoe size ⊥ Gray hair | Age
Temperature inside ⊥ Temperature outside | Air conditioning is working
Pepsi or Coke ⊥ Sweetness | Restaurant chain
Federal funds rate ⊥ State of economy | Federal Reserve meeting notes
Dice roll outcome ⊥ Previous dice rolls | Dice is not loaded
Benefits of capturing conditional independence
The obvious benefit is in compact representation.
Suppose we have a probability distribution p(X,Y,Z) and the random variables X, Y, Z can each assume 5 states.

To represent the distribution we need to store 5³ probabilities.³

If we know that

X ⊥ Z | Y

then we can write

p(X,Y,Z) = p(X|Y,Z)p(Y|Z)p(Z) = p(X|Y)p(Y|Z)p(Z)

and instead of storing one large table we store 3 significantly smaller tables.⁴

³ OK, 5³ − 1, because the probabilities need to sum to 1.
⁴ 124 vs. 44 free entries; alternatively, 125 vs. 55 total entries.
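The footnote's counts can be reproduced directly:

```python
# Parameter counting from the slide: three variables, K = 5 states each.
K = 5

full_joint = K**3 - 1  # free entries in the full joint table (they sum to 1)
# With X ⊥ Z | Y: p(X|Y) has K*(K-1) free entries, p(Y|Z) has K*(K-1), p(Z) has K-1
factored = K * (K - 1) + K * (K - 1) + (K - 1)

print(full_joint, factored)  # 124 vs 44, matching the slide's footnote
```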
Benefits of capturing conditional independence
We can also capture such independencies in a graphical form.
X ← Y ← Z
p(X ,Y ,Z ) = p(X |Y )p(Y |Z )p(Z )
Bayesian networks
A directed graphical model is a graphical model whose graph is a directed acyclic graph (DAG). Also known as a Bayesian network, belief network, or causal network.

In DAGs the nodes can be ordered such that parents come before children. This is called a topological ordering.
Specifying a graphical model
Each of these nodes corresponds to a random variable.
[Figure: a DAG over nodes X1, …, X7 with edges X1→X2, X2→X3, X2→X4, X3→X5, X4→X6, X5→X7, X6→X7]

and each node has a conditional probability associated with it,

p(Xi | X_pa(i))

where pa(i) is the list of parent nodes of node i, e.g.

p(X7 | X5, X6)
Specifying a graphical model
[Figure: a DAG over nodes X1, …, X7 with edges X1→X2, X2→X3, X2→X4, X3→X5, X4→X6, X5→X7, X6→X7]

This graphical model specifies the joint distribution

p(X1, X2, X3, X4, X5, X6, X7) = ∏_i p(Xi | X_pa(i))
  = p(X1) p(X2|X1) p(X3|X2) p(X4|X2) p(X5|X3) p(X6|X4) p(X7|X5,X6)
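As a sanity check, the factorization can be evaluated numerically. A sketch with placeholder uniform CPTs over binary variables (the parent sets match the factorization above; the CPT values are made up):

```python
import itertools

# Parent sets pa(i) for the DAG on the slide
parents = {1: (), 2: (1,), 3: (2,), 4: (2,), 5: (3,), 6: (4,), 7: (5, 6)}

def cpt(i, pa_vals, xi):
    # Hypothetical CPT: p(X_i = xi | parents = pa_vals) = 1/2 for binary X_i
    return 0.5

def joint(x):
    # p(x) = prod_i p(x_i | x_pa(i)); x maps node index -> value
    prob = 1.0
    for i, pa in parents.items():
        prob *= cpt(i, tuple(x[j] for j in pa), x[i])
    return prob

# Summing over all 2^7 assignments shows the factorization is a valid joint
total = sum(joint(dict(zip(parents, vals)))
            for vals in itertools.product([0, 1], repeat=7))
print(total)  # 1.0
```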
Another example of a graphical model
[Figure: a tree-structured DAG over nodes X1, …, X7 with edges X1→X2, X1→X3, X2→X4, X2→X5, X3→X6, X3→X7]

p(X1, X2, X3, X4, X5, X6, X7) = ∏_i p(Xi | X_pa(i))
  = p(X1) p(X2|X1) p(X3|X1) p(X4|X2) p(X5|X2) p(X6|X3) p(X7|X3)
Example: naive Bayes classifiers as Bayesian networks
[Figure, left: a naive Bayes classifier represented as a directed graphical model. We assume there are D = 4 features, for simplicity. Shaded nodes are observed, unshaded nodes are hidden.]

[Figure, right: a tree-augmented naive Bayes (TAN) classifier for D = 4 features. In general, the tree topology can change depending on the value of y.]

p(y, x) = p(y) ∏_{j=1}^D p(x_j | y)
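A minimal sketch of this naive Bayes factorization with D = 4 binary features. All CPT numbers below are hypothetical, not from the lecture:

```python
# Naive Bayes: p(y, x) = p(y) * prod_j p(x_j | y), with made-up CPTs
p_y = {0: 0.6, 1: 0.4}            # class prior p(y)
p_xj_given_y = {                  # p(x_j = 1 | y) for j = 0..3
    0: [0.1, 0.2, 0.7, 0.4],
    1: [0.8, 0.6, 0.3, 0.5],
}

def joint(y, x):
    # p(y, x) = p(y) * prod_j p(x_j | y)
    prob = p_y[y]
    for j, xj in enumerate(x):
        pj = p_xj_given_y[y][j]
        prob *= pj if xj == 1 else 1 - pj
    return prob

def predict(x):
    # Argmax over the joint, equivalent to the posterior argmax over y
    return max(p_y, key=lambda y: joint(y, x))

print(predict([1, 1, 0, 1]))  # 1: this feature pattern favors class 1
```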
Example: Bayesian network for medical diagnosis
Asia Bayesian Network (Interactive Demo)
Example: Bayesian network on Instagram hashtags
We downloaded images and captions from Instagram that were tagged #buterfly.

These captions were converted into binary vectors of hashtag presence/absence.

A Bayesian network was constructed from these data.

For HW2, you will do that and much more on a different Instagram dataset.
Today
- Information Theory
- Bayesian Networks