Midterm Exam
• 03/04, in class
Project
• It is team work: no more than 2 people per team
• Define a project of your own; otherwise, I will assign you a "tough" project
• Important dates:
  03/23: project proposal
  04/27 and 04/29: presentations
  05/02: final report
Project Proposal
• Introduction: describe the research problem
• Related work: describe the existing approaches and their deficiencies
• Proposed approaches: describe your approaches and their potential to overcome the shortcomings of the existing approaches
• Plan: the plan for this project (code development, data sets, and evaluation)
• Format: it should look like a research paper
• The required format (both Microsoft Word and LaTeX) can be downloaded from www.cse.msu.edu/~cse847/assignments/format.zip
• Warning: any submission that does not follow the format will be given a score of zero.
Project Report
• The same format as the proposal
• Expand the proposal with a detailed description of your algorithm and evaluation results
• Presentation:
  • 25-minute presentation
  • 5-minute discussion
Introduction to Information Theory
Rong Jin
Information
• Information ≠ knowledge
• Information: reduction in uncertainty
• Example:
  1. flip a coin
  2. roll a die
  #2 is more uncertain than #1; therefore, more information is provided by the outcome of #2 than by #1
Definition of Information
Let E be some event that occurs with probability P(E). If we are told that E has occurred, then we say we have received I(E)=log2(1/P(E)) bits of information
Example:
• Result of a fair coin flip: log2 2 = 1 bit
• Result of a fair die roll: log2 6 ≈ 2.585 bits
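To make the definition concrete, here is a minimal Python sketch (not from the slides) that computes I(E) = log2(1/P(E)) for the two examples above.

```python
import math

def information_bits(p_event: float) -> float:
    """Information received when an event of probability p_event occurs."""
    return math.log2(1.0 / p_event)

print(information_bits(1 / 2))  # fair coin flip -> 1.0 bit
print(information_bits(1 / 6))  # fair die roll  -> ~2.585 bits
```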
Entropy
• A zero-memory information source S is a source that emits symbols from an alphabet {s1, s2,…, sk} with probability {p1, p2,…,pk}, respectively, where the symbols emitted are statistically independent.
• Entropy is the average amount of information gained by observing the output of S: H(S) = Σi pi log2(1/pi)
Entropy
1. 0 ≤ H(P) ≤ log2 k
2. Measures the uniformness of a distribution P: the further P is from uniform, the lower the entropy.
3. For any other probability distribution {q1, …, qk}: Σi pi log2(1/qi) ≥ Σi pi log2(1/pi) = H(P)
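As an illustration of these properties, the following Python sketch (my own, not from the slides) computes H(P) for a uniform and a skewed distribution and checks the bound in property 3 numerically.

```python
import math

def entropy_bits(p):
    """H(P) = sum_i p_i * log2(1 / p_i), ignoring zero-probability symbols."""
    return sum(pi * math.log2(1.0 / pi) for pi in p if pi > 0)

uniform = [1 / 4] * 4
skewed = [0.7, 0.1, 0.1, 0.1]

# Property 1: 0 <= H(P) <= log2(k)
print(entropy_bits(uniform))          # 2.0 = log2(4), the maximum
# Property 2: the further P is from uniform, the lower the entropy
print(entropy_bits(skewed))           # ~1.357 < 2.0
# Property 3: sum_i p_i log2(1/q_i) >= H(P) for any other distribution Q
cross = sum(pi * math.log2(1.0 / qi) for pi, qi in zip(skewed, uniform))
print(cross >= entropy_bits(skewed))  # True
```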
A Distance Measure Between Distributions
• Kullback-Leibler distance between distributions P and Q: D(P, Q) = Σi pi log2(pi/qi)
• 0 ≤ D(P, Q)
• The smaller D(P, Q), the more similar Q is to P
• Non-symmetric: D(P, Q) ≠ D(Q, P)
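A minimal Python sketch of the KL distance, illustrating its non-negativity and asymmetry; the example distributions below are made up for illustration.

```python
import math

def kl_bits(p, q):
    """D(P, Q) = sum_i p_i * log2(p_i / q_i); assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.7, 0.2, 0.1]
Q = [1 / 3, 1 / 3, 1 / 3]

print(kl_bits(P, Q))  # >= 0
print(kl_bits(Q, P))  # generally different: D(P, Q) != D(Q, P)
```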
Mutual Information
• Indicates the amount of information shared between two random variables: I(X;Y) = Σx,y p(x,y) log2 [ p(x,y) / (p(x) p(y)) ]
• Symmetric: I(X;Y) = I(Y;X)
• Zero iff X and Y are independent
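A short Python sketch (not from the slides) that computes I(X;Y) from a joint distribution table, treating it as the KL distance between the joint p(x,y) and the product of marginals p(x)p(y); the two toy joint tables are assumptions for illustration.

```python
import math

def mutual_information_bits(joint):
    """I(X;Y) for a joint distribution given as a 2-D list p[x][y]."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(
        pxy * math.log2(pxy / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row)
        if pxy > 0
    )

independent = [[0.25, 0.25], [0.25, 0.25]]  # X and Y independent
dependent = [[0.45, 0.05], [0.05, 0.45]]    # X and Y strongly correlated

print(mutual_information_bits(independent))  # 0.0
print(mutual_information_bits(dependent))    # > 0, and I(X;Y) = I(Y;X)
```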
Maximum Entropy
Rong Jin
Motivation
Consider a translation example
• English 'in' → French {dans, en, à, au-cours-de, pendant}
• Goal: estimate p(dans), p(en), p(à), p(au-cours-de), p(pendant)
Case 1: no prior knowledge of the translation
Case 2: 30% of the time either dans or en is used
Maximum Entropy Model: Motivation
• Case 3: 30% of the time dans or en is used, and 50% of the time dans or à is used
• Need a measure of the uniformness of a distribution
Maximum Entropy Principle (MaxEnt)
• p(dans) = 0.2, p(à) = 0.3, p(en) = 0.1
• p(au-cours-de) = 0.2, p(pendant) = 0.2
P* = argmaxP H(P)

where H(P) = −[ p(dans) log p(dans) + p(en) log p(en) + p(à) log p(à)
              + p(au-cours-de) log p(au-cours-de) + p(pendant) log p(pendant) ]

subject to
p(dans) + p(en) = 3/10
p(dans) + p(à) = 1/2
p(dans) + p(en) + p(à) + p(au-cours-de) + p(pendant) = 1
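One way to check this slide numerically is to solve the constrained problem with an off-the-shelf optimizer; the sketch below uses scipy's SLSQP solver and is my own illustration, not the slide's method.

```python
import numpy as np
from scipy.optimize import minimize

# variable ordering: dans, en, a, au-cours-de, pendant
def neg_entropy(p):
    p = np.clip(p, 1e-12, 1.0)
    return float(np.sum(p * np.log(p)))  # minimizing -H(P) maximizes entropy

constraints = [
    {"type": "eq", "fun": lambda p: p[0] + p[1] - 3 / 10},  # p(dans) + p(en) = 3/10
    {"type": "eq", "fun": lambda p: p[0] + p[2] - 1 / 2},   # p(dans) + p(a)  = 1/2
    {"type": "eq", "fun": lambda p: np.sum(p) - 1.0},       # probabilities sum to 1
]

result = minimize(neg_entropy, x0=np.full(5, 0.2),
                  bounds=[(0.0, 1.0)] * 5, constraints=constraints,
                  method="SLSQP")
print(result.x)  # the maximum-entropy distribution satisfying the constraints
```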
MaxEnt for Classification
Objective: learn p(y|x)
• Constraints
  • Appropriate normalization: Σy p(y|x) = 1 for every x
MaxEnt for Classification
• Constraints
  • Consistent with the data: the model mean of each feature function matches its empirical mean
• Feature function f(x, y)
• Empirical mean of feature functions: (1/N) Σi f(xi, yi)
• Model mean of feature functions: (1/N) Σi Σy p(y|xi) f(xi, y)
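A rough Python sketch of the two quantities above; the helper names, the toy data, and the placeholder model p(y|x) are all hypothetical.

```python
import numpy as np

def empirical_feature_mean(features, X, Y):
    """(1/N) * sum_i f_k(x_i, y_i) for every feature f_k."""
    return np.mean([[f(x, y) for f in features] for x, y in zip(X, Y)], axis=0)

def model_feature_mean(features, X, classes, cond_prob):
    """(1/N) * sum_i sum_y p(y|x_i) * f_k(x_i, y) under the current model."""
    return np.mean(
        [[sum(cond_prob(y, x) * f(x, y) for y in classes) for f in features]
         for x in X],
        axis=0,
    )

# toy usage (all names and numbers are hypothetical)
X = ["dans", "en", "pendant"]
Y = [1, 1, 0]
features = [lambda x, y: float(x in {"dans", "en"} and y == 1)]
uniform = lambda y, x: 0.5  # placeholder model: p(y|x) = 1/2 for both classes

print(empirical_feature_mean(features, X, Y))            # ~[0.667]
print(model_feature_mean(features, X, [0, 1], uniform))  # ~[0.333]
```

The MaxEnt constraints require these two means to be equal for every feature.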
MaxEnt for Classification
• No assumption about p(y|x) (non-parametric)
• Only need the empirical mean of the feature functions
MaxEnt for Classification
• Feature function
Example of Feature Functions
x                   f1(x) = I(x ∈ {dans, en})    f2(x) = I(x ∈ {dans, à})
dans                1                            1
en                  1                            0
au-cours-de         0                            0
à                   0                            1
pendant             0                            0
Empirical average   0.3                          0.5
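A quick check of the empirical-average row, assuming the candidate distribution listed on the MaxEnt-principle slide above (p(dans) = 0.2, p(en) = 0.1, p(à) = 0.3, p(au-cours-de) = 0.2, p(pendant) = 0.2).

```python
# candidate distribution from the earlier slide (an assumption for this check)
p = {"dans": 0.2, "en": 0.1, "au-cours-de": 0.2, "a": 0.3, "pendant": 0.2}

f1 = lambda x: 1.0 if x in {"dans", "en"} else 0.0
f2 = lambda x: 1.0 if x in {"dans", "a"} else 0.0

print(sum(p[x] * f1(x) for x in p))  # 0.3, matching the table's average for f1
print(sum(p[x] * f2(x) for x in p))  # 0.5, matching the table's average for f2
```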
Solution to MaxEnt
• Identical to the conditional exponential model
• Solve W by maximum likelihood estimation
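A minimal sketch of the conditional exponential form p(y|x) ∝ exp(Σk wk fk(x, y)); the feature functions and weights below are invented for illustration.

```python
import numpy as np

def conditional_exponential(W, feature_vec, x, classes):
    """p(y|x) proportional to exp(W . f(x, y)) -- the exponential-form solution."""
    scores = np.array([W @ feature_vec(x, y) for y in classes])
    scores -= scores.max()  # shift for numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()

# hypothetical 2-feature example
feature_vec = lambda x, y: np.array([float(x in {"dans", "en"} and y == 1),
                                     float(x in {"dans", "a"} and y == 1)])
W = np.array([0.5, -0.2])
print(conditional_exponential(W, feature_vec, "dans", classes=[0, 1]))
```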
Iterative Scaling (IS) Algorithm
• Assume every feature is non-negative and the features of each example sum to a constant
Iterative Scaling (IS) Algorithm
• Compute the empirical mean of every feature for every class
• Initialize W
• Repeat:
  • Compute p(y|x) for each training example (xi, yi) using W
  • Compute the model mean of every feature for every class
  • Update W
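A sketch of one update in the spirit of Generalized Iterative Scaling, assuming (as above) non-negative features that sum to a constant C for every (x, y); the exact update on the slide may differ, so treat this as an illustration.

```python
import numpy as np

def gis_step(W, X, Y, classes, feature_vec, C):
    """One Generalized Iterative Scaling update of the weight vector W.

    Assumes every feature is non-negative and that the features of each
    (x, y) pair sum to the constant C.
    """
    def cond_prob(x):
        # p(y|x) under the current W (conditional exponential model)
        scores = np.array([W @ feature_vec(x, y) for y in classes])
        scores -= scores.max()
        p = np.exp(scores)
        return p / p.sum()

    # empirical mean of every feature
    empirical = np.mean([feature_vec(x, y) for x, y in zip(X, Y)], axis=0)
    # model mean of every feature under the current W
    model = np.mean(
        [sum(p_y * feature_vec(x, y) for p_y, y in zip(cond_prob(x), classes))
         for x in X],
        axis=0,
    )
    eps = 1e-12  # guard against log of / division by zero
    return W + (1.0 / C) * np.log((empirical + eps) / (model + eps))
```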
Iterative Scaling (IS) Algorithm
• It guarantees that the likelihood function always increases
Iterative Scaling (IS) Algorithm
• What about features that can take both positive and negative values?
• What if the sum of the features is not a constant?