Probability density estimation using Product of Conditional Experts
PROBABILITY DENSITY ESTIMATION USING PRODUCT OF CONDITIONAL EXPERTS
Project Guides:
Dr. Harish Karnick (IIT Kanpur)
Dr. Vinod Nair (MSR India)
Dr. Sundar S. (MSR India)
Submitted by:
Chirag Gupta 10212
Pulkit Jain 10543
Density Estimation
Construct an estimate of the underlying probability
distribution function from observed data
Why?
Underlying pattern in data
Useful statistical information
Modality
Skewness
How is it done?
The observed data is considered a set of i.i.d.
samples from the distribution
Choose a model that can estimate the underlying
probability density function
Fit the model to the observed data
Maximum Likelihood Estimation
Maximize the probability of the observed data under the model p(x) = f(x, θ)
Since the samples are i.i.d., maximize p(data) = p(x1) · p(x2) · … · p(xn), or equivalently the log-likelihood Σ_i log p(xi) (a worked sketch follows)
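To make this concrete, here is a minimal numpy sketch (not from the slides) that fits a univariate Gaussian by maximum likelihood; for this particular model the maximizer of the log-likelihood has a closed form, though in general one optimizes numerically.

```python
import numpy as np

# Toy data: i.i.d. samples from an unknown distribution.
rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=1000)

# Model: p(x) = f(x, theta) with theta = (mu, sigma), a Gaussian.
# Maximizing p(x1) * ... * p(xn), i.e. the sum of log p(xi),
# has the familiar closed-form solution for this model:
mu_mle = data.mean()
sigma_mle = data.std()  # the MLE uses the 1/n variance estimator

def avg_log_likelihood(x, mu, sigma):
    """Average log p(x_i) under a N(mu, sigma^2) model."""
    return np.mean(-0.5 * np.log(2 * np.pi * sigma**2)
                   - (x - mu) ** 2 / (2 * sigma**2))

print(mu_mle, sigma_mle, avg_log_likelihood(data, mu_mle, sigma_mle))
```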
Background/Previous Work
High-dimensional data is modeled by combining relatively simple models
Mixture models
Combination rule: take a weighted arithmetic mean of the individual distributions
Inefficient for high-dimensional data, since a mixture can be no sharper than its individual components
Product of Experts
Combination rule: multiply the relatively simple probabilistic models and renormalize
High-dimensional data is relatively easy to handle, since each expert can veto its own region of the space (a small numeric contrast follows)
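To make the two combination rules concrete, a small illustrative sketch (not from the slides) over a five-point discrete domain:

```python
import numpy as np

# Two simple 'expert' distributions over a 5-point discrete domain.
p1 = np.array([0.10, 0.20, 0.40, 0.20, 0.10])
p2 = np.array([0.05, 0.15, 0.30, 0.30, 0.20])

# Mixture: weighted arithmetic mean of the individual distributions.
w = np.array([0.5, 0.5])
mixture = w[0] * p1 + w[1] * p2      # already normalized

# Product of Experts: multiply pointwise, then renormalize.
product = p1 * p2
product /= product.sum()             # divide by the constant Z

print("mixture:", mixture)   # no sharper than the sharpest expert
print("product:", product)   # sharper: low values act as vetoes
```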
Product of Experts
Probability of a data point d under the model is calculated as

p(d | θ_1, …, θ_n) = ∏_i f_i(d | θ_i) / Σ_c ∏_i f_i(c | θ_i)

f_i is the relatively simple i-th 'expert'
θ_i are the parameters of the i-th expert
The sum over c runs over all points in the data space and gives the normalization constant Z
Product of Conditional Experts
Often, suitable individual experts are not known in advance
We use a Product of Conditional Experts (PoCE), wherein each expert gives a conditional probability, e.g. of one component of a data point given the remaining components
Product of Conditional Experts
Conditional probability can be estimated by using
classification models that associate a probability
with the output class
Classifiers such as Logistic Regression, Kernel Logistic Regression, and Decision Trees can be used
We build a wrapper model that takes in the user's choice of experts and builds the estimate (a minimal sketch follows)
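A minimal sketch of what such a wrapper could look like for binary data, assuming logistic-regression experts, one per dimension, each predicting that dimension from the rest. The slides do not show the wrapper's interface, so the function names here are illustrative, and the sum-of-conditional-log-probabilities score is unnormalized (it ignores the constant Z).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_conditional_experts(X):
    """One logistic-regression expert per dimension, each estimating
    p(x_i | x_rest) from the remaining dimensions (illustrative)."""
    experts = []
    for i in range(X.shape[1]):
        rest = np.delete(X, i, axis=1)
        experts.append(LogisticRegression().fit(rest, X[:, i]))
    return experts

def unnormalized_log_prob(experts, x):
    """Sum of conditional log-probabilities, sum_i log f_i(x_i | x_rest).
    The model's density also divides by an intractable constant Z,
    which this score ignores."""
    total = 0.0
    for i, clf in enumerate(experts):
        rest = np.delete(x, i).reshape(1, -1)
        probs = clf.predict_proba(rest)[0]
        total += np.log(probs[list(clf.classes_).index(x[i])])
    return total

rng = np.random.default_rng(0)
X = (rng.random((200, 5)) > 0.5).astype(int)   # toy binary data
experts = fit_conditional_experts(X)
print(unnormalized_log_prob(experts, X[0]))
```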
Learning/Training
Follow a gradient-ascent algorithm to maximize the average log probability of the training data
The gradient of the objective for the m-th expert's parameters is

∂ log p(d | θ_1, …, θ_n) / ∂θ_m = ∂ log f_m(d | θ_m) / ∂θ_m − Σ_c p(c | θ_1, …, θ_n) ∂ log f_m(c | θ_m) / ∂θ_m
Learning/Training
The second term in the gradient expression (an expectation under the model, arising from the normalization term) is intractable to compute
Contrastive divergence [1] is therefore used to approximate the gradient (a sketch follows)
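A sketch of the CD-1 idea: the intractable expectation under the model is replaced by an expectation over one-step reconstructions of the data. The helpers grad_log_f and gibbs_step are hypothetical placeholders, not functions from the project.

```python
import numpy as np

def cd1_gradient(data, theta, grad_log_f, gibbs_step, rng):
    """CD-1 estimate of the log-likelihood gradient.

    Exact gradient: E_data[grad log f] - E_model[grad log f].
    The model expectation is intractable, so CD-1 replaces it with
    an expectation over one-step 'reconstructions' of the data.

    grad_log_f(x, theta): gradient of the unnormalized log-probability
    gibbs_step(x, theta, rng): one sampling step from the model's
        conditionals, started at x (both are hypothetical placeholders)
    """
    positive = np.mean([grad_log_f(x, theta) for x in data], axis=0)
    recons = [gibbs_step(x, theta, rng) for x in data]
    negative = np.mean([grad_log_f(x, theta) for x in recons], axis=0)
    return positive - negative

# Gradient-ascent loop (sketch):
# for step in range(num_steps):
#     theta += lr * cd1_gradient(data, theta, grad_log_f, gibbs_step, rng)
```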
Experts Considered
Logistic Regression (linear classification model)
Kernel Logistic Regression (non-linear model); a sketch of one common approximation follows
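scikit-learn ships logistic regression but no kernel logistic regression class; one common approximation, sketched below under that assumption, maps inputs through an explicit random-feature approximation of the RBF kernel (RBFSampler) and fits an ordinary logistic regression on top.

```python
import numpy as np
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.random((200, 5))
y = (X[:, 0] * X[:, 1] > 0.25).astype(int)   # a non-linear boundary

# Linear expert: plain logistic regression.
linear = LogisticRegression().fit(X, y)

# Non-linear expert: random Fourier features approximating an RBF
# kernel, followed by logistic regression on the mapped inputs.
klr = make_pipeline(RBFSampler(gamma=1.0, random_state=0),
                    LogisticRegression()).fit(X, y)

print("linear:", linear.score(X, y))
print("kernel:", klr.score(X, y))
```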
Density Estimation: Artificial Datasets
[Plots: true distribution vs. distributions learnt by the linear and non-linear models]
Set 1: 100 examples, 5 dimensions
Set 2: 100 examples, 5 dimensions
Density Estimation: Real Datasets
[Plots for four settings:]
Adult (200 train + 500 test)
Adult (500 train + 500 test)
OCR (200 train + 500 test)
OCR (500 train + 500 test)
Density Estimation
For the artificial sets, the learned distribution is close to the actual distribution
MoB (the mixture-of-Bernoullis baseline) clearly performs better on the training examples than the linear and non-linear models
However, the PoCE model generalizes better: higher log probability on the test set (the evaluation metric is sketched below)
The advantage is more pronounced with fewer training examples
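The metric behind these comparisons is average held-out log probability; a minimal sketch, assuming each model exposes a hypothetical log_prob callable:

```python
import numpy as np

def avg_test_log_prob(log_prob, test_data):
    """Average per-example log probability on held-out data; higher is
    better. `log_prob` is a hypothetical callable returning the model's
    log p(x) for a single point."""
    return np.mean([log_prob(x) for x in test_data])

# Comparing this value across models on the same test split is how
# generalization is judged above.
```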
Application: Outlier Detection
Detect points that do not belong to a particular class of points
If the model builds a good enough representation of the data, it should assign high probability to points from the inlier class and relatively low probability to points outside it
We train the model on a mix of two classes, with few examples (< 5%) from the outlier class
Test whether outliers and inliers are assigned low and high probabilities respectively (a scoring sketch follows)
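A sketch of the scoring step, again assuming a hypothetical log_prob callable for the trained density model:

```python
import numpy as np

def flag_outliers(log_prob, points, threshold):
    """Mark points whose model log-probability falls below a threshold.
    `log_prob` is a hypothetical callable returning log p(x)."""
    scores = np.array([log_prob(x) for x in points])
    return scores < threshold   # True = predicted outlier

# One simple threshold choice: a low percentile of the training scores,
# since fewer than 5% of training points are outliers, e.g.
# threshold = np.percentile(train_scores, 5)
```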
Application: Outlier Detection
[Plots: average probability assigned to inliers vs. outliers in four cases]
In 3 out of 4 cases, the outliers in both the test and the training data get lower average probability
This suggests that outlier detection can be carried out to some extent
We now present the precision-recall curves obtained for similar class pairs; five outliers were kept in the training as well as the test set to obtain the curves (a computation sketch follows)
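A precision-recall curve can be obtained by ranking points by negated log probability (so outliers score high) and sweeping a threshold; a sketch with scikit-learn's precision_recall_curve on illustrative values, not the experiment's actual numbers:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# y_true: 1 = outlier, 0 = inlier; log_probs: model log p(x).
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
log_probs = np.array([-2.1, -1.8, -2.5, -2.0, -1.9, -2.2, -2.3,
                      -5.0, -4.1, -2.4])

# Lower probability should mean 'more outlier-like', so negate.
precision, recall, _ = precision_recall_curve(y_true, -log_probs)
print(precision)
print(recall)
```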
Application: Outlier Detection
[Precision-recall curves for the class pairs CYT–MIT and CYT–NUC]
Test set: all three models perform equally well
Training set: KLR does as well as MoB; LR does better than MoB in some cases and worse in others
Future Work
Computing the partition function (normalization constant)
Annealed Importance Sampling
Evaluate on larger datasets
Try other experts
Decision Trees, known for interpretability
References
[1] Geoffrey Hinton. Training Products of Experts by Minimizing Contrastive Divergence. 2002.
[2] Hugo Larochelle, Iain Murray. The Neural Autoregressive Distribution Estimator. 2011.
[3] KL Divergence, http://en.wikipedia.org/wiki/Kullback-Leibler_divergence
[4] Restricted Boltzmann Machines, http://en.wikipedia.org/wiki/Restricted_Boltzmann_machine
[5] Logistic Regression, http://en.wikipedia.org/wiki/Logistic_regression
[6] Mixture Models, http://en.wikipedia.org/wiki/Mixture_model
[7] UCI Repository, http://archive.ics.uci.edu/ml/
THANK YOU