Probability density estimation using Product of Conditional Experts



Page 1: Probability density estimation using Product of Conditional Experts

PROBABILITY DENSITY ESTIMATION

USING PRODUCT OF CONDITIONAL

EXPERTS

Project Guides:

Dr. Harish Karnick (IIT Kanpur)

Dr. Vinod Nair (MSR India)

Dr. Sundar S. (MSR India)

Submitted by:

Chirag Gupta 10212

Pulkit Jain 10543

Page 2

Density Estimation

Construct an estimate of the underlying probability density function from observed data.

Why?

Reveals the underlying pattern in the data

Yields useful statistical information, e.g. modality and skewness

Page 3

How is it done?

The observed data is treated as a set of i.i.d. samples from the unknown distribution.

Choose a model that can represent the underlying probability density function:

p(x) = f(x; θ)

Fit the model to the observed data by Maximum Likelihood Estimation: choose θ to maximize the probability of the observed data,

p(data) = p(x1) · p(x2) · … · p(xn)
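As a minimal illustration of MLE (not from the slides), fitting a 1-D Gaussian in numpy, where the likelihood maximizer has a closed form:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=0.5, size=10_000)  # i.i.d. samples

# For a Gaussian, maximizing sum_i log p(x_i; mu, sigma) has a closed-form
# solution: the sample mean and the (biased) sample standard deviation.
mu_hat = data.mean()
sigma_hat = data.std()
```

With 10,000 samples the estimates land close to the true parameters 2.0 and 0.5.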

Page 4

Background/Previous Work

High-dimensional data is modeled by combining relatively simple models.

Mixture models: the combination rule takes a weighted arithmetic mean of the individual distributions. Inefficient for high-dimensional data.

Product of Experts: the combination rule multiplies the relatively simple probabilistic models and renormalizes. High-dimensional data is comparatively easy to handle.
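The two combination rules can be contrasted on a toy 1-D example (our own sketch; the Gaussian experts are illustrative assumptions, not the project's experts):

```python
import numpy as np

# Two simple 1-D "expert" densities on a grid: unit Gaussians at -1.5 and 1.5.
x = np.linspace(-6, 6, 3001)
dx = x[1] - x[0]

def gauss(mean):
    return np.exp(-0.5 * (x - mean) ** 2) / np.sqrt(2 * np.pi)

f1, f2 = gauss(-1.5), gauss(1.5)

# Mixture model: weighted arithmetic mean (already normalized).
mixture = 0.5 * f1 + 0.5 * f2

# Product of experts: multiply the densities, then renormalize.
product = f1 * f2
product /= product.sum() * dx

# The mixture keeps mass near both expert means (bimodal); the product
# concentrates where both experts agree, here around 0.
```

The renormalization step is what makes PoE training hard: in high dimensions the normalizer cannot be computed on a grid like this.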

Page 5

Product Of Experts

The probability of a data point d under the model is calculated as (the PoE combination rule [1]):

p(d | θ1, …, θn) = ∏i fi(d; θi) / Σc ∏i fi(c; θi)

where the sum in the denominator runs over all points c in the data space,

fi is the i-th relatively simple 'expert', and

θi are the parameters of the i-th expert.

Page 6

Product of Conditional Experts

Often, the individual experts are not known in advance.

We use a Product of Conditional Experts (PoCE), wherein each expert gives the conditional probability of one variable given the others.

Page 7

Product of Conditional Experts

The conditional probability can be estimated using classification models that associate a probability with the output class.

Classification models such as Logistic Regression, Kernel Logistic Regression, and Decision Trees can be used.

We build a wrapper model that takes in the user's choice of experts and builds the estimate.
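A minimal sketch of such a model, under our reading of the slides that expert i is a logistic regression predicting dimension i from the remaining dimensions; the function names and toy data are our own, not the project's code:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_conditionals(X, lr=0.1, steps=500):
    """Train one logistic-regression expert per dimension: expert i
    models p(x_i = 1 | remaining dimensions) by gradient ascent on
    its conditional log-likelihood."""
    n, d = X.shape
    params = []
    for i in range(d):
        Z = np.hstack([np.delete(X, i, axis=1), np.ones((n, 1))])  # + bias
        y = X[:, i]
        w = np.zeros(Z.shape[1])
        for _ in range(steps):
            p = sigmoid(Z @ w)
            w += lr * Z.T @ (y - p) / n  # gradient of the log-likelihood
        params.append(w)
    return params

def log_score(X, params):
    """Sum of log conditional probabilities: an unnormalized log density
    under the product of conditional experts."""
    n = X.shape[0]
    total = np.zeros(n)
    for i, w in enumerate(params):
        Z = np.hstack([np.delete(X, i, axis=1), np.ones((n, 1))])
        p = np.clip(sigmoid(Z @ w), 1e-9, 1 - 1e-9)
        total += X[:, i] * np.log(p) + (1 - X[:, i]) * np.log(1 - p)
    return total

# Toy binary data with correlated bits: x2 copies x1 90% of the time.
rng = np.random.default_rng(0)
x1 = rng.integers(0, 2, size=(500, 1))
x2 = np.where(rng.random((500, 1)) < 0.9, x1, 1 - x1)
X = np.hstack([x1, x2]).astype(float)

params = fit_conditionals(X)
```

After training, frequently seen patterns such as (1, 1) score higher than rare ones such as (1, 0), which is the behavior density estimation needs.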

Page 8

Learning/Training

Follow a gradient ascent algorithm to maximize the average log probability of the observed data.

The gradient of the objective function with respect to the parameters of expert i is (from [1]):

∂ log p(d) / ∂θi = ∂ log fi(d; θi) / ∂θi − Σc p(c) ∂ log fi(c; θi) / ∂θi

Page 9

Learning/Training

The second term in the gradient expression (arising from the normalization term) is intractable to compute: it is an expectation over all points in the data space.

Contrastive divergence [1] is therefore used to approximate the gradient.
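Contrastive divergence replaces the intractable model expectation with statistics from a single sampling step started at the data. The idea is easiest to see for a binary RBM [4]; the following CD-1 sketch is illustrative toy code of ours, not the project's implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(V, W, lr, rng):
    """One CD-1 update for a binary RBM (biases omitted for brevity).
    The positive phase uses the data; the intractable model expectation
    is approximated with a single Gibbs step away from the data."""
    h_prob = sigmoid(V @ W)                      # p(h = 1 | v_data)
    h_samp = (rng.random(h_prob.shape) < h_prob).astype(float)
    v_recon = sigmoid(h_samp @ W.T)              # one-step reconstruction
    h_recon = sigmoid(v_recon @ W)
    pos = V.T @ h_prob                           # <v h> under the data
    neg = v_recon.T @ h_recon                    # <v h> after one Gibbs step
    return W + lr * (pos - neg) / len(V)

rng = np.random.default_rng(0)
V = rng.integers(0, 2, size=(100, 6)).astype(float)   # toy binary data
W = 0.01 * rng.standard_normal((6, 4))                # 6 visible, 4 hidden
for _ in range(10):
    W = cd1_update(V, W, lr=0.05, rng=rng)
```

The same positive-minus-negative structure applies to the PoCE gradient above: the data term is exact, and only the normalization term is approximated.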

Page 10

Experts Considered

Logistic Regression (a linear classification model)

Kernel Logistic Regression (a non-linear model)

Page 11

Density Estimation: Artificial Datasets

[Figure: Set 1 (100 examples, 5 dimensions): true distribution vs. distributions learnt by the linear and non-linear models; plots not preserved in the transcript.]

Page 12

Density Estimation: Artificial Datasets

[Figure: Set 2 (100 examples, 5 dimensions): true distribution vs. distributions learnt by the linear and non-linear models; plots not preserved in the transcript.]

Pages 13-16

Density Estimation: Real Datasets

[Figures: results on the Adult dataset (200 train + 500 test and 500 train + 500 test) and the OCR dataset (200 train + 500 test and 500 train + 500 test); plots not preserved in the transcript.]

Page 17

Density Estimation

For the artificial sets, the learned distribution is close to the actual distribution.

MoB clearly performs better on the training examples than the linear and non-linear models.

However, the PoCE model generalizes better: it achieves higher log probability on the test set.

The advantage is significantly larger when fewer training examples are available.

Page 18

Application: Outlier Detection

Detect points that do not belong to a particular class of points.

If the model builds a good enough representation of the data, it should assign high probability to points in the inlier class and relatively low probability to points outside it.

We train the model on a mix of two classes, with fewer examples (< 5%) from the outlier class.

We then test whether outliers and inliers are assigned low and high probabilities respectively.
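A sketch of the scoring step, with a fitted Gaussian standing in for the learned density model (the slides use the PoCE model's log probability in the same role; the data and names here are our own):

```python
import numpy as np

rng = np.random.default_rng(0)
inliers = rng.normal(0.0, 1.0, size=(950, 2))
outliers = rng.normal(6.0, 1.0, size=(50, 2))   # < 5% of the training mix
train = np.vstack([inliers, outliers])

# Stand-in density model: a single Gaussian fitted to the training mix.
mu = train.mean(axis=0)
cov = np.cov(train.T)
inv = np.linalg.inv(cov)
logdet = np.linalg.slogdet(cov)[1]

def log_prob(X):
    d = X - mu
    maha = np.einsum('ij,jk,ik->i', d, inv, d)   # squared Mahalanobis distance
    return -0.5 * (maha + logdet + 2 * np.log(2 * np.pi))

# Outliers should receive lower log probability than inliers on average.
score_in = log_prob(inliers).mean()
score_out = log_prob(outliers).mean()
```

Thresholding this score then yields an outlier detector.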

Pages 19-22

Application: Outlier Detection

[Figures: outlier-detection result plots; not preserved in the transcript.]

Page 23

Application: Outlier Detection

In 3 out of 4 cases, the outliers in both the test and training data receive a lower average probability.

This suggests that outlier detection can be carried out to some extent.

We now present the precision-recall curves obtained for similar class pairs. Five outliers were kept in both the training and test sets to obtain the precision-recall curves.
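The precision-recall computation for "flag the lowest-probability points first" can be sketched as follows (our own code, with synthetic scores, not the project's data):

```python
import numpy as np

def precision_recall(scores, is_outlier):
    """PR curve for the rule 'flag the k lowest-probability points as
    outliers', sweeping k from 1 to n."""
    order = np.argsort(scores)                  # lowest log probability first
    labels = np.asarray(is_outlier, dtype=float)[order]
    tp = np.cumsum(labels)                      # outliers caught in the top k
    k = np.arange(1, len(labels) + 1)
    return tp / k, tp / labels.sum()            # precision, recall

# Synthetic scores: 5 outliers with clearly lower log probability than
# 95 inliers.
rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(-2, 0.5, 95), rng.normal(-6, 0.5, 5)])
labels = np.array([0] * 95 + [1] * 5)
prec, rec = precision_recall(scores, labels)
```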

Pages 24-27

Application: Outlier Detection

[Figures: precision-recall curves for the class pairs CYT – MIT (pages 24-25) and CYT – NUC (pages 26-27); plots not preserved in the transcript.]

Page 28

Application: Outlier Detection

Test set: all three models perform equally well.

Training set: KLR does as well as MoB; LR performs better than MoB in some cases and worse in others.

Page 29

Future Work

Computing the partition function (normalization term), e.g. via Annealed Importance Sampling

Evaluating on larger datasets

Trying other experts, such as Decision Trees, which are known for their interpretability

Page 30

References

[1] Geoffrey Hinton. Training Products of Experts by Minimizing Contrastive Divergence. 2002.

[2] Hugo Larochelle, Iain Murray. The Neural Autoregressive Distribution Estimator. 2011.

[3] KL Divergence, http://en.wikipedia.org/wiki/Kullback–Leibler_divergence

[4] Restricted Boltzmann Machines, http://en.wikipedia.org/wiki/Restricted_Boltzmann_machine

[5] Logistic Regression, http://en.wikipedia.org/wiki/Logistic_regression

[6] Mixture Models, http://en.wikipedia.org/wiki/Mixture_model

[7] UCI Machine Learning Repository, http://archive.ics.uci.edu/ml/

Page 31

THANK YOU