Probability density estimation using Product of Conditional Experts

PROBABILITY DENSITY ESTIMATION USING PRODUCT OF CONDITIONAL EXPERTS

Project Guides:

Dr. Harish Karnick (IIT Kanpur)

Dr. Vinod Nair (MSR India)

Dr. Sundar S. (MSR India)

Submitted by:

Chirag Gupta 10212

Pulkit Jain 10543

Density Estimation

Construct an estimate of the underlying probability distribution function from observed data

Why?

Underlying pattern in data

Useful statistical information

Modality

Skewness

How is it done?

The observed data is considered a set of i.i.d. samples from the distribution

Choose a model that can estimate the underlying probability density function

Fit the model to the observed data

Maximum Likelihood Estimation

Maximize the probability of observed data

p(x) = f(x, θ)

Maximize p(data) = p(x1) * p(x2) * … * p(xn)
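Equivalently, and more conveniently in practice, one maximizes the average log-likelihood, which is the objective the training procedure later optimizes:

\[
\hat{\theta} \;=\; \arg\max_{\theta} \; \frac{1}{n} \sum_{i=1}^{n} \log p(x_i \mid \theta)
\]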

Background/Previous Work

High-dimensional data is modeled by combining relatively simple models

Mixture models

The combination rule takes a weighted arithmetic mean of the individual distributions.

Inefficient for high-dimensional data

Product of Experts

The combination rule is to multiply the relatively simple probabilistic models and renormalize

High-dimensional data is relatively easy to handle

Product of Experts

The probability of a data point d under the model is calculated as a renormalized product of the experts (see the formula below)

f_i is the relatively simple i-th 'expert'

θ_i are the parameters of the i-th expert
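Written out, the combination rule from [1] is a product over all experts divided by a normalizer, where the sum in the denominator runs over all possible data vectors c:

\[
p(d \mid \theta_1, \dots, \theta_n) \;=\; \frac{\prod_i f_i(d \mid \theta_i)}{\sum_c \prod_i f_i(c \mid \theta_i)}
\]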

Product of Conditional Experts

Often, the individual experts are not known

We use a Product of Conditional Experts, wherein each expert gives a conditional probability
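One way to make this concrete, assuming each expert is a classifier that predicts one dimension from the remaining ones (an assumption about the exact form, not stated explicitly here), is to take expert i to model x_i given the other dimensions, so that, up to the normalization constant,

\[
p(x) \;\propto\; \prod_i f_i\!\left(x_i \mid x_{\setminus i};\, \theta_i\right)
\]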

Product of Conditional Experts

The conditional probability can be estimated using classification models that associate a probability with the output class

Classification models such as Logistic Regression, Kernel Logistic Regression, and Decision Trees can be used

We build a wrapper model that takes the user's choice of experts and builds the estimate (sketched below)
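A minimal sketch of such a wrapper, assuming binary data and scikit-learn's LogisticRegression as the default conditional expert; the class and method names are illustrative only and are not the project's actual code:

```python
# Sketch only: one conditional expert per dimension, binary data assumed.
import numpy as np
from sklearn.linear_model import LogisticRegression

class ProductOfConditionalExperts:
    def __init__(self, expert_factory=LogisticRegression):
        self.expert_factory = expert_factory
        self.experts = []

    def fit(self, X):
        """Fit one conditional expert p(x_i | x_-i) per dimension."""
        _, d = X.shape
        self.experts = []
        for i in range(d):
            others = np.delete(X, i, axis=1)      # all dimensions except i
            expert = self.expert_factory()
            expert.fit(others, X[:, i])           # predict dimension i from the rest
            self.experts.append(expert)
        return self

    def unnormalized_log_prob(self, X):
        """Sum of the experts' log conditional probabilities (normalizer omitted)."""
        n, _ = X.shape
        log_p = np.zeros(n)
        for i, expert in enumerate(self.experts):
            others = np.delete(X, i, axis=1)
            proba = expert.predict_proba(others)              # shape (n, n_classes)
            cols = np.searchsorted(expert.classes_, X[:, i])  # column of the observed value
            log_p += np.log(proba[np.arange(n), cols] + 1e-12)
        return log_p
```

With this structure, swapping in a different expert only means passing a different expert_factory.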

Learning/Training

Follow a gradient-ascent algorithm to maximize the average log probability

The gradient of the objective function is given below:
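Following [1], the gradient of the log probability with respect to the parameters of expert m splits into a data-dependent term and an expectation under the model distribution:

\[
\frac{\partial \log p(d \mid \theta_1,\dots,\theta_n)}{\partial \theta_m}
\;=\;
\frac{\partial \log f_m(d \mid \theta_m)}{\partial \theta_m}
\;-\;
\sum_c p(c \mid \theta_1,\dots,\theta_n)\,
\frac{\partial \log f_m(c \mid \theta_m)}{\partial \theta_m}
\]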

Learning/Training

The second term in the gradient expression (arising from the normalization term) is intractable to compute

Contrastive divergence [1] is then used to approximate the gradient (a rough sketch follows)
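A rough sketch of one CD-1 step for this kind of model, under the assumption that the "reconstruction" is obtained by resampling each dimension from its conditional expert in a Gibbs-like sweep; grad_log_expert is a hypothetical helper returning d/dθ log f_i, not real library code:

```python
# CD-1 sketch: positive phase at the data, negative phase at a one-sweep reconstruction.
import numpy as np

def cd1_gradient(X, experts, grad_log_expert, rng=None):
    rng = np.random.default_rng(0) if rng is None else rng

    # Positive phase: gradients of each expert's log probability at the data.
    positive = [grad_log_expert(expert, X) for expert in experts]

    # Reconstruction: resample every dimension from its conditional expert
    # (binary data with classes {0, 1} assumed).
    X_recon = X.copy()
    for i, expert in enumerate(experts):
        others = np.delete(X_recon, i, axis=1)
        p_one = expert.predict_proba(others)[:, 1]        # P(x_i = 1 | rest)
        X_recon[:, i] = (rng.random(len(X)) < p_one).astype(X.dtype)

    # Negative phase: the same gradients at the reconstruction stand in for
    # the intractable expectation under the model distribution.
    negative = [grad_log_expert(expert, X_recon) for expert in experts]

    return [pos - neg for pos, neg in zip(positive, negative)]
```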

Experts Considered

Logistic Regression (linear classification model)

Kernel Logistic Regression (non-linear model)
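scikit-learn has no built-in kernel logistic regression, so one way to obtain a non-linear conditional expert for the wrapper sketched earlier is an approximate RBF kernel feature map followed by ordinary logistic regression; this is only a stand-in for the project's KLR experts:

```python
# Approximate kernel logistic regression: RBF feature map + linear logistic regression.
from sklearn.pipeline import make_pipeline
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import LogisticRegression

def kernel_logistic_expert(gamma=1.0, n_components=100):
    return make_pipeline(
        Nystroem(kernel="rbf", gamma=gamma, n_components=n_components),
        LogisticRegression(max_iter=1000),
    )

# Usage with the hypothetical wrapper from earlier:
# model = ProductOfConditionalExperts(expert_factory=kernel_logistic_expert)
```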

Density Estimation: Artificial Datasets

[Figure: Set 1 (100 examples, 5 dimensions): the true distribution and the distributions learnt by the linear and non-linear models]

Density Estimation: Artificial Datasets

[Figure: Set 2 (100 examples, 5 dimensions): the true distribution and the distributions learnt by the linear and non-linear models]

Density Estimation: Real Datasets

Adult (200 train + 500 test)

Density Estimation: Real Datasets

Adult (500 train + 500 test)

Density Estimation: Real Datasets

OCR (200 train + 500 test)

Density Estimation: Real Datasets

OCR (500 train + 500 test)

Density Estimation

For the artificial sets, the learned distribution is close to the actual distribution

MoB (the mixture-model baseline) clearly performs better on the training examples than the linear and non-linear models

However, the PoCE model generalizes better

Higher log probability on the test set

Performance is significantly better when there are fewer training examples

Application: Outlier Detection

Detect points that do not belong to a particular class of points

If the model builds a good enough representation of the data, it should assign high probability to points in the inlier class and relatively low probability to points outside it

We train the model on a mix of two classes, with few examples (< 5%) from the outlier class

Test whether outliers and inliers are assigned low and high probabilities, respectively
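A small sketch of this protocol using the illustrative wrapper from earlier: score each point by its (unnormalized) log probability under the model and flag the lowest-scoring fraction as outliers; the 5% cutoff mirrors the contamination level above but is otherwise an arbitrary choice:

```python
# Score points by model log probability and flag the lowest-scoring fraction.
import numpy as np

def flag_outliers(model, X_test, contamination=0.05):
    scores = model.unnormalized_log_prob(X_test)     # higher = more inlier-like
    threshold = np.quantile(scores, contamination)   # cut off the bottom 5%
    return scores < threshold, scores
```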

Application: Outlier Detection

[Figures: outlier-detection results for the four cases discussed below]

Application: Outlier Detection

In 3 out of 4 cases, the outliers in the test and training data get lower average probability

This hints that outlier detection can be carried out to some extent

We now present the precision-recall curves obtained for similar class pairs. Five outliers were kept in both the training and the test sets to obtain the precision-recall curves
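Precision-recall curves of this kind can be computed with scikit-learn, treating "outlier" as the positive class and the negated log probability as the score (lower probability means more outlier-like); this is a generic recipe, not the project's exact evaluation code:

```python
# Precision-recall curve with "outlier" as the positive class.
from sklearn.metrics import precision_recall_curve

def outlier_pr_curve(model, X_test, y_is_outlier):
    scores = -model.unnormalized_log_prob(X_test)
    precision, recall, thresholds = precision_recall_curve(y_is_outlier, scores)
    return precision, recall, thresholds
```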

Application: Outlier Detection

[Figures: precision-recall curves for CYT – MIT]

[Figures: precision-recall curves for CYT – NUC]

Test Set:

All three models perform equally well

Training Set:

KLR does as well as MoB

LR does better than MoB in some cases and worse in others


Future Work

Partition function calculation

Annealed Importance Sampling

Evaluate on larger Datasets

Try other experts

Decision Trees

Known for their interpretability

References

[1] Geoffrey Hinton. Training Products of Experts by Minimizing Contrastive Divergence. 2002.

[2] Hugo Larochelle, Iain Murray. The Neural Autoregressive Distribution Estimator. 2011.

[3] KL Divergence, http://en.wikipedia.org/wiki/Kullback-Leibler_divergence

[4] Restricted Boltzmann Machines, http://en.wikipedia.org/wiki/Restricted_Boltzmann_machine

[5] Logistic Regression, http://en.wikipedia.org/wiki/Logistic_regression

[6] Mixture Models, http://en.wikipedia.org/wiki/Mixture_model

[7] UCI Machine Learning Repository, http://archive.ics.uci.edu/ml/

THANK YOU