Density estimation - Georgia Institute of...

26
Density estimation In density estimation problems, we are given a random sample from an unknown density Our objective is to estimate ?

Transcript of Density estimation - Georgia Institute of...

Page 1: Density estimation - Georgia Institute of Technologymdav.ece.gatech.edu/ece-6254-spring2017/notes/19-KDE-k...Applications Classification •If we estimate the density for each class,

Density estimation

In density estimation problems, we are given a random

sample from an unknown density

Our objective is to estimate

?

Page 2: Density estimation - Georgia Institute of Technologymdav.ece.gatech.edu/ece-6254-spring2017/notes/19-KDE-k...Applications Classification •If we estimate the density for each class,

Applications

Classification

• If we estimate the density for each class, we can simply

plug this in to the formula for the Bayes’ classifier

• Density estimation (for each feature) is the key component

in Naïve Bayes

Clustering

• Clusters can be defined by the density: given a point ,

climb the density until you reach a local maximum

Anomaly detection

• Given a density estimate , we can use the test

to detect anomalies in future observations

Page 3: Density estimation - Georgia Institute of Technologymdav.ece.gatech.edu/ece-6254-spring2017/notes/19-KDE-k...Applications Classification •If we estimate the density for each class,

Kernel density estimation

A kernel density estimate has the form

where is called a kernel

• A kernel density estimate is nonparametric

• Another name for this is the Parzen window method

• The parameter is called the bandwidth

• Looks just like kernel ridge regression, but with equal

weights

• Note that does not necessarily need to be an inner

product kernel

Page 4: Density estimation - Georgia Institute of Technologymdav.ece.gatech.edu/ece-6254-spring2017/notes/19-KDE-k...Applications Classification •If we estimate the density for each class,

Example

Page 5: Density estimation - Georgia Institute of Technologymdav.ece.gatech.edu/ece-6254-spring2017/notes/19-KDE-k...Applications Classification •If we estimate the density for each class,

Kernels

In the context of density estimation, a kernel should satisfy

1.

2.

3. for some

Examples (in )

• Uniform kernel

• Triangular kernel

• Epanichnikov kernel

• Gaussian

• …

Page 6: Density estimation - Georgia Institute of Technologymdav.ece.gatech.edu/ece-6254-spring2017/notes/19-KDE-k...Applications Classification •If we estimate the density for each class,

Kernel bandwidth

The accuracy of a kernel density estimate depends critically

on the bandwidth

Page 7: Density estimation - Georgia Institute of Technologymdav.ece.gatech.edu/ece-6254-spring2017/notes/19-KDE-k...Applications Classification •If we estimate the density for each class,

Setting the bandwidth - Theory

Theorem

Let be a kernel density estimate based on the kernel

Suppose is such that

• as

• as

Then

as , regardless of the true density

Proof: See Devroye and Lugosi, Combinatorial Methods in

Density Estimation (1987)

Page 8: Density estimation - Georgia Institute of Technologymdav.ece.gatech.edu/ece-6254-spring2017/notes/19-KDE-k...Applications Classification •If we estimate the density for each class,

Setting the bandwidth - Practice

Silverman’s rule of thumb

If using the Gaussian kernel, a good choice for is

where is the standard deviation of the samples

How can we apply what we know about model selection to

setting ?

• Randomly split the data into two sets

• Obtain a kernel density estimate for the first

• Measure how well the second set fits this estimate

– e.g., compute another kernel density estimate on the second

set and calculate the KL divergence between the two

• Repeat over many random splits and average

Page 9: Density estimation - Georgia Institute of Technologymdav.ece.gatech.edu/ece-6254-spring2017/notes/19-KDE-k...Applications Classification •If we estimate the density for each class,

Density estimation is hard…

Kernel density estimation works fairly well if you have lots of

data in extremely low-dimensional data

– e.g., 1 or 2 dimensions

Fortunately, it is not strictly necessary in many applications…

Page 10: Density estimation - Georgia Institute of Technologymdav.ece.gatech.edu/ece-6254-spring2017/notes/19-KDE-k...Applications Classification •If we estimate the density for each class,

Clustering

Page 11: Density estimation - Georgia Institute of Technologymdav.ece.gatech.edu/ece-6254-spring2017/notes/19-KDE-k...Applications Classification •If we estimate the density for each class,

Example

Page 12: Density estimation - Georgia Institute of Technologymdav.ece.gatech.edu/ece-6254-spring2017/notes/19-KDE-k...Applications Classification •If we estimate the density for each class,

Formal definition

Suppose

The goal of clustering is to assign the data to disjoint subsets

called clusters, so that points in the same cluster are more

similar to each other than points in different clusters

A clustering can be represented by a cluster map, which is a

function

where is the number of clusters

Page 13: Density estimation - Georgia Institute of Technologymdav.ece.gatech.edu/ece-6254-spring2017/notes/19-KDE-k...Applications Classification •If we estimate the density for each class,

Choose to minimize

where

Note that is assumed fixed and known

is sometimes called the “within-cluster scatter”

-means criterion

Page 14: Density estimation - Georgia Institute of Technologymdav.ece.gatech.edu/ece-6254-spring2017/notes/19-KDE-k...Applications Classification •If we estimate the density for each class,

Within-cluster scatter

It is possible to show that

average distance between

and all other points

in the same cluster

Page 15: Density estimation - Georgia Institute of Technologymdav.ece.gatech.edu/ece-6254-spring2017/notes/19-KDE-k...Applications Classification •If we estimate the density for each class,

How many clusterings?

How many possible cluster maps do we need to consider?

Solutions to this recurrence (with the natural boundary

conditions) are called Stirling’s numbers of the second kind

Examples

# of clusterings of objects into clusters

Page 16: Density estimation - Georgia Institute of Technologymdav.ece.gatech.edu/ece-6254-spring2017/notes/19-KDE-k...Applications Classification •If we estimate the density for each class,

There is no known efficient search strategy for this space

Can be solved (exactly) in time

Completely impractical unless both and are extremely

small

• e.g., already results in

More formally, minimizing the -means criterion is a

combinatorial optimization problem (NP-hard)

Instead, we resort to an iterative, suboptimal algorithm

Minimizing the -means criterion

Page 17: Density estimation - Georgia Institute of Technologymdav.ece.gatech.edu/ece-6254-spring2017/notes/19-KDE-k...Applications Classification •If we estimate the density for each class,

Another look at -means

Recall that we want to find

Note that for fixed

Therefore, we can equivalently write

Page 18: Density estimation - Georgia Institute of Technologymdav.ece.gatech.edu/ece-6254-spring2017/notes/19-KDE-k...Applications Classification •If we estimate the density for each class,

An iterative algorithm

This suggests an iterative algorithm

1. Given , choose to minimize

2. Given , choose to minimize

Page 19: Density estimation - Georgia Institute of Technologymdav.ece.gatech.edu/ece-6254-spring2017/notes/19-KDE-k...Applications Classification •If we estimate the density for each class,

-means clustering algorithm

The solutions to each sub-problem are given by

1.

2.

Algorithm

Initialize

Repeat until clusters don’t change

Page 20: Density estimation - Georgia Institute of Technologymdav.ece.gatech.edu/ece-6254-spring2017/notes/19-KDE-k...Applications Classification •If we estimate the density for each class,

Initialization

Traditionally, the algorithm is typically initialized by setting

each to be a random point in the dataset

However, depending on the initialization, the algorithm can

get stuck in a local minimum

One can avoid this by:

• repeating for several random initializations

• initialize by sequentially selecting random points in the

dataset, but with a probability depending on how far the

point is from the already selected : -means ++

Page 21: Density estimation - Georgia Institute of Technologymdav.ece.gatech.edu/ece-6254-spring2017/notes/19-KDE-k...Applications Classification •If we estimate the density for each class,

Clusters are “nearest neighbor” regions or Vornoi cells

defined with respect to the cluster means

Cluster boundaries are formed by the intersections of

hyperplanes

-means will “fail” if clusters are nonconvex

Cluster geometry

Page 22: Density estimation - Georgia Institute of Technologymdav.ece.gatech.edu/ece-6254-spring2017/notes/19-KDE-k...Applications Classification •If we estimate the density for each class,

Remarks

• Algorithm originally developed at Bell Labs as an approach

to vector quantization

• If we replace the norm with the norm in our function

, then

– the geometry of our Vornoi regions will change

– the “center” of each region is actually calculated via the

median in each dimension

– results in -medians clustering

Page 23: Density estimation - Georgia Institute of Technologymdav.ece.gatech.edu/ece-6254-spring2017/notes/19-KDE-k...Applications Classification •If we estimate the density for each class,

Model selection for -means

How to choose ?

Let be the within-cluster scatter based on clusters

If the “right” number of clusters is , we expect

• for , will be large

• for , will be small

This suggests choosing to be near the “knee” of the curve

Page 24: Density estimation - Georgia Institute of Technologymdav.ece.gatech.edu/ece-6254-spring2017/notes/19-KDE-k...Applications Classification •If we estimate the density for each class,

Another take on -means

I have followed the standard development of the -means

clustering algorithm, but there is another way to view this

algorithm…

as simply another instance of structured matrix factorization

where

and has exactly one “1” per column (the rest being zero)

Page 25: Density estimation - Georgia Institute of Technologymdav.ece.gatech.edu/ece-6254-spring2017/notes/19-KDE-k...Applications Classification •If we estimate the density for each class,

Beyond -means clustering

In -means clustering, by measuring the within-cluster

scatter as

we are implicitly assuming that each cluster is roughly

spherical in shape

This is probably unrealistic

But if we don’t even know what the clusters are

we definitely don’t know how they are shaped…

Page 26: Density estimation - Georgia Institute of Technologymdav.ece.gatech.edu/ece-6254-spring2017/notes/19-KDE-k...Applications Classification •If we estimate the density for each class,

Clustering with Gaussian mixture models

One way to extend the basic idea behind -means clustering

to allow for more general cluster shapes is to assume

• the clusters are elliptical

• each cluster can be modeled using a multivariate

Gaussian density

• the full data set is modeled using a Gaussian mixture

model (GMM)

We can then perform clustering based on a maximum

likelihood estimation of the GMM