Density estimation
Transcript of lecture slides from ECE 6254 (Spring 2017), Georgia Institute of Technology (mdav.ece.gatech.edu/ece-6254-spring2017/notes/19-KDE-k...)
Density estimation
In density estimation problems, we are given a random sample x_1, ..., x_n drawn i.i.d. from an unknown density f
Our objective is to estimate f
Applications
Classification
• If we estimate the density for each class, we can simply plug these estimates into the formula for the Bayes’ classifier
• Density estimation (for each feature) is the key component
in Naïve Bayes
Clustering
• Clusters can be defined by the density: given a point x, climb the density until you reach a local maximum
Anomaly detection
• Given a density estimate f_hat, we can use the test f_hat(x) < γ (for some threshold γ) to detect anomalies in future observations
Kernel density estimation
A kernel density estimate has the form

    f_hat(x) = (1/(n h^d)) sum_{i=1}^n K((x - x_i)/h)

where K is called a kernel
• A kernel density estimate is nonparametric
• Another name for this is the Parzen window method
• The parameter h > 0 is called the bandwidth
• Looks just like kernel ridge regression, but with equal weights (1/n) on every sample
• Note that K does not necessarily need to be an inner product kernel
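A minimal 1-D sketch of this estimator in Python (NumPy only); the Gaussian kernel and all names here are illustrative choices:

```python
import numpy as np

def gaussian_kernel(u):
    """Standard Gaussian kernel: K(u) = exp(-u^2/2) / sqrt(2*pi)."""
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def kde(x, samples, h):
    """Kernel density estimate at the points in x, built from 1-D samples
    with bandwidth h: f_hat(x) = (1/(n*h)) * sum_i K((x - x_i)/h)."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    return np.array([np.mean(gaussian_kernel((xi - samples) / h)) / h
                     for xi in x])

rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, size=1000)   # sample from a standard normal
est = kde(0.0, data, h=0.3)
```

Since the kernel integrates to one, so does the estimate, which is what makes f_hat a legitimate density.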
Example
Kernels
In the context of density estimation, a kernel should satisfy
1. K(x) ≥ 0 for all x
2. ∫ K(x) dx = 1
3. K(x) ≤ φ(||x||) for some bounded, decreasing, integrable φ (a mild decay condition)
Examples (in R)
• Uniform kernel: K(u) = 1/2 for |u| ≤ 1 (0 otherwise)
• Triangular kernel: K(u) = max(1 − |u|, 0)
• Epanechnikov kernel: K(u) = (3/4) max(1 − u^2, 0)
• Gaussian kernel: K(u) = (2π)^(−1/2) exp(−u^2/2)
• …
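Written out as code, each of these kernels is nonnegative and integrates to one; a quick NumPy check (the dictionary layout is illustrative):

```python
import numpy as np

# The four kernels above, written as functions on R; each one is
# nonnegative and integrates to 1
kernels = {
    "uniform":      lambda u: 0.5 * (np.abs(u) <= 1),
    "triangular":   lambda u: np.maximum(1.0 - np.abs(u), 0.0),
    "epanechnikov": lambda u: 0.75 * np.maximum(1.0 - u**2, 0.0),
    "gaussian":     lambda u: np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi),
}

# Check the normalization numerically with a Riemann sum
u = np.linspace(-6.0, 6.0, 12001)
areas = {name: np.sum(K(u)) * (u[1] - u[0]) for name, K in kernels.items()}
```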
Kernel bandwidth
The accuracy of a kernel density estimate depends critically on the bandwidth h
Setting the bandwidth - Theory
Theorem
Let f_hat be a kernel density estimate based on the kernel K
Suppose the bandwidth h = h_n is such that
• h_n → 0 as n → ∞
• n h_n^d → ∞ as n → ∞
Then

    E [ ∫ | f_hat(x) − f(x) | dx ] → 0

as n → ∞, regardless of the true density f
Proof: See Devroye and Lugosi, Combinatorial Methods in Density Estimation (Springer, 2001)
Setting the bandwidth - Practice
Silverman’s rule of thumb
If using the Gaussian kernel, a good choice for h (in one dimension) is

    h = 1.06 * sigma_hat * n^(−1/5)

where sigma_hat is the standard deviation of the samples
How can we apply what we know about model selection to setting h?
• Randomly split the data into two sets
• Obtain a kernel density estimate for the first
• Measure how well the second set fits this estimate
– e.g., compute another kernel density estimate on the second
set and calculate the KL divergence between the two
• Repeat over many random splits and average
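The split-and-score procedure above can be sketched as follows, using average held-out log-likelihood as the goodness-of-fit measure (a simpler stand-in for the KL divergence mentioned above; the candidate bandwidths and all names are illustrative):

```python
import numpy as np

def kde_eval(x, samples, h):
    """Gaussian KDE built from `samples`, evaluated at the points x (1-D)."""
    diffs = (x[:, None] - samples[None, :]) / h
    return np.mean(np.exp(-0.5 * diffs**2), axis=1) / (h * np.sqrt(2 * np.pi))

def held_out_score(data, h, n_splits=20, rng=None):
    """Average held-out log-likelihood over random 50/50 splits."""
    rng = rng or np.random.default_rng(0)
    scores = []
    for _ in range(n_splits):
        perm = rng.permutation(len(data))
        half = len(data) // 2
        train, test = data[perm[:half]], data[perm[half:]]
        # Fit on the first half, score how well the second half fits
        scores.append(np.mean(np.log(kde_eval(test, train, h) + 1e-300)))
    return np.mean(scores)

rng = np.random.default_rng(1)
data = rng.normal(size=500)
hs = [0.02, 0.1, 0.3, 1.0, 3.0]
best_h = max(hs, key=lambda h: held_out_score(data, h))
```

For a standard normal sample of this size, Silverman’s rule gives h ≈ 1.06 · sigma_hat · 250^(−1/5) ≈ 0.35 on each training half, consistent with the bandwidths this procedure tends to select.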
Density estimation is hard…
Kernel density estimation works fairly well if you have lots of data and the data are extremely low-dimensional
– e.g., 1 or 2 dimensions
Fortunately, it is not strictly necessary in many applications…
Clustering
Example
Formal definition
Suppose we are given data x_1, ..., x_n ∈ R^d
The goal of clustering is to assign the data to disjoint subsets
called clusters, so that points in the same cluster are more
similar to each other than points in different clusters
A clustering can be represented by a cluster map, which is a function

    c : {1, ..., n} → {1, ..., k}

where k is the number of clusters
k-means criterion
Choose c to minimize

    W(c) = sum_{j=1}^k sum_{i : c(i) = j} ||x_i − xbar_j||^2

where

    xbar_j = (1/n_j) sum_{i : c(i) = j} x_i    and    n_j = |{i : c(i) = j}|

Note that k is assumed fixed and known
W(c) is sometimes called the “within-cluster scatter”
Within-cluster scatter
It is possible to show that

    W(c) = sum_{j=1}^k (1/(2 n_j)) sum_{i : c(i) = j} sum_{i' : c(i') = j} ||x_i − x_{i'}||^2

so each point x_i contributes (half) its average squared distance to all other points in the same cluster
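This identity is easy to verify numerically; a quick NumPy check with arbitrary points and an arbitrary cluster map:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))      # 30 points in R^2
c = np.arange(30) % 3             # an arbitrary cluster map with k = 3

# Left-hand side: sum of squared distances to the cluster means
lhs = sum(np.sum((X[c == j] - X[c == j].mean(axis=0))**2) for j in range(3))

# Right-hand side: for each cluster, the sum of all pairwise squared
# distances, divided by twice the cluster size
rhs = 0.0
for j in range(3):
    Xj = X[c == j]
    D2 = np.sum((Xj[:, None, :] - Xj[None, :, :])**2, axis=2)
    rhs += D2.sum() / (2 * len(Xj))
```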
How many clusterings?
How many possible cluster maps do we need to consider?
Let S(n, k) denote the number of clusterings of n objects into k clusters; it satisfies

    S(n, k) = k · S(n−1, k) + S(n−1, k−1)

Solutions to this recurrence (with the natural boundary conditions) are called Stirling numbers of the second kind
Examples
– S(4, 2) = 7
– S(10, 4) = 34,105
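The recurrence translates directly into code; a short memoized Python version:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def stirling2(n, k):
    """Stirling number of the second kind, via the recurrence
    S(n, k) = k*S(n-1, k) + S(n-1, k-1), with S(0, 0) = 1."""
    if n == 0 and k == 0:
        return 1
    if n == 0 or k == 0:
        return 0
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)
```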
Minimizing the k-means criterion
There is no known efficient search strategy for this space
Can be solved (exactly) in O(n^(dk+1)) time
Completely impractical unless n, d, and k are all extremely small
• e.g., n = 100, d = 2, k = 3 already results in n^(dk+1) = 10^14 operations
More formally, minimizing the k-means criterion is a combinatorial optimization problem (NP-hard)
Instead, we resort to an iterative, suboptimal algorithm
Another look at k-means
Recall that we want to find

    min_c W(c) = min_c sum_{j=1}^k sum_{i : c(i) = j} ||x_i − xbar_j||^2

Note that for fixed c,

    xbar_j = argmin_m sum_{i : c(i) = j} ||x_i − m||^2

Therefore, we can equivalently write

    min_c min_{m_1, ..., m_k} sum_{j=1}^k sum_{i : c(i) = j} ||x_i − m_j||^2
An iterative algorithm
This suggests an iterative algorithm
1. Given c, choose m_1, ..., m_k to minimize sum_j sum_{i : c(i) = j} ||x_i − m_j||^2
2. Given m_1, ..., m_k, choose c to minimize the same objective
k-means clustering algorithm
The solutions to each sub-problem are given by
1. m_j = (1/n_j) sum_{i : c(i) = j} x_i  (the mean of cluster j)
2. c(i) = argmin_j ||x_i − m_j||  (assign each point to its nearest center)
Algorithm
Initialize m_1, ..., m_k
Repeat until clusters don’t change
– c(i) = argmin_j ||x_i − m_j|| for each i
– m_j = (1/n_j) sum_{i : c(i) = j} x_i for each j
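The alternation above (Lloyd's algorithm) fits in a few lines of NumPy; the initialization and the toy two-blob data are illustrative:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Lloyd's algorithm: alternate the assignment and mean-update steps."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # init at data points
    for _ in range(n_iters):
        # Assignment step: send each point to its nearest center
        d2 = np.sum((X[:, None, :] - centers[None, :, :])**2, axis=2)
        labels = np.argmin(d2, axis=1)
        # Update step: move each center to the mean of its assigned points
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):   # clusters no longer change
            break
        centers = new_centers
    return labels, centers

rng = np.random.default_rng(1)
# Two well-separated blobs in R^2
X = np.vstack([rng.normal(-5, 1, size=(50, 2)), rng.normal(5, 1, size=(50, 2))])
labels, centers = kmeans(X, k=2)
```

On data this well separated, the two recovered centers should land near the two blob means.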
Initialization
Traditionally, the algorithm is initialized by setting
each m_j to be a random point in the dataset
However, depending on the initialization, the algorithm can
get stuck in a local minimum
One can avoid this by:
• repeating for several random initializations and keeping the best result
• initializing by sequentially selecting random points from the
dataset, with a probability depending on how far each point is
from the already selected centers: k-means++
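The k-means++ seeding rule can be sketched as follows: candidate centers are drawn with probability proportional to squared distance from the nearest center chosen so far (the toy three-blob data is illustrative):

```python
import numpy as np

def kmeanspp_init(X, k, rng):
    """k-means++ seeding: the first center is a uniformly random data point;
    each later center is a data point chosen with probability proportional
    to its squared distance from the nearest center selected so far."""
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance from each point to its nearest chosen center
        d2 = np.min([np.sum((X - c)**2, axis=1) for c in centers], axis=0)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(m, 1.0, size=(30, 2))
               for m in [(0, 0), (100, 0), (0, 100)]])
centers = kmeanspp_init(X, 3, rng)
```

Because far-away points are much more likely to be chosen, the three seeds land in three different blobs with very high probability, which is exactly why this initialization avoids the bad local minima described above.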
Cluster geometry
Clusters are “nearest neighbor” regions or Voronoi cells
defined with respect to the cluster means
Cluster boundaries are formed by the intersections of
hyperplanes
k-means will “fail” if clusters are nonconvex
Remarks
• Algorithm originally developed at Bell Labs as an approach
to vector quantization
• If we replace the ℓ2 norm with the ℓ1 norm in our objective W(c), then
– the geometry of our Voronoi regions will change
– the “center” of each region is actually calculated via the
median in each dimension
– results in k-medians clustering
Model selection for k-means
How to choose k?
Let W_k be the within-cluster scatter based on k clusters
If the “right” number of clusters is k*, we expect
• for k < k*, W_k will be large
• for k ≥ k*, W_k will be small
This suggests choosing k to be near the “knee” of the curve of W_k versus k
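A sketch of this procedure on synthetic data with three well-separated clusters, using a basic k-means with random restarts to approximate W_k for each k (all names and the toy data are illustrative):

```python
import numpy as np

def kmeans(X, k, n_iters=50, seed=0):
    """Basic k-means (Lloyd's algorithm) with data-point initialization."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        labels = np.argmin(np.sum((X[:, None] - centers[None])**2, axis=2), axis=1)
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels, centers

def scatter(X, labels, centers):
    """Within-cluster scatter: sum of squared distances to assigned centers."""
    return sum(np.sum((X[labels == j] - centers[j])**2) for j in range(len(centers)))

rng = np.random.default_rng(3)
# Three well-separated clusters, so the knee should appear at k = 3
X = np.vstack([rng.normal(m, 0.5, size=(40, 2)) for m in [(0, 0), (10, 0), (0, 10)]])

# Approximate W_k with the best of several random restarts for each k
W = {k: min(scatter(X, *kmeans(X, k, seed=s)) for s in range(10))
     for k in range(1, 7)}
```

Plotting W_k against k, the curve drops sharply up to k = 3 and then flattens, which is the knee this heuristic looks for.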
Another take on -means
I have followed the standard development of the k-means
clustering algorithm, but there is another way to view this
algorithm…
as simply another instance of structured matrix factorization

    X ≈ M C

where X (d x n) contains the data points as columns, M (d x k) contains the cluster centers as columns, and C (k x n) has exactly one “1” per column (the rest being zero)
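A quick numerical check of this equivalence, with arbitrary data and an arbitrary cluster map (the variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, k = 2, 20, 3
X = rng.normal(size=(d, n))       # data points are the columns of X
labels = np.arange(n) % k         # an arbitrary cluster map

# M holds the cluster centers as columns
M = np.array([X[:, labels == j].mean(axis=1) for j in range(k)]).T

# C is a k x n selector matrix with exactly one 1 per column
C = np.zeros((k, n))
C[labels, np.arange(n)] = 1.0

# The within-cluster scatter equals the squared Frobenius norm ||X - M C||_F^2
frob = np.sum((X - M @ C)**2)
scatter = sum(np.sum((X[:, labels == j].T - M[:, j])**2) for j in range(k))
```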
Beyond k-means clustering
In k-means clustering, by measuring the within-cluster scatter as

    W(c) = sum_{j=1}^k sum_{i : c(i) = j} ||x_i − xbar_j||^2

we are implicitly assuming that each cluster is roughly
spherical in shape
This is probably unrealistic
But if we don’t even know what the clusters are,
we definitely don’t know how they are shaped…
Clustering with Gaussian mixture models
One way to extend the basic idea behind k-means clustering
to allow for more general cluster shapes is to assume
• the clusters are elliptical
• each cluster can be modeled using a multivariate
Gaussian density
• the full data set is modeled using a Gaussian mixture
model (GMM)
We can then perform clustering based on maximum
likelihood estimation of the GMM
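A compact sketch of this approach in one dimension, using EM to fit the GMM and assigning each point to its most probable component (the quantile-based initialization and all names are illustrative choices):

```python
import numpy as np

def gmm_em_1d(x, k=2, n_iters=200):
    """EM for a 1-D Gaussian mixture model: alternate computing posterior
    component probabilities (E-step) and re-estimating the weights, means,
    and variances from them (M-step)."""
    w = np.full(k, 1.0 / k)
    mu = np.quantile(x, np.linspace(0.05, 0.95, k))   # spread-out initial means
    var = np.full(k, np.var(x))
    for _ in range(n_iters):
        # E-step: responsibilities r[i, j] = P(component j | x_i)
        dens = w * np.exp(-0.5 * (x[:, None] - mu)**2 / var) / np.sqrt(2*np.pi*var)
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted maximum likelihood updates
        nk = r.sum(axis=0)
        w = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu)**2).sum(axis=0) / nk
    # Hard clustering: assign each point to its most probable component
    return w, mu, var, np.argmax(r, axis=1)

rng = np.random.default_rng(4)
x = np.concatenate([rng.normal(-4, 1, 300), rng.normal(4, 1, 300)])
w, mu, var, labels = gmm_em_1d(x)
```

Unlike k-means, each component carries its own variance (and, in higher dimensions, its own covariance), which is what allows elliptical rather than only spherical clusters.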