Transcript of: anderson.ece.gatech.edu › ... › assets › 19-kde-k-means-marked.pdf · 2018-04-04
Density estimation
In density estimation problems, we are given a random
sample x_1, …, x_n from an unknown density f
Our objective is to estimate f
Applications
Classification
• If we estimate the density for each class, we can simply
plug these estimates into the formula for the Bayes’ classifier
• Density estimation (for each feature) is the key component
in Naïve Bayes
Clustering
• Clusters can be defined by the density: given a point x,
climb the density until you reach a local maximum
Anomaly detection
• Given a density estimate f̂, we can flag a future observation x
as anomalous when f̂(x) falls below some threshold
Kernel density estimation
A kernel density estimate has the form

f̂(x) = (1/n) Σ_{i=1}^{n} (1/h^d) K((x − x_i)/h)

where K is called a kernel
• A kernel density estimate is nonparametric
• Another name for this is the Parzen window method
• The parameter h is called the bandwidth
• Looks just like kernel ridge regression, but with equal
weights 1/n
• Note that K does not necessarily need to be an inner
product kernel
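The formula above can be sketched in a few lines of Python (a minimal sketch, assuming one dimension and a Gaussian kernel as one common choice of K; all function names are ours):

```python
import math

def kde(data, h, kernel):
    """f_hat(x) = (1/(n*h)) * sum_i K((x - x_i)/h) for 1-d data."""
    n = len(data)
    return lambda x: sum(kernel((x - xi) / h) for xi in data) / (n * h)

# Gaussian kernel, one common choice of K
gauss = lambda u: math.exp(-u * u / 2) / math.sqrt(2 * math.pi)

sample = [-1.0, -0.5, 0.0, 0.5, 1.0]
f_hat = kde(sample, h=0.5, kernel=gauss)

# Crude Riemann-sum check that the estimate integrates to ~1
area = sum(f_hat(i * 0.01) * 0.01 for i in range(-600, 601))
```

Note the estimate is just a sum of small bumps, one per sample, so it integrates to 1 whenever the kernel does.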
Example
Kernels
In the context of density estimation, a kernel K should satisfy
1. K(x) ≥ 0 for all x
2. ∫ K(x) dx = 1
3. K(x) ≤ M for all x, for some M < ∞
Examples (in ℝ)
• Uniform kernel: K(u) = 1/2 for |u| ≤ 1, and 0 otherwise
• Triangular kernel: K(u) = 1 − |u| for |u| ≤ 1, and 0 otherwise
• Epanechnikov kernel: K(u) = (3/4)(1 − u²) for |u| ≤ 1, and 0 otherwise
• Gaussian kernel: K(u) = (1/√(2π)) e^{−u²/2}
• …
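These four example kernels translate directly into code, and a crude midpoint-rule integral (a sketch; the grid and tolerance are ours) confirms that each satisfies condition 2:

```python
import math

# The four example kernels on R; each is nonnegative and integrates to 1
uniform      = lambda u: 0.5 if abs(u) <= 1 else 0.0
triangular   = lambda u: max(0.0, 1.0 - abs(u))
epanechnikov = lambda u: 0.75 * (1.0 - u * u) if abs(u) <= 1 else 0.0
gaussian     = lambda u: math.exp(-u * u / 2) / math.sqrt(2 * math.pi)

def integral(k, lo=-8.0, hi=8.0, step=1e-3):
    """Midpoint-rule approximation of the integral of k over [lo, hi]."""
    n = int(round((hi - lo) / step))
    return sum(k(lo + (i + 0.5) * step) for i in range(n)) * step
```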
Kernel bandwidth
The accuracy of a kernel density estimate depends critically
on the bandwidth h
Setting the bandwidth - Theory
Theorem
Let f̂_n be a kernel density estimate based on the kernel K
with bandwidth h = h_n
Suppose h_n is such that
• h_n → 0 as n → ∞
• n · h_n^d → ∞ as n → ∞
Then
∫ |f̂_n(x) − f(x)| dx → 0
as n → ∞, regardless of the true density f
Proof: See Devroye and Lugosi, Combinatorial Methods in
Density Estimation (2001)
Setting the bandwidth - Practice
Silverman’s rule of thumb
If using the Gaussian kernel, a good choice for h is
h = 1.06 · σ̂ · n^{−1/5}
where σ̂ is the standard deviation of the samples
How can we apply what we know about model selection to
setting h?
• Randomly split the data into two sets
• Obtain a kernel density estimate from the first set
• Measure how well the second set fits this estimate
– e.g., compute another kernel density estimate on the second
set and calculate the KL divergence between the two
• Repeat over many random splits and average
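Both ideas can be sketched in Python. This is a sketch, not the slides' code: the held-out score below uses the average log-density of the second set (a stand-in for the KL-divergence comparison mentioned above), and all function names are ours:

```python
import math
import random

def silverman_bandwidth(data):
    """Silverman's rule of thumb for the Gaussian kernel: h = 1.06 * sigma * n^(-1/5)."""
    n = len(data)
    mean = sum(data) / n
    sigma = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    return 1.06 * sigma * n ** (-0.2)

def gauss_kde(data, h):
    """1-d Gaussian kernel density estimate with bandwidth h."""
    n = len(data)
    c = 1.0 / (n * h * math.sqrt(2 * math.pi))
    return lambda x: c * sum(math.exp(-((x - xi) / h) ** 2 / 2) for xi in data)

def holdout_score(data, h, seed=0):
    """Fit a KDE on a random half of the data; return the mean
    log-density it assigns to the other half (higher is better)."""
    rng = random.Random(seed)
    idx = list(range(len(data)))
    rng.shuffle(idx)
    half = len(data) // 2
    f_hat = gauss_kde([data[i] for i in idx[:half]], h)
    held_out = [data[i] for i in idx[half:]]
    return sum(math.log(f_hat(x)) for x in held_out) / len(held_out)

rng = random.Random(1)
data = [rng.gauss(0.0, 1.0) for _ in range(50)]
h = silverman_bandwidth(data)
score = holdout_score(data, h)
```

Averaging `holdout_score` over several seeds, for each candidate h on a grid, gives the split-and-average procedure described above.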
Density estimation is hard…
Kernel density estimation works fairly well if you have lots of
data and the dimension is extremely low
– e.g., 1 or 2 dimensions
Fortunately, it is not strictly necessary in many applications…
Clustering
Example
Formal definition
Suppose we are given x_1, …, x_n ∈ ℝ^d
The goal of clustering is to assign the data to disjoint subsets
called clusters, so that points in the same cluster are more
similar to each other than points in different clusters
A clustering can be represented by a cluster map, which is a
function
C : {1, …, n} → {1, …, K}
where K is the number of clusters
K-means criterion
Choose the cluster map C to minimize
W(C) = Σ_{k=1}^{K} Σ_{i : C(i)=k} ‖x_i − x̄_k‖²
where
x̄_k = (1/n_k) Σ_{i : C(i)=k} x_i   and   n_k = |{i : C(i) = k}|
Note that K is assumed fixed and known
W(C) is sometimes called the “within-cluster scatter”
Within-cluster scatter
It is possible to show that
W(C) = (1/2) Σ_{k=1}^{K} Σ_{i : C(i)=k} (1/n_k) Σ_{j : C(j)=k} ‖x_i − x_j‖²
where the inner sum is the average squared distance between x_i
and all other points in the same cluster
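This identity is easy to check numerically for a single cluster (a sketch; the points and function names are ours):

```python
def scatter_via_mean(points):
    """Sum of squared distances from each point to the cluster mean."""
    n, d = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(d)]
    return sum(sum((p[j] - mean[j]) ** 2 for j in range(d)) for p in points)

def scatter_via_pairs(points):
    """The same quantity via pairwise distances: (1/(2n)) * sum_ij ||x_i - x_j||^2."""
    n, d = len(points), len(points[0])
    total = sum(sum((p[j] - q[j]) ** 2 for j in range(d))
                for p in points for q in points)
    return total / (2 * n)

pts = [(0.0, 0.0), (1.0, 0.0), (1.0, 2.0), (3.0, 1.0)]
```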
How many clusterings?
How many possible cluster maps do we need to consider?
Let S(n, K) denote the number of clusterings of n objects into
K clusters; it satisfies the recurrence
S(n, K) = K · S(n − 1, K) + S(n − 1, K − 1)
Solutions to this recurrence (with the natural boundary
conditions) are called Stirling’s numbers of the second kind
Examples
– S(10, 4) = 34,105
– S(19, 4) ≈ 10^{10}
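The recurrence translates directly into code (a sketch, with memoization so repeated subproblems are computed once):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def stirling2(n, k):
    """Stirling number of the second kind via S(n,K) = K*S(n-1,K) + S(n-1,K-1),
    with boundary conditions S(0,0) = 1 and S(n,0) = 0 for n > 0."""
    if k == 0:
        return 1 if n == 0 else 0
    if k > n:
        return 0
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)
```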
Minimizing the K-means criterion
More formally, minimizing the K-means criterion is a
combinatorial optimization problem (NP-hard)
There is no known efficient search strategy for this space
The problem can be solved exactly by searching over all
possible cluster maps, but this is completely impractical
unless both n and K are extremely small
• e.g., even modest values of n and K put exhaustive search
out of reach
Instead, we resort to an iterative, suboptimal algorithm
Another look at K-means
Recall that we want to find
min_C Σ_{k=1}^{K} Σ_{i : C(i)=k} ‖x_i − x̄_k‖²
Note that for fixed C,
x̄_k = arg min_c Σ_{i : C(i)=k} ‖x_i − c‖²
Therefore, we can equivalently write
min_C min_{c_1, …, c_K} Σ_{k=1}^{K} Σ_{i : C(i)=k} ‖x_i − c_k‖²
An iterative algorithm
This suggests an iterative algorithm
1. Given C, choose c_1, …, c_K to minimize
   Σ_{k=1}^{K} Σ_{i : C(i)=k} ‖x_i − c_k‖²
2. Given c_1, …, c_K, choose C to minimize
   Σ_{k=1}^{K} Σ_{i : C(i)=k} ‖x_i − c_k‖²
K-means clustering algorithm
The solutions to each sub-problem are given by
1. c_k = x̄_k = (1/n_k) Σ_{i : C(i)=k} x_i
2. C(i) = arg min_k ‖x_i − c_k‖
Algorithm
Initialize c_1, …, c_K
Repeat until clusters don’t change
– C(i) = arg min_k ‖x_i − c_k‖ for each i
– c_k = mean of {x_i : C(i) = k} for each k
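The alternating steps can be sketched as plain Python over tuples (a sketch; the tie-breaking and empty-cluster handling are our choices, not specified on the slides):

```python
def kmeans(points, centers, iters=100):
    """Alternate the two steps: assign each point to its nearest center,
    then move each center to the mean of its assigned points."""
    d = len(points[0])
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: C(i) = argmin_k ||x_i - c_k||^2
        labels = [min(range(len(centers)),
                      key=lambda k: sum((p[j] - centers[k][j]) ** 2
                                        for j in range(d)))
                  for p in points]
        # Update step: c_k = mean of the points assigned to cluster k
        new_centers = []
        for k in range(len(centers)):
            members = [p for p, l in zip(points, labels) if l == k]
            if members:
                new_centers.append(tuple(sum(p[j] for p in members) / len(members)
                                         for j in range(d)))
            else:
                new_centers.append(centers[k])  # leave empty clusters in place
        if new_centers == centers:  # clusters didn't change: done
            break
        centers = new_centers
    return centers, labels

pts = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.1)]
centers, labels = kmeans(pts, [(0.0, 0.0), (1.0, 1.0)])
```

On this toy data the algorithm converges in two passes, splitting the two tight groups.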
Initialization
Typically, the algorithm is initialized by setting
each c_k to be a random point in the dataset
However, depending on the initialization, the algorithm can
get stuck in a local minimum
One can avoid this by:
• repeating for several random initializations
• initializing by sequentially selecting random points in the
dataset, with a probability depending on how far each
point is from the already selected centers: K-means++
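The K-means++ seeding step can be sketched as follows (a sketch: each new center is drawn with probability proportional to squared distance from the nearest center already chosen; the function name is ours):

```python
import random

def kmeanspp_init(points, K, rng):
    """K-means++ seeding: sample each new center from the data with
    probability proportional to its squared distance from the nearest
    already-chosen center."""
    centers = [rng.choice(points)]
    while len(centers) < K:
        d2 = [min(sum((p[j] - c[j]) ** 2 for j in range(len(p))) for c in centers)
              for p in points]
        r = rng.random() * sum(d2)
        acc = 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= r:  # inverse-CDF sampling over the weights d2
                centers.append(p)
                break
    return centers

pts = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.1)]
centers = kmeanspp_init(pts, 2, random.Random(0))
```

Because far-away points get large weights, the two seeds tend to land in different groups, which is exactly what makes this initialization effective.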
Cluster geometry
Clusters are “nearest neighbor” regions, or Voronoi cells,
defined with respect to the cluster means
Cluster boundaries are formed by the intersections of
hyperplanes
K-means will “fail” if clusters are nonconvex
Remarks
• Algorithm originally developed at Bell Labs as an approach
to vector quantization
• If we replace the ℓ2 norm with the ℓ1 norm in our objective
function, then
– the geometry of our Voronoi regions will change
– the “center” of each region is actually calculated via the
median in each dimension
– results in K-medians clustering
Model selection for K-means
How to choose K?
Let W_K be the within-cluster scatter based on K clusters
If the “right” number of clusters is K*, we expect
• for K ≤ K*, the decrease W_K − W_{K+1} will be large
• for K > K*, the decrease W_K − W_{K+1} will be small
This suggests choosing K to be near the “knee” of the curve
of W_K versus K
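A tiny illustration of the knee (a sketch: W_K is found by exhaustive search over cluster maps, viable only at this toy size; the data is ours and consists of three tight pairs, so the drops in W_K flatten after K = 3):

```python
from itertools import product

def scatter(points, labels, K):
    """Within-cluster scatter W(C) for a given cluster map."""
    d = len(points[0])
    W = 0.0
    for k in range(K):
        members = [p for p, l in zip(points, labels) if l == k]
        if not members:
            continue
        mean = [sum(p[j] for p in members) / len(members) for j in range(d)]
        W += sum(sum((p[j] - mean[j]) ** 2 for j in range(d)) for p in members)
    return W

def best_scatter(points, K):
    """Minimal W over all K^n cluster maps (exhaustive; toy-sized n only)."""
    return min(scatter(points, labels, K)
               for labels in product(range(K), repeat=len(points)))

pts = [(0.0, 0.0), (0.2, 0.0), (5.0, 5.0), (5.2, 5.0), (0.0, 5.0), (0.2, 5.0)]
W = [best_scatter(pts, K) for K in (1, 2, 3, 4)]
```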
Another take on K-means
I have followed the standard development of the K-means
clustering algorithm, but there is another way to view this
algorithm…
as simply another instance of structured matrix factorization:
min_{M, Z} ‖X − M Z‖_F²
where
X = [x_1 ⋯ x_n] ∈ ℝ^{d×n},  M = [c_1 ⋯ c_K] ∈ ℝ^{d×K},  Z ∈ {0, 1}^{K×n}
and Z has exactly one “1” per column (the rest being zero)
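The equivalence between the two objectives is easy to verify numerically (a sketch; the points, centers, labels, and function names are ours):

```python
def kmeans_objective(points, centers, labels):
    """K-means objective: sum over i of ||x_i - c_{C(i)}||^2."""
    return sum(sum((p[j] - centers[l][j]) ** 2 for j in range(len(p)))
               for p, l in zip(points, labels))

def frobenius_objective(points, centers, labels):
    """||X - M Z||_F^2 with X = [x_1 ... x_n] (d x n), M = [c_1 ... c_K]
    (d x K), and Z the 0/1 assignment matrix with one "1" per column."""
    d, K, n = len(points[0]), len(centers), len(points)
    Z = [[1 if labels[i] == k else 0 for i in range(n)] for k in range(K)]
    MZ = [[sum(centers[k][j] * Z[k][i] for k in range(K)) for i in range(n)]
          for j in range(d)]
    return sum((points[i][j] - MZ[j][i]) ** 2
               for i in range(n) for j in range(d))

pts = [(0.0, 0.0), (1.0, 0.0), (4.0, 4.0), (5.0, 4.0)]
ctrs = [(0.5, 0.0), (4.5, 4.0)]
lbls = [0, 0, 1, 1]
```

Column i of M Z is exactly the center assigned to x_i, so the Frobenius norm collects the same squared distances as the K-means criterion.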