From Histograms to Multivariate Polynomial
Histograms and Shape Estimation
Assoc Prof Inge Koch
Statistics, School of Mathematical Sciences
University of Adelaide
Inge Koch (UNSW, Adelaide) Poly Histograms 19 March 2009 1 / 27
Motivation: determine the shape of data
We have 12 measurements on each of 27,994 blood cells
How many clusters?
How big are they and where are they?
Data: Centre for Immunology, St Vincent Hospital, Sydney
Immunologists want to differentiate
healthy individuals from those who are HIV+.
Look at the (Log-Data)
[Figure: four 3D scatterplots of CD3, CD4 and CD8 for 2000, 4000, 10000 and 27994 blood cells]
Histograms of the (Log-Data)
[Figure: two histograms of CD3, one with 10 bins and one with 5 bins]
Histograms of the (Log-Data)
How Many Clusters are in the Data?
One-dimensional data: 1 or 2 modes;
Two-dimensional data: 1 to 3 or 4 modes;
How many clusters are in the 12-dimensional data?
If the measurements were independent,
then the number of modes would be the product of the numbers of modes of the one-dimensional marginals
→ but this is not the case in our data
Can you think of a 3D example with k modes such that the 2D
projections have k − 1 modes?
Polynomial Histogram Estimators
Main idea
histograms have flat tops, so instead of
only estimating the number of points in each bin
estimate the shape separately in each bin
What are Polynomial Histogram Estimators?
Number of observations $n$, dimension $d$, binwidth $h$
$B_\ell$, with $|B_\ell| = h^d$: a bin with $n_\ell$ observations
The model for each bin $B_\ell$
1 histogram estimator (Hist): $f_0(x) = a_0$
2 first-order polynomial histogram estimator (Fophe): $f_1(x) = a_0 + a^T x$
3 second-order polynomial histogram estimator (Sophe): $f_2(x) = a_0 + a^T x + x^T A x$
Relationships for Coefficients
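In each bin $B_\ell$ the estimate $f_k$ satisfies
1 proportion of data: $\int_{B_\ell} f_k(x)\,dx = \frac{n_\ell}{n}$
2 local mean: $\int_{B_\ell} x f_k(x)\,dx = \frac{n_\ell}{n}\,\bar{x}_\ell$
3 local second moment: $\int_{B_\ell} x x^T f_k(x)\,dx = \frac{n_\ell}{n}\,M_\ell$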
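As a worked check that is not on the slides, conditions 1 and 2 already determine the first-order estimator. The sketch below assumes that $B_\ell$ is a cube of side $h$ centred at a point $t_\ell$, writes the linear model in centred form with hypothetical coefficients $b_0$ and $b$, and uses $\int_{B_\ell} (x - t_\ell)(x - t_\ell)^T dx = \frac{h^{d+2}}{12} I$. Condition 2 is used after subtracting $t_\ell$ times condition 1.

```latex
\begin{aligned}
f_1(x) &= b_0 + b^T (x - t_\ell), \qquad x \in B_\ell,\\
\int_{B_\ell} f_1(x)\,dx &= b_0\, h^d = \frac{n_\ell}{n}
  \;\Rightarrow\; b_0 = \frac{n_\ell}{n\,h^d},\\
\int_{B_\ell} (x - t_\ell)\, f_1(x)\,dx &= \frac{h^{d+2}}{12}\, b
  = \frac{n_\ell}{n}\,(\bar{x}_\ell - t_\ell)
  \;\Rightarrow\; b = \frac{12\, n_\ell}{n\,h^{d+2}}\,(\bar{x}_\ell - t_\ell),\\
f_1(x) &= \frac{1}{h^{d+2}}\,\frac{n_\ell}{n}
  \left[h^2 + 12\,(\bar{x}_\ell - t_\ell)^T (x - t_\ell)\right].
\end{aligned}
```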
The New Estimators
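In each bin $B_\ell$ with bin centre $t_\ell$
Fophe
$$\hat f_1(x) = \frac{1}{h^{d+2}}\,\frac{n_\ell}{n}\left[h^2 + 12\,(\bar{x}_\ell - t_\ell)^T (x - t_\ell)\right]$$
Sophe
$$\hat f_2(x) = \frac{1}{h^{d+4}}\,\frac{n_\ell}{n} \times \left\{ \frac{4 + 5d}{4}\,h^4 - 15 h^2 \operatorname{tr}(S_\ell) + 12 h^2 (x - t_\ell)^T (\bar{x}_\ell - t_\ell) + (x - t_\ell)^T \left[72 S_\ell + 108 \operatorname{diag}(S_\ell) - 15 h^2 I\right] (x - t_\ell) \right\}.$$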
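The two formulas can be evaluated directly. The following is a minimal NumPy sketch, not the author's code: the function names `bin_stats`, `fophe` and `sophe` are hypothetical, the bin is assumed to be a non-empty cube of side $h$ centred at $t_\ell$, and $S_\ell$ is taken to be the second-moment matrix of the bin's points about $t_\ell$ (the slides do not define it; with this reading the Sophe reproduces the proportion, local mean and local second moment of the bin).

```python
import numpy as np

def bin_stats(X, t, h):
    """Return (n_l, xbar_l - t, S_l) for the rows of X inside the cube of side h centred at t.

    Assumption: S_l is the second-moment matrix about the bin centre t,
    and the bin contains at least one point.
    """
    U = X - t
    U = U[np.all(np.abs(U) <= h / 2, axis=1)]
    n_l = U.shape[0]
    return n_l, U.mean(axis=0), U.T @ U / n_l

def fophe(x, t, h, n, n_l, m):
    """First-order polynomial histogram estimate at x; m = xbar_l - t."""
    d = t.size
    u = x - t
    return n_l / (n * h ** (d + 2)) * (h ** 2 + 12 * m @ u)

def sophe(x, t, h, n, n_l, m, S):
    """Second-order polynomial histogram estimate at x; m = xbar_l - t."""
    d = t.size
    u = x - t
    A = 72 * S + 108 * np.diag(np.diag(S)) - 15 * h ** 2 * np.eye(d)
    const = (4 + 5 * d) / 4 * h ** 4 - 15 * h ** 2 * np.trace(S)
    return n_l / (n * h ** (d + 4)) * (const + 12 * h ** 2 * u @ m + u @ A @ u)

# Usage on simulated two-dimensional data, for the bin centred at the origin.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
t, h = np.zeros(2), 1.0
n_l, m, S = bin_stats(X, t, h)
print(fophe(np.array([0.1, -0.2]), t, h, len(X), n_l, m))
print(sophe(np.array([0.1, -0.2]), t, h, len(X), n_l, m, S))
```

Only the count, the mean and the second-moment matrix are stored per bin, so evaluating the estimate never revisits the raw data.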
Roederer Data: 10,000 observations, CD4 & CD8
The performance of estimators
We assess the performance of estimators with the MSE.
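Let $\hat\theta$ be an estimator for a true quantity $\theta$. Then
$$\mathrm{MSE}(\hat\theta) = \left[\mathrm{bias}(\hat\theta)\right]^2 + \mathrm{var}(\hat\theta)$$
$$\mathrm{bias}(\hat\theta) = E\hat\theta - \theta$$
$$\mathrm{var}(\hat\theta) = E\left\{\left[\hat\theta - E\hat\theta\right]^2\right\} = E\left[\hat\theta^2\right] - \left[E\hat\theta\right]^2$$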
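The decomposition can be checked numerically. The sketch below is an illustration only and is not taken from the talk: it uses the divide-by-$n$ sample variance as a simple biased estimator (my choice of example) and compares the simulated MSE with bias² plus variance.

```python
import numpy as np

rng = np.random.default_rng(1)
theta = 4.0                                   # true variance of N(0, 2^2)
samples = rng.normal(0.0, 2.0, size=(20000, 10))
est = samples.var(axis=1)                     # biased estimator: divides by n = 10

bias = est.mean() - theta                     # simulated E[theta_hat] - theta
var = est.var()                               # simulated variance of theta_hat
mse = np.mean((est - theta) ** 2)             # simulated mean squared error

print(bias, var, mse, bias ** 2 + var)
```

The last two printed numbers agree, because the identity also holds for the empirical moments of the simulated estimates.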
Sophe’s Performance
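For a fixed point $x \in B_\ell$ we want the bias of $\hat f = \hat f_2$ at $x$
Consider
$$E\left[\hat f(x)\right] = E\left( \frac{1}{h^{d+4}}\,\frac{n_\ell}{n} \times \left\{ \frac{4 + 5d}{4}\,h^4 - 15 h^2 \operatorname{tr}(S_\ell) + 12 h^2 (x - t_\ell)^T (\bar{x}_\ell - t_\ell) + (x - t_\ell)^T \left[72 S_\ell + 108 \operatorname{diag}(S_\ell) - 15 h^2 I\right] (x - t_\ell) \right\} \right)$$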
Some Expectation Calculations I
We show that
$$E\left[\frac{n_\ell}{n}\,(\bar{x}_\ell - t_\ell)\right] = \int_{B_\ell} (y - t_\ell)\, f(y)\,dy$$
and so
$$E\left[\frac{12 h^2}{h^{d+4}}\,(x - t_\ell)^T\,\frac{n_\ell}{n}\,(\bar{x}_\ell - t_\ell)\right] = \frac{12}{h^{d+2}}\,(x - t_\ell)^T \int_{B_\ell} (y - t_\ell)\, f(y)\,dy$$
then use a Taylor expansion of $f$ about the bin centre $t_\ell$
$$f(y) = f(t_\ell) + (y - t_\ell)\,Df(t_\ell) + \frac{1}{2}(y - t_\ell)^2 D^2 f(t_\ell) + \frac{1}{6}(y - t_\ell)^3 D^3 f(t_\ell) + o\left(\|y - t_\ell\|^3\right)$$
Some Expectation Calculations II
The first non-zero integral gives
$$E\left[\frac{12}{h^{d+2}}\,(x - t_\ell)^T\,\frac{n_\ell}{n}\,(\bar{x}_\ell - t_\ell)\right] \approx (x - t_\ell)^T Df(t_\ell)$$
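This step can be spelled out as follows. The sketch assumes that $B_\ell$ is a cube of side $h$ centred at $t_\ell$, treats $Df(t_\ell)$ as the gradient (a column vector), and keeps only the constant and linear terms of the Taylor expansion.

```latex
\begin{aligned}
\int_{B_\ell} (y - t_\ell)\, f(t_\ell)\,dy &= 0
  \quad \text{(by symmetry of } B_\ell \text{ about } t_\ell\text{)},\\
\int_{B_\ell} (y - t_\ell)(y - t_\ell)^T dy \; Df(t_\ell)
  &= \frac{h^{d+2}}{12}\, Df(t_\ell),\\
\frac{12}{h^{d+2}}\,(x - t_\ell)^T \int_{B_\ell} (y - t_\ell)\, f(y)\,dy
  &\approx (x - t_\ell)^T Df(t_\ell).
\end{aligned}
```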
We prove similar results for all terms contributing to $E[\hat f(x)]$ . . . and finally get
The Bias
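$$E[\hat f(x)] = f(t_\ell) + (x - t_\ell)^T Df(t_\ell) + \frac{1}{2}(x - t_\ell)^2 D^2 f(t_\ell) + \frac{h^2}{12}\,(x - t_\ell)^T \left( \frac{\sum_i f_{uii}}{2} - \frac{f_{uuu}}{5} \right) + o(h^3)$$
Taylor expansion of $f$ about the bin centre $t_\ell$
$$f(x) = f(t_\ell) + (x - t_\ell)\,Df(t_\ell) + \frac{1}{2}(x - t_\ell)^2 D^2 f(t_\ell) + \frac{1}{6}(x - t_\ell)^3 D^3 f(t_\ell) + o\left(\|x - t_\ell\|^3\right)$$
so $\mathrm{bias}[\hat f(x)]$ depends on a difference of 3rd order derivatives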
Moving on . . .
and making some big leaps
We have the following steps in the performance calculations
1 pointwise bias and variance → MSE of $\hat f(x)$
2 integrated squared bias and integrated variance of $\hat f$ over all $x$
3 finally some asymptotics as $n \to \infty$
We want to know how Fophe and Sophe depend on the sample
size n, the binwidth h, and the dimension d
How Good are Fophe and Sophe
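| | Bias² | Variance | Rate of Convergence |
|---|---|---|---|
| hist | $C_H h^2$ | $\frac{1}{n h^d}$ | $n^{-2/(d+2)}$ |
| kernel | $C_K h^4$ | $\frac{R(K)}{n h^d}$ | $n^{-4/(d+4)}$ |
| fophe | $C_F h^4$ | $\frac{d + 1}{n h^d}$ | $n^{-4/(d+4)}$ |
| sophe | $C_S h^6$ | $\frac{(d + 1)(d + 2)}{2 n h^d}$ | $n^{-6/(d+6)}$ |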
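The rate column follows from balancing the two error terms: minimising $C h^{2p} + V/(n h^d)$ over $h$ gives $h \propto n^{-1/(2p+d)}$ and an error of order $n^{-2p/(2p+d)}$, with $2p = 2, 4, 6$ for the bias² orders in the table. A small sketch of this arithmetic (the helper name `optimal_rate` is hypothetical):

```python
def optimal_rate(p, d):
    """Exponents of n for the optimal binwidth and the resulting error.

    Balances a squared bias of order h^(2p) against a variance of order 1/(n h^d).
    """
    return -1.0 / (2 * p + d), -2.0 * p / (2 * p + d)

for name, p in [("hist", 1), ("kernel", 2), ("fophe", 2), ("sophe", 3)]:
    for d in (1, 2, 12):
        h_exp, err_exp = optimal_rate(p, d)
        print(f"{name:6s} d={d:2d}  h ~ n^{h_exp:.3f}  error ~ n^{err_exp:.3f}")
```

For $d = 12$ the exponents are $-2/14$, $-4/16$ and $-6/18$, so the advantage of the higher-order estimators remains but every rate is slow in high dimensions.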
Performance for 200, 1000 and 10000 Observations
[Figure: three panels, one per sample size; legend: kernel, Fophe, hist, Sophe]
27,994 observations: the kernel estimator takes 92× as long as the Sophe
Advantages of Fophe and Sophe
Computational advantages
1 a smaller number of bins is required
2 number of bins only needs to be approximately correct
Sophe better than Fophe in visual and computational aspects
→ use Sophe for data
Finding Modes with the Sophe
1 Fix binwidth $h_0$, # of bins $\nu_{bin}$, thresholds $\theta_0$, and $\kappa$.
2 Find bins with high density.
   1 Find $n_\ell$ in each bin, and discard bins that contain fewer than $\theta_0$ observations. Let $\mathcal{B}_0 = \{B_\ell : n_\ell > \theta_0\}$.
   2 Sort bins in $\mathcal{B}_0$ by # of observations, starting with largest.
3 Determine modes from $\mathcal{B}_0$ using (1) or (2) below.
   1 For $i, j = 1, \ldots, \kappa$ calculate pairwise distances $\Delta_{(i,j)}$ between the bin centres. For $i$ consider the set of nearest neighbours $nn(i) = \{(\Delta_{(i,j)}, n_{(j)}) : \Delta_{(i,j)} \le h_0\}$. $B_{(i)}$ contains a mode if $n_{(i)}$ is the maximum over $nn(i)$.
   2 If matrix $A_{(j)}$ is negative definite, then $B_{(j)}$ contains a mode.
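The steps above can be sketched in NumPy. This is one possible reading of the slide rather than the author's implementation: the function names are hypothetical, bins are assumed to be cubes on a grid anchored at the data minimum, $\kappa$ is interpreted as the number of fullest bins that are compared, bins are kept when $n_\ell > \theta_0$, and tied neighbouring bins are all flagged.

```python
import numpy as np

def find_modes(X, n_bins, theta0, kappa):
    """Option (1): flag a bin as a mode if its count is the largest among
    all retained bins whose centres lie within one binwidth of it."""
    d = X.shape[1]
    lo, hi = X.min(axis=0), X.max(axis=0)
    # Common binwidth h0 (assumption: cubic bins); padded so the maximum stays inside.
    h0 = (hi - lo).max() / n_bins * (1 + 1e-9)
    edges = [lo[k] + h0 * np.arange(n_bins + 1) for k in range(d)]
    counts, _ = np.histogramdd(X, bins=edges)

    idx = np.argwhere(counts > theta0)         # B_0: bins with more than theta0 points
    n = counts[tuple(idx.T)]
    order = np.argsort(-n)[:kappa]             # the kappa fullest bins, largest first
    idx, n = idx[order], n[order]

    centres = lo + (idx + 0.5) * h0
    dist = np.linalg.norm(centres[:, None, :] - centres[None, :, :], axis=2)
    modes = []
    for i in range(len(n)):
        neighbours = dist[i] <= h0 * (1 + 1e-9)   # includes bin i itself
        if n[i] >= n[neighbours].max():
            modes.append(centres[i])
    return np.array(modes)

def has_mode(A):
    """Option (2): a symmetric matrix A is negative definite
    exactly when all of its eigenvalues are negative."""
    return bool(np.all(np.linalg.eigvalsh(A) < 0))
```

With a distance threshold of $h_0$, only bins that share a face count as neighbours, so two diagonally adjacent bins can both be flagged. Option (2) instead uses the quadratic coefficient matrix of the Sophe in a bin and needs no comparison between bins.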
Look at the (Log-Data)
[Figure: four 3D scatterplots of CD3, CD4 and CD8 for 2000, 4000, 10000 and 27994 blood cells]
Modes for 12-Dimensional Data
Use 5 bins in each variable
compare # of modes and % of non-empty bins
| variables | # modes | # of bins | % non-empty |
|---|---|---|---|
| CDs 3, 4, 8 | 3 | 125 | 39.2 |
| + CDs 14, 19, 56 | 5 | 15,625 | 2.6 |
| all 12 | 9 | 244,140,625 | 0.0015 |
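The bin counts in the table are $5^3$, $5^6$ and $5^{12}$. A short sketch of the arithmetic follows (the helper name `bin_summary` is hypothetical). It also converts the rounded percentages back into approximate numbers of non-empty bins and reports the largest percentage that 27,994 observations could occupy.

```python
def bin_summary(n_obs, rows, bins_per_var=5):
    """For each (number of variables, % non-empty) pair return
    (variables, total bins, approx. non-empty bins, max possible % non-empty)."""
    out = []
    for n_vars, pct_non_empty in rows:
        n_bins = bins_per_var ** n_vars
        non_empty = round(pct_non_empty / 100 * n_bins)
        max_pct = 100 * min(n_obs, n_bins) / n_bins
        out.append((n_vars, n_bins, non_empty, max_pct))
    return out

for row in bin_summary(27994, [(3, 39.2), (6, 2.6), (12, 0.0015)]):
    print(row)
```

This gives roughly 49, 406 and 3,662 non-empty bins. In 12 dimensions even one observation per bin could fill at most about 0.011% of the grid, which is why only the occupied bins need to be stored.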
The End
J Jing, I Koch and K Naito (2009). Polynomial Histograms for
Multivariate Density and Mode Estimation. Preprint.
Thank you