
Clustering and Testing in High-Dimensional Data

M. Radavičius, G. Jakimauskas, J. Sušinskas (Institute of Mathematics and Informatics, Vilnius, Lithuania)

The problem

Let X = X_N be a sample of size N assumed to satisfy a d-dimensional Gaussian mixture model (d is supposed to be large).

Because of the large dimension it is natural to project the sample onto k-dimensional (k = 1, 2, …) linear subspaces using the projection pursuit method (Huber (1985), Friedman (1987)), which gives the best selection of these subspaces. If the distribution of the standardized sample on the complement space is standard Gaussian, the linear subspace H is called the discriminant subspace. E.g., if we have q Gaussian mixture components with equal covariance matrices, then the dimension of the discriminant subspace is q − 1.

Having an estimate of the discriminant subspace, we can perform classification much more easily using the projected sample.

The sequential procedure applied to the standardized sample is as follows (k = 1, 2, …, continuing until the hypothesis of the discriminant subspace holds for some k):

1. Find the best k-dimensional linear subspace using the projection pursuit method (Rudzkis and Radavičius (1999)).

2. Fit a Gaussian mixture model to the sample projected onto the k-dimensional linear subspace (Rudzkis and Radavičius (1995)).

3. Test the goodness-of-fit of the estimated d-dimensional model, assuming that the distribution on the complement space is standard Gaussian. If the test fails, increase k and go to step 1.

The problem in step 1 is to find the basis vectors in the high-dimensional space (we do not cover this problem here). The problem in step 3 (in the common approach) is comparing a nonparametric density estimate with a parametric one in high-dimensional space.
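Schematically, the loop can be written as follows (a minimal sketch; the callables `find_best_subspace`, `fit_mixture` and `passes_gof_test` are hypothetical placeholders for steps 1-3, not procedures from the cited papers):

```python
import numpy as np

def sequential_subspace_search(X, k_max, find_best_subspace, fit_mixture, passes_gof_test):
    """Schematic driver for the three-step procedure above.

    X is the standardized sample (N x d).  The three callables stand in for
    step 1 (projection pursuit), step 2 (mixture fitting on the projected
    sample) and step 3 (the goodness-of-fit test of the d-dimensional model
    with a standard Gaussian complement).
    """
    for k in range(1, k_max + 1):
        B = find_best_subspace(X, k)      # step 1: d x k basis of the candidate subspace
        X_proj = X @ B                    # projection (assumes an orthonormal basis)
        model = fit_mixture(X_proj)       # step 2: Gaussian mixture on the projection
        if passes_gof_test(X, B, model):  # step 3: if the fit is acceptable, stop
            return k, B, model
        # otherwise increase k and repeat from step 1
    return None                           # no k <= k_max passed the test
```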

We present a simple, data-driven and computationally efficient procedure for testing goodness-of-fit. The procedure is based on the well-known interpretation of goodness-of-fit testing as a classification problem, a special sequential data partition procedure, randomization and resampling, and elements of sequential testing. Monte-Carlo simulations are used to assess the performance of the procedure.

This procedure can also be applied to testing independence of components in high-dimensional data.

We present some preliminary computer simulation results.

Introduction

Let

$$X \mid \{Y = i\} \sim N_d(M_i, R_i), \qquad \mathbf{P}(Y = i) = p_i, \quad i = 1, 2, \ldots, q,$$

so that X has the mixture density

$$f(x) = \sum_{i=1}^{q} p_i\, \varphi(x, M_i, R_i),$$

where φ(·, M, R) denotes the d-dimensional Gaussian density with mean M and covariance matrix R.

Consider the general classification problem of estimating the a posteriori probabilities

$$\pi(i, x) = \mathbf{P}(Y = i \mid X = x)$$

from the sample

$$X = \{X_1, X_2, \ldots, X_N\}.$$

Under these assumptions we have

$$\pi(i, x) = \frac{p_i\, \varphi(x, M_i, R_i)}{f(x)}.$$

Usually the EM algorithm is used to estimate the a posteriori probabilities. Denote

$$\Pi_N = \{\pi(i, X),\ i = 1, 2, \ldots, q,\ X \in X_N\};$$

then the EM algorithm is the following iterative procedure, alternately updating the estimated posterior probabilities (E-step) and the parameter estimates (M-step):

$$\cdots \to \hat{\Pi}_N \to \hat{\theta} \to \hat{\Pi}_N \to \hat{\theta} \to \cdots$$

The EM algorithm converges to some local maximum of the log-likelihood function

$$l(\theta) = \sum_{j=1}^{N} \log f(X_j, \theta),$$

which is usually not equal to the global maximum

$$\theta^* = \arg\max_{\theta}\, l(\theta).$$
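A minimal EM sketch for a Gaussian mixture, assuming random-point initialization and a fixed iteration count (illustrative only; in practice several restarts are used precisely because only a local maximum is reached):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gaussian_mixture(X, q, n_iter=100, seed=0):
    """Minimal EM for a q-component Gaussian mixture on X (N x d)."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    p = np.full(q, 1.0 / q)                    # mixing weights p_i
    M = X[rng.choice(N, q, replace=False)]     # initial means: random sample points
    R = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(q)])
    for _ in range(n_iter):
        # E-step: posterior probabilities pi(i, X_j)
        G = np.column_stack([p[i] * multivariate_normal.pdf(X, M[i], R[i])
                             for i in range(q)])
        G /= G.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and covariances
        Nk = G.sum(axis=0)
        p = Nk / N
        M = (G.T @ X) / Nk[:, None]
        for i in range(q):
            D = X - M[i]
            R[i] = (G[:, i, None] * D).T @ D / Nk[i] + 1e-6 * np.eye(d)
    return p, M, R, G
```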

Let, for some subspace

$$H = \operatorname{span}(v_1, v_2, \ldots, v_k), \qquad k < d,$$

the following equality hold:

$$\mathbf{P}(Y = i \mid X = x) = \mathbf{P}(Y = i \mid X_H = x_H),$$

where x_H denotes the projection of x onto H with respect to the inner product

$$\langle u, h \rangle = u^{\mathrm{T}} V h, \qquad V = \operatorname{cov}(X, X),$$

and let H have the minimal dimension among subspaces with this property; then this subspace is called the discriminant subspace. (Note that the equality is preserved when H is enlarged, so it is the smallest such subspace that is informative.) We do not lose any information on the a posteriori probabilities when we project the sample onto the discriminant subspace.

We can get an estimate of the discriminant subspace,

$$\hat{H} = \operatorname{span}(\hat{v}_1, \hat{v}_2, \ldots, \hat{v}_k), \qquad k < d,$$

using a projection pursuit procedure (see, e.g., J. H. Friedman (1987), S. A. Aivazyan (1996), R. Rudzkis and M. Radavičius (1998)).
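Assuming the inner product reconstructed above, projecting the sample onto the estimated subspace could be sketched as follows (the helper name is hypothetical):

```python
import numpy as np

def project_onto_subspace(X, B, V):
    """Project rows of X onto span(B) w.r.t. the inner product <u, h> = u^T V h.

    X : (N, d) sample, B : (d, k) basis of the estimated subspace H_hat,
    V : (d, d) sample covariance of X.  Returns the (N, k) coordinates of the
    projections in the basis B, i.e. (B^T V B)^{-1} B^T V x for each row x.
    """
    G = B.T @ V @ B                              # Gram matrix of the basis
    return np.linalg.solve(G, B.T @ V @ X.T).T
```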

Test statistics

Let X = {X(1), X(2), …, X(N)} be a sample of size N of i.i.d. random vectors with a common distribution function F on R^d. Let 𝓕_H and 𝓕_A be two disjoint classes of d-dimensional distributions. Consider the nonparametric hypothesis testing problem:

$$H\colon F \in \mathcal{F}_H \quad \text{vs.} \quad A\colon F \in \mathcal{F}_A.$$

Let 𝓕 = 𝓕_H ∪ 𝓕_A.

Consider a mixture model

$$F(p) = (1 - p)\, F_H + p\, F, \qquad p \in (0, 1),$$

of two populations H and A with d.f. F_H and F, respectively. Fix p and let Y = Y(p) denote a random vector with the mixture distribution F(p). Let Z = Z(p) be the posterior probability of the population A given Y, i.e.

$$Z = \mathbf{P}\{A \mid Y\} = \frac{p\, f(Y)}{p\, f(Y) + (1 - p)\, f_H(Y)}.$$

Here f and f_H denote the distribution densities of F and F_H, respectively. Let us introduce the loss function l(F, F_H) = E(Z − p)²; under the hypothesis (f = f_H) we have Z ≡ p, so the loss vanishes.
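A hedged Monte-Carlo sketch of evaluating this loss, assuming user-supplied samplers and densities (all callables are hypothetical placeholders):

```python
import numpy as np

def loss_monte_carlo(sample_F, sample_FH, f, f_H, p, n=100_000, seed=0):
    """Monte-Carlo estimate of the loss l(F, F_H) = E (Z - p)^2.

    sample_F(m, rng) / sample_FH(m, rng) draw m vectors from F and F_H;
    f / f_H evaluate the corresponding densities.  Y is drawn from the
    mixture F(p) = (1 - p) F_H + p F.
    """
    rng = np.random.default_rng(seed)
    m = rng.binomial(n, p)                          # how many points come from F
    Y = np.vstack([sample_F(m, rng), sample_FH(n - m, rng)])
    Z = p * f(Y) / (p * f(Y) + (1 - p) * f_H(Y))    # posterior of population A
    return float(np.mean((Z - p) ** 2))             # equals 0 when f == f_H
```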

Let 𝒫 = {𝒫_k, k = 0, 1, …, K} be a sequence of nested partitions of R^d, 𝒫_0 = {R^d}, possibly dependent on Y, and let {𝓕_k, k = 0, 1, …, K} be the corresponding sequence of σ-algebras generated by these partitions.

A computationally efficient choice of 𝒫 is the sequential dyadic coordinate-wise partition minimizing at each step the mean square error.

Let X^(H) = {X^(H)(1), X^(H)(2), …, X^(H)(M)} be a sample of size M of i.i.d. vectors from the population H, independent of X. Set

$$Y = X \cup X^{(H)},$$

the pooled sample of size N + M.

In view of the definition of the loss function, a natural choice of the test statistic would be the χ²-type statistic

$$T_k = (N + M)\, \mathbf{E}_{MN} (Z_k - p)^2, \qquad \text{where } Z_k = \mathbf{E}_{MN}[Z \mid \mathcal{F}_k],$$

for some k ∈ {1, 2, …, K}, which can be treated as a smoothing parameter. Here E_MN stands for the expectation with respect to the empirical distribution F̂ of Y.

However, since the optimal value of k is unknown, we prefer the following definition of the test statistic:

$$T = \max_{1 \le k \le K} \frac{T_k - a_k}{b_k},$$

where a_k and b_k are centering and scaling parameters to be specified.

We have selected the following test statistics:

$$T_k = \frac{S_k - (k - 1)}{\sqrt{2 (k - 1)}}, \qquad k = 1, 2, \ldots, K,$$

where (for N = M)

$$S_k = \sum_{j=1}^{k} \frac{\bigl(a^{(1)}_{kj} - a^{(2)}_{kj}\bigr)^2}{a^{(1)}_{kj} + a^{(2)}_{kj}},$$

and a^{(1)}_{kj} and a^{(2)}_{kj} are the numbers of elements of the sample X (resp. of the sample X^(H)) in the jth cell of the kth partition 𝒫_k.

Illustration of the sequential dyadic partitioning procedure

Here we have an example (at some step) of the sequential partitioning procedure with two samples of two-dimensional data. The next partition is selected among all current squares and all divisions along each dimension (in this case d = 2) so as to achieve the minimum mean square error of grouping.
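A greedy sketch of this partitioning, under the assumption that the grouping error is measured as the within-cell sum of squares of the sample labels (one way to read the mean square error criterion above; all names are illustrative):

```python
import numpy as np

def sequential_dyadic_partition(Y, z, K):
    """Greedy dyadic coordinate-wise partition of the pooled sample.

    Y : (n, d) pooled sample X u X^(H);  z : (n,) labels, 1 for points of X
    and 0 for points of X^(H).  At every step each current cell may be halved
    at its midpoint along any coordinate; the split giving the largest
    reduction of the within-cell sum of squares of z (i.e. of the error of
    approximating Z by cell averages) is applied.  Returns, for each k, the
    list of index arrays of the k cells.
    """
    n, d = Y.shape

    def sse(idx):
        return 0.0 if idx.size == 0 else float(((z[idx] - z[idx].mean()) ** 2).sum())

    cells = [(np.arange(n), Y.min(axis=0), Y.max(axis=0))]   # (indices, lower, upper)
    partitions = [[cells[0][0]]]
    for _ in range(K - 1):
        best = None                                          # (gain, cell, dim, midpoint)
        for pos, (idx, lo, hi) in enumerate(cells):
            if idx.size < 2:
                continue
            for j in range(d):
                mid = 0.5 * (lo[j] + hi[j])
                mask = Y[idx, j] <= mid
                gain = sse(idx) - sse(idx[mask]) - sse(idx[~mask])
                if best is None or gain > best[0]:
                    best = (gain, pos, j, mid)
        if best is None:
            break
        _, pos, j, mid = best
        idx, lo, hi = cells.pop(pos)
        mask = Y[idx, j] <= mid
        hi_left, lo_right = hi.copy(), lo.copy()
        hi_left[j], lo_right[j] = mid, mid
        cells.append((idx[mask], lo, hi_left))
        cells.append((idx[~mask], lo_right, hi))
        partitions.append([c[0] for c in cells])
    return partitions
```

Each step costs O(cells · d) candidate evaluations, which keeps the procedure computationally cheap even for large K.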

Preliminary simulation results

The computer simulations have been performed using the Monte-Carlo method (typically 100 independent simulations). The sample sizes of X and X^(H) were selected equal (typically N = M = 1000).

The first task is to evaluate, by computer simulation, the test statistics T_k in the case when the hypothesis H holds. The centering and scaling parameters of the test statistics were selected so that the distribution of the test statistic is approximately standard Gaussian for each k not too close to 1 and K.

The computer simulation results show that, for a very wide range of dimensions, sample sizes and distributions, the behaviour of the test statistics in the case when the hypothesis H holds is very similar.
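One simple way to obtain such centering and scaling parameters, sketched under the assumption that the sequence (T_k) can be re-simulated with H true (the helper `simulate_Tk` is hypothetical):

```python
import numpy as np

def calibrate_centering_scaling(simulate_Tk, n_rep=100):
    """Estimate centering a_k and scaling b_k from replications under H.

    simulate_Tk() should return one realization of the sequence (T_k),
    k = 1, ..., K, simulated with the hypothesis H true.  The empirical mean
    and standard deviation over replications serve as a_k and b_k.
    """
    runs = np.array([simulate_Tk() for _ in range(n_rep)])   # shape (n_rep, K)
    return runs.mean(axis=0), runs.std(axis=0, ddof=1)
```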

Fig. 1. Behaviour of Tk when the hypothesis holds

Here we have sample size N=1000, dimension d=100, and two samples of the d-dimensional standard Gaussian distribution. Shown are the pointwise maxima and minima over 100 realizations, together with the corresponding maxima and minima after discarding the 5 per cent largest values at each point.

[Plot of T_k against k: horizontal axis 0 to 2000, vertical axis -4 to 4.]

Fig. 2. Behaviour of Tk when the hypothesis does not hold

Here we have sample size N=1000, dimension d=10, q=3, and a Gaussian mixture with means (-4, -3, 0, 0, 0, …), (0, 6, 0, 0, 0, …), (4, -3, 0, 0, 0, …). The sample is projected onto a one-dimensional subspace. This is an extremely ill-fitting situation.

[Plot of T_k against k: horizontal axis 0 to 500, vertical axis -10 to 80.]

Fig. 3. Behaviour of Tk (control data)

This is a control example for the data in Fig. 2, with the data projected onto the true two-dimensional discriminant subspace.

[Plot of T_k against k: horizontal axis 0 to 500, vertical axis -4 to 6.]

Fig. 4. Behaviour of Tk when the hypothesis does not hold

Here we have sample size N=1000, dimension d=10, q=3, and a Gaussian mixture with means (-4, -1, 0, 0, 0, …), (0, 2, 0, 0, 0, …), (4, -1, 0, 0, 0, …). The sample is projected onto a one-dimensional subspace.

[Plot of T_k against k: horizontal axis 0 to 500, vertical axis -5 to 20.]

Fig. 5. Behaviour of Tk (control data)

This is a control example for the data in Fig. 4, with the data projected onto the true two-dimensional discriminant subspace.

[Plot of T_k against k: horizontal axis 0 to 500, vertical axis -4 to 8.]

Fig. 6. Behaviour of Tk when the hypothesis does not hold

Here we have sample size N=1000, dimension d=10, q=3, and a Gaussian mixture with means (-4, -0.5, 0, 0, 0, …), (0, 1, 0, 0, 0, …), (4, -0.5, 0, 0, 0, …). The sample is projected onto a one-dimensional subspace.

[Plot of T_k against k: horizontal axis 0 to 500, vertical axis -4 to 12.]

Fig. 7. Behaviour of Tk (control data)

This is a control example for the data in Fig. 6, with the data projected onto the true two-dimensional discriminant subspace.

[Plot of T_k against k: horizontal axis 0 to 500, vertical axis -4 to 8.]

Fig. 8. Behaviour of Tk when the hypothesis does not hold

Here we have sample size N=1000, dimension d=20, and the standard Cauchy distribution. The sample X^(H) is simulated with two independent blocks of components, of sizes d1 = d/2 and d2 = d/2.

[Plot of T_k against k: horizontal axis 0 to 1200, vertical axis -2 to 8.]

Fig. 9. Behaviour of Tk (control data)

This is a control example for the data in Fig. 8, where the sample X^(H) is simulated with the same distribution as the sample X.

[Plot of T_k against k: horizontal axis 0 to 1200, vertical axis -2 to 8.]

Fig. 10. Behaviour of Tk when the hypothesis does not hold

Here we have sample size N=1000, dimension d=10, and the Student distribution with 3 degrees of freedom. The numbers of independent components are d1 = 1 and d2 = d − 1.

[Plot of T_k against k: horizontal axis 0 to 1200, vertical axis -4 to 8.]

Fig. 11. Behaviour of Tk (control data)

This is a control example for the data in Fig. 10, where the sample X^(H) is simulated with the same distribution as the sample X.

[Plot of T_k against k: horizontal axis 0 to 1200, vertical axis -4 to 8.]

end.