Very Sparse Random Projections


Ping Li
Department of Statistics, Stanford University
Stanford, CA 94305, USA
[email protected]

Trevor J. Hastie
Department of Statistics, Stanford University
Stanford, CA 94305, USA
[email protected]

Kenneth W. Church
Microsoft Research, Microsoft Corporation
Redmond, WA 98052, USA
[email protected]

ABSTRACT
There has been considerable interest in random projections, an approximate algorithm for estimating distances between pairs of points in a high-dimensional vector space. Let A ∈ R^{n×D} be our n points in D dimensions. The method multiplies A by a random matrix R ∈ R^{D×k}, reducing the D dimensions down to just k for speeding up the computation. R typically consists of entries of standard normal N(0, 1). It is well known that random projections preserve pairwise distances (in expectation). Achlioptas proposed sparse random projections by replacing the N(0, 1) entries in R with entries in {−1, 0, 1} with probabilities {1/6, 2/3, 1/6}, achieving a threefold speedup in processing time.

We recommend using R with entries in {−1, 0, 1} with probabilities {1/(2√D), 1 − 1/√D, 1/(2√D)}, for achieving a significant √D-fold speedup with little loss in accuracy.

Categories and Subject Descriptors
H.2.8 [Database Applications]: Data Mining

General Terms
Algorithms, Performance, Theory

Keywords
Random projections, Sampling, Rates of convergence

1. INTRODUCTION
Random projections [1, 43] have been used in Machine Learning [2, 4, 5, 13, 14, 22], VLSI layout [42], analysis of Latent Semantic Indexing (LSI) [35], set intersections [7, 36], finding motifs in bio-sequences [6, 27], face recognition [16], and privacy-preserving distributed data mining [31], to name a few. The AMS sketching algorithm [3] is also one form of random projections.

We define a data matrix A of size n × D to be a collection of n data points {u_i}_{i=1}^n ∈ R^D. All pairwise distances can


be computed as AAᵀ, at the cost of time O(n²D), which is often prohibitive for large n and D in modern data mining and information retrieval applications.

To speed up the computations, one can generate a random projection matrix R ∈ R^{D×k} and multiply it with the original matrix A ∈ R^{n×D} to obtain a projected data matrix

    B = (1/√k) A R ∈ R^{n×k},    k ≪ min(n, D).    (1)

The (much smaller) matrix B preserves all pairwise distances of A in expectation, provided that R consists of i.i.d. entries with zero mean and constant variance. Thus, we can achieve a substantial cost reduction for computing AAᵀ, from O(n²D) to O(nDk + n²k).

In information retrieval, we often do not have to materialize AAᵀ. Instead, databases and search engines are interested in storing the projected data B in main memory for efficiently responding to input queries. While the original data matrix A is often too large, the projected data matrix B can be small enough to reside in the main memory.
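As a concrete illustration, the following is a minimal sketch in Python/NumPy of conventional random projections and the cost trade-off in (1). The matrix sizes and the Gaussian test data are assumptions made up for the example; they are not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n, D, k = 100, 10_000, 50            # illustrative sizes (assumption)

A = rng.standard_normal((n, D))      # data matrix: n points in D dimensions
R = rng.standard_normal((D, k))      # conventional projection matrix: i.i.d. N(0, 1)

B = A @ R / np.sqrt(k)               # Eq. (1): B = (1/sqrt(k)) A R, an n x k matrix

exact = A @ A.T                      # all pairwise inner products, O(n^2 D)
approx = B @ B.T                     # the same quantities estimated from B, O(nDk + n^2 k)
rel_err = np.abs(approx - exact).mean() / np.abs(exact).mean()
print(f"mean relative error of inner products: {rel_err:.3f}")
```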

The entries of R (denoted by {r_{ji}}, j = 1, ..., D, i = 1, ..., k) should be i.i.d. with zero mean. In fact, this is the only necessary condition for preserving pairwise distances [4]. However, different choices of r_{ji} can change the variances (average errors) and error tail bounds. It is often convenient to let r_{ji} follow a symmetric distribution about zero with unit variance. A "simple" distribution is the standard normal¹, i.e.,

    r_{ji} ~ N(0, 1),   E(r_{ji}) = 0,   E(r²_{ji}) = 1,   E(r⁴_{ji}) = 3.

It is "simple" in terms of theoretical analysis, but not in terms of random number generation. For example, a uniform distribution is easier to generate than normals, but the analysis is more difficult.

¹ The normal distribution is 2-stable. It is one of the few stable distributions that have a closed-form density [19].

In this paper, when R consists of normal entries, we call this special case the conventional random projections, about which many theoretical results are known. See the monograph by Vempala [43] for further references.

We derive some theoretical results when R is not restricted to normals. In particular, our results lead to significant improvements over the so-called sparse random projections.

1.1 Sparse Random Projections
In his novel work, Achlioptas [1] proposed using a projection matrix R with i.i.d. entries

    r_{ji} = √s × { +1 with probability 1/(2s),  0 with probability 1 − 1/s,  −1 with probability 1/(2s) }.    (2)

Examples of symmetric zero-mean distributions for r_{ji}, together with their kurtoses⁵, include:

• A continuous uniform distribution symmetric about zero; its kurtosis is −6/5.
• A discrete uniform distribution symmetric about zero, with N points; its kurtosis is −(6/5)(N² + 1)/(N² − 1), ranging between −2 (when N = 2) and −6/5 (when N → ∞). The case with N = 2 is the same as (2) with s = 1.
• Discrete and continuous U-shaped distributions.

⁴ Computing all marginal norms of A costs O(nD), which is often negligible. As important summary statistics, the marginal norms may already be computed during various stages of processing, e.g., normalization and term weighting.

⁵ Note that the kurtosis cannot be smaller than −2 because of the Cauchy-Schwarz inequality: E²(r²_{ji}) ≤ E(r⁴_{ji}). One may consult http://en.wikipedia.org/wiki/Kurtosis for references to the kurtoses of various distributions.
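To illustrate the scheme in (2), here is a minimal sketch, assuming NumPy, of generating a very sparse projection matrix with the recommended s = √D and using it exactly as in (1). The data and sizes are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
n, D, k = 100, 10_000, 50                  # illustrative sizes (assumption)
s = np.sqrt(D)                             # the recommended sparsity, s = sqrt(D)

A = rng.standard_normal((n, D))

# Entries of R follow Eq. (2): +sqrt(s), 0, -sqrt(s) with probs 1/(2s), 1-1/s, 1/(2s).
R = rng.choice([np.sqrt(s), 0.0, -np.sqrt(s)],
               size=(D, k),
               p=[1 / (2 * s), 1 - 1 / s, 1 / (2 * s)])

B = A @ R / np.sqrt(k)                     # only ~ D/s = sqrt(D) rows of R are nonzero per column

# Inner products are still preserved in expectation.
print(np.mean(np.abs(B @ B.T - A @ A.T)) / np.mean(np.abs(A @ A.T)))
```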

5. HEAVY-TAIL AND TERM WEIGHTING
Very sparse random projections are useful even for heavy-tailed data, mainly because of term weighting.

We have seen that bounded fourth and third moments are needed for analyzing the convergence of moments (variance) and the convergence to normality, respectively. The proof of asymptotic normality in Appendix A suggests that we need only slightly more than bounded second moments to ensure asymptotic normality. In heavy-tailed data, however, even the second moment may not exist.

Heavy-tailed data are ubiquitous in large-scale data mining applications (especially Internet data) [25, 34]. The pairwise distances computed from heavy-tailed data are usually dominated by outliers, i.e., exceptionally large entries.

Pairwise vector distances are meaningful only when all dimensions of the data are more or less equally important. For heavy-tailed data, such as the (unweighted) term-by-document matrix, pairwise distances may be misleading. Therefore, in practice, various term weighting schemes have been proposed, e.g., [33, Chapter 15.2] [10, 30, 39, 45], to weight the entries instead of using the original data.

It is well known that choosing an appropriate term weighting method is vital. For example, as shown in [23, 26], in text categorization using support vector machines (SVM), choosing an appropriate term weighting scheme is far more important than tuning the kernel functions of the SVM. See similar comments in [37] for work on the naive Bayes text classifier.

We list two popular and simple weighting schemes. One variant of the logarithmic weighting keeps zero entries and replaces any non-zero count with 1 + log(original count). Another scheme is the square root weighting. In the same spirit as the Box-Cox transformation [44, Chapter 6.8], these weighting schemes significantly reduce the kurtosis (and skewness) of the data and make the data resemble normal.
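For concreteness, here is a small sketch, assuming NumPy, of the two weighting schemes and their effect on kurtosis. The toy count data are invented for illustration; they are not the MSN data used later.

```python
import numpy as np

def excess_kurtosis(x):
    """Sample excess kurtosis: E[(x - mean)^4] / var^2 - 3."""
    x = np.asarray(x, dtype=float)
    c = x - x.mean()
    return np.mean(c**4) / np.mean(c**2) ** 2 - 3.0

def sqrt_weight(counts):
    # Square root weighting: replace each count u with sqrt(u).
    return np.sqrt(np.asarray(counts, dtype=float))

def log_weight(counts):
    # Logarithmic weighting: keep zeros, replace nonzero u with 1 + log(u).
    counts = np.asarray(counts, dtype=float)
    out = np.zeros_like(counts)
    nz = counts > 0
    out[nz] = 1.0 + np.log(counts[nz])
    return out

# Toy heavy-tailed term counts (hypothetical data).
rng = np.random.default_rng(2)
u = np.floor(rng.pareto(1.5, size=65536))

for name, w in [("unweighted", u), ("square root", sqrt_weight(u)), ("logarithmic", log_weight(u))]:
    print(f"{name:12s} kurtosis = {excess_kurtosis(w):10.1f}")
```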

Therefore, it is fair to say that assuming finite moments (third or fourth) is reasonable whenever the computed distances are meaningful.

However, there are also applications in which pairwise distances need not carry any clear meaning, for example, using random projections to estimate joint sizes (set intersections). If we expect the original data to be severely heavy-tailed and no term weighting will be applied, we recommend using s = O(1).

Finally, we shall point out that very sparse random projections can be fairly robust against heavy-tailed data when s = √D. For example, instead of assuming finite fourth moments, as long as

    D Σ_{j=1}^D u⁴_{1,j} / ( Σ_{j=1}^D u²_{1,j} )²

grows slower than O(√D), we can still achieve the convergence of variances with s = √D, in Lemma 1. Similarly, analyzing the rate of convergence to normality only requires that

    √D Σ_{j=1}^D |u_{1,j}|³ / ( Σ_{j=1}^D u²_{1,j} )^{3/2}

grows slower than O(D^{1/4}). An even weaker condition is needed to only ensure asymptotic normality. We provide some additional analysis of heavy-tailed data in Appendix B.
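These two growth conditions are easy to check empirically. A minimal diagnostic sketch, assuming NumPy; the Pareto-distributed vectors are an invented stand-in for real data.

```python
import numpy as np

def variance_diagnostic(u):
    # D * sum(u^4) / (sum(u^2))^2 ; should grow slower than O(sqrt(D)) when s = sqrt(D).
    u = np.asarray(u, dtype=float)
    D = u.size
    return D * np.sum(u**4) / np.sum(u**2) ** 2

def normality_diagnostic(u):
    # sqrt(D) * sum(|u|^3) / (sum(u^2))^(3/2) ; should grow slower than O(D^(1/4)).
    u = np.asarray(u, dtype=float)
    D = u.size
    return np.sqrt(D) * np.sum(np.abs(u) ** 3) / np.sum(u**2) ** 1.5

rng = np.random.default_rng(3)
for D in [10**3, 10**4, 10**5]:
    u = rng.pareto(2.5, size=D)            # hypothetical heavy-tailed data
    print(D, variance_diagnostic(u) / np.sqrt(D), normality_diagnostic(u) / D**0.25)
```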

6. EXPERIMENTAL RESULTS
Some experimental results are presented as a sanity check, using one pair of words, "THIS" and "HAVE", from two rows of a term-by-document matrix provided by MSN, with D = 2^16 = 65536. That is, u_{1,j} (u_{2,j}) is the number of occurrences of the word "THIS" (the word "HAVE") in the j-th document (Web page), j = 1 to D. Some summary statistics are listed in Table 1.

The data are certainly heavy-tailed, as the kurtoses of u_{1,j} and u_{2,j} are 195 and 215, respectively, far above zero. Therefore we do not expect that very sparse random projections with s = D/log D ≈ 6000 work well, though the results are actually not disastrous, as shown in Figure 1(d).

Table 1: Some summary statistics of the word pair, THIS (u1) and HAVE (u2). γ₂ denotes the kurtosis. The ratio E(u²_{1,j}u²_{2,j}) / (E(u²_{1,j})E(u²_{2,j}) + E²(u_{1,j}u_{2,j})) affects the convergence of Var(v₁ᵀv₂) (see the proof of Lemma 3). These expectations are computed empirically from the data. Two popular term weighting schemes are applied: the square root weighting replaces u_{1,j} with √u_{1,j}, and the logarithmic weighting replaces any non-zero u_{1,j} with 1 + log u_{1,j}.

                                               Unweighted   Square root   Logarithmic
  γ₂(u_{1,j})                                       195.1         13.03          1.58
  γ₂(u_{2,j})                                       214.7         17.05          4.15
  E(u⁴_{1,j}) / E²(u²_{1,j})                        180.2         12.97          5.31
  E(u⁴_{2,j}) / E²(u²_{2,j})                        205.4         18.43          8.21
  E(u²_{1,j}u²_{2,j}) / (E(u²_{1,j})E(u²_{2,j}) + E²(u_{1,j}u_{2,j}))
                                                     78.0          7.62          3.34
  cos(∠(u1, u2))                                    0.794         0.782         0.754
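A small sketch, assuming NumPy, of how the summary statistics in Table 1 can be computed. The two count vectors below are hypothetical stand-ins for the MSN word vectors, which are not included here.

```python
import numpy as np

def summary_stats(u1, u2):
    u1, u2 = np.asarray(u1, float), np.asarray(u2, float)
    kurt = lambda x: np.mean((x - x.mean())**4) / np.var(x)**2 - 3.0
    cosine = u1 @ u2 / (np.linalg.norm(u1) * np.linalg.norm(u2))
    cross = np.mean(u1**2 * u2**2) / (np.mean(u1**2) * np.mean(u2**2) + np.mean(u1 * u2)**2)
    return {"kurtosis(u1)": kurt(u1),
            "kurtosis(u2)": kurt(u2),
            "E(u1^4)/E^2(u1^2)": np.mean(u1**4) / np.mean(u1**2)**2,
            "E(u2^4)/E^2(u2^2)": np.mean(u2**4) / np.mean(u2**2)**2,
            "cross ratio": cross,
            "cosine": cosine}

rng = np.random.default_rng(4)
u1 = np.floor(rng.pareto(1.5, 65536))   # hypothetical heavy-tailed counts
u2 = np.floor(rng.pareto(1.5, 65536))
for name, value in summary_stats(u1, u2).items():
    print(f"{name:20s} {value:12.3f}")
```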

We first test random projections on the original (unweighted, heavy-tailed) data, for s = 1, 3, 256 = √D, and 6000 ≈ D/log D, presented in Figure 1. We then apply square root weighting and logarithmic weighting before random projections. The results are presented in Figure 2, for s = 256 and s = 6000.

These results are consistent with what we would expect:

• When s is small, i.e., O(1), sparse random projections perform very similarly to conventional random projections, as shown in panels (a) and (b) of Figure 1.
• With increasing s, the variances of sparse random projections increase. With s = D/log D, the errors are large (but not disastrous), because the data are heavy-tailed.
• With s = √D, sparse random projections are robust. Since cos(∠(u1, u2)) ≈ 0.7-0.8 in this case, marginal information can improve the estimation accuracy quite substantially. The asymptotic variances of â_MLE match the empirical variances of the asymptotic MLE estimator quite well, even for s = √D.
• After applying term weighting to the original data, sparse random projections are almost as accurate as conventional random projections, even for s ≈ D/log D, as shown in Figure 2.
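To reproduce the flavor of these experiments, here is a minimal simulation sketch, assuming NumPy. The data vectors are synthetic stand-ins, the sizes are scaled down so the loop runs quickly, and only the margin-free estimator â_MF = v₁ᵀv₂ is simulated (not the MLE that uses margins).

```python
import numpy as np

rng = np.random.default_rng(5)
D, k, trials = 4096, 50, 100        # scaled-down sizes (assumption)
u1 = np.floor(rng.pareto(1.5, D))   # stand-ins for the heavy-tailed word-count vectors
u2 = np.floor(rng.pareto(1.5, D))
a = u1 @ u2                         # the target inner product

for s in [1.0, 3.0, np.sqrt(D), D / np.log(D)]:
    est = np.empty(trials)
    for t in range(trials):
        R = rng.choice([np.sqrt(s), 0.0, -np.sqrt(s)], size=(D, k),
                       p=[1 / (2 * s), 1 - 1 / s, 1 / (2 * s)])
        v1, v2 = R.T @ u1 / np.sqrt(k), R.T @ u2 / np.sqrt(k)
        est[t] = v1 @ v2            # margin-free estimator of a
    print(f"s = {s:7.1f}   normalized std error = {est.std() / a:.3f}")
```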

7. CONCLUSION
We provide some new theoretical results on random projections, a randomized approximate algorithm widely used in machine learning and data mining. In particular, our theoretical results suggest that we can achieve a significant √D-fold speedup in processing time with little loss in accuracy, where D is the original data dimension.

Figure 1: Two words, THIS (u1) and HAVE (u2), from the MSN Web crawl data are tested, with D = 2^16. Sparse random projections are applied to estimate a = u₁ᵀu₂, with four values of s: 1, 3, 256 = √D, and 6000 ≈ D/log D, in panels (a), (b), (c), and (d), respectively, presented in terms of the normalized standard error, √Var(â)/a. 10^4 simulations are conducted for each k, ranging from 10 to 100. Each panel plots the standard errors against k and contains five curves. The two labeled "MF" and "Theor. MF" overlap: "MF" stands for the empirical variance of the margin-free estimator â_MF, while "Theor. MF" is the theoretical variance of â_MF, i.e., (27). The solid curve, labeled "MLE", presents the empirical variance of â_MLE, the estimator using margins as formulated in Lemma 7. There are two curves both labeled "Theor.", for the asymptotic theoretical variances of â_MF (the higher curve, (28)) and â_MLE (the lower curve, (30)).

When the data are free of outliers (e.g., after careful term weighting), a cost reduction by a factor of D/log D is also possible.

Our proof of the asymptotic normality justifies the use of an asymptotic maximum likelihood estimator for improving the estimates when the marginal information is available.

8. ACKNOWLEDGMENT
We thank Dimitris Achlioptas for very insightful comments. We thank Xavier Gabaix and David Mason for pointers to useful references. Ping Li thanks Tze Leung Lai, Joseph P. Romano, and Yiyuan She for enjoyable and helpful conversations. Finally, we thank the four anonymous reviewers for constructive suggestions.

Figure 2: After applying term weighting on the original data, sparse random projections are almost as accurate as conventional random projections, even for s = 6000 ≈ D/log D. Panels: (a) square root weighting, s = 256; (b) logarithmic weighting, s = 256; (c) square root weighting, s = 6000; (d) logarithmic weighting, s = 6000. The legends are the same as in Figure 1.

9. REFERENCES
[1] Dimitris Achlioptas. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4):671-687, 2003.

[2] Dimitris Achlioptas, Frank McSherry, and Bernhard Schölkopf. Sampling techniques for kernel methods. In Proc. of NIPS, pages 335-342, Vancouver, BC, Canada, 2001.
[3] Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the frequency moments. In Proc. of STOC, pages 20-29, Philadelphia, PA, 1996.
[4] Rosa Arriaga and Santosh Vempala. An algorithmic theory of learning: Robust concepts and random projection. In Proc. of FOCS (also to appear in Machine Learning), pages 616-623, New York, 1999.
[5] Ella Bingham and Heikki Mannila. Random projection in dimensionality reduction: Applications to image and text data. In Proc. of KDD, pages 245-250, San Francisco, CA, 2001.
[6] Jeremy Buhler and Martin Tompa. Finding motifs using random projections. Journal of Computational Biology, 9(2):225-242, 2002.
[7] Moses S. Charikar. Similarity estimation techniques from rounding algorithms. In Proc. of STOC, pages 380-388, Montreal, Quebec, Canada, 2002.
[8] G. P. Chistyakov and F. Götze. Limit distributions of studentized means. The Annals of Probability, 32(1A):28-77, 2004.
[9] Sanjoy Dasgupta and Anupam Gupta. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Structures and Algorithms, 22(1):60-65, 2003.
[10] Susan T. Dumais. Improving the retrieval of information from external sources. Behavior Research Methods, Instruments and Computers, 23(2):229-236, 1991.
[11] Richard Durrett. Probability: Theory and Examples. Duxbury Press, Belmont, CA, second edition, 1995.
[12] William Feller. An Introduction to Probability Theory and Its Applications (Volume II). John Wiley & Sons, New York, NY, second edition, 1971.
[13] Xiaoli Zhang Fern and Carla E. Brodley. Random projection for high dimensional data clustering: A cluster ensemble approach. In Proc. of ICML, pages 186-193, Washington, DC, 2003.
[14] Dmitriy Fradkin and David Madigan. Experiments with random projections for machine learning. In Proc. of KDD, pages 517-522, Washington, DC, 2003.
[15] P. Frankl and H. Maehara. The Johnson-Lindenstrauss lemma and the sphericity of some graphs. Journal of Combinatorial Theory A, 44(3):355-362, 1987.
[16] Navin Goel, George Bebis, and Ara Nefian. Face recognition experiments with random projection. In Proc. of SPIE, pages 426-437, Bellingham, WA, 2005.
[17] Michel X. Goemans and David P. Williamson. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of ACM, 42(6):1115-1145, 1995.
[18] F. Götze. On the rate of convergence in the multivariate CLT. The Annals of Probability, 19(2):724-739, 1991.
[19] Piotr Indyk. Stable distributions, pseudorandom generators, embeddings and data stream computation. In Proc. of FOCS, pages 189-197, Redondo Beach, CA, 2000.
[20] Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proc. of STOC, pages 604-613, Dallas, TX, 1998.
[21] W. B. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26:189-206, 1984.
[22] Samuel Kaski. Dimensionality reduction by random mapping: Fast similarity computation for clustering. In Proc. of IJCNN, pages 413-418, Piscataway, NJ, 1998.
[23] Man Lan, Chew Lim Tan, Hwee-Boon Low, and Sam Yuan Sung. A comprehensive comparative study on term weighting schemes for text categorization with support vector machines. In Proc. of WWW, pages 1032-1033, Chiba, Japan, 2005.
[24] Erich L. Lehmann and George Casella. Theory of Point Estimation. Springer, New York, NY, second edition, 1998.
[25] Will E. Leland, Murad S. Taqqu, Walter Willinger, and Daniel V. Wilson. On the self-similar nature of Ethernet traffic. IEEE/ACM Trans. Networking, 2(1):1-15, 1994.
[26] Edda Leopold and Jörg Kindermann. Text categorization with support vector machines: How to represent texts in input space? Machine Learning, 46(1-3):423-444, 2002.
[27] Henry C.M. Leung, Francis Y.L. Chin, S.M. Yiu, Roni Rosenfeld, and W.W. Tsang. Finding motifs with insufficient number of strong binding sites. Journal of Computational Biology, 12(6):686-701, 2005.
[28] Ping Li, Trevor J. Hastie, and Kenneth W. Church. Improving random projections using marginal information. In Proc. of COLT, Pittsburgh, PA, 2006.
[29] Jessica Lin and Dimitrios Gunopulos. Dimensionality reduction by random projection and latent semantic indexing. In Proc. of SDM, San Francisco, CA, 2003.
[30] Bing Liu, Yiming Ma, and Philip S. Yu. Discovering unexpected information from your competitors' web sites. In Proc. of KDD, pages 144-153, San Francisco, CA, 2001.
[31] Kun Liu, Hillol Kargupta, and Jessica Ryan. Random projection-based multiplicative data perturbation for privacy preserving distributed data mining. IEEE Transactions on Knowledge and Data Engineering, 18(1):92-106, 2006.
[32] B. F. Logan, C. L. Mallows, S. O. Rice, and L. A. Shepp. Limit distributions of self-normalized sums. The Annals of Probability, 1(5):788-809, 1973.
[33] Chris D. Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge, MA, 1999.
[34] M. E. J. Newman. Power laws, Pareto distributions and Zipf's law. Contemporary Physics, 46(5):323-351, 2005.
[35] Christos H. Papadimitriou, Prabhakar Raghavan, Hisao Tamaki, and Santosh Vempala. Latent semantic indexing: A probabilistic analysis. In Proc. of PODS, pages 159-168, Seattle, WA, 1998.
[36] Deepak Ravichandran, Patrick Pantel, and Eduard Hovy. Randomized algorithms and NLP: Using locality sensitive hash function for high speed noun clustering. In Proc. of ACL, pages 622-629, Ann Arbor, MI, 2005.
[37] Jason D. Rennie, Lawrence Shih, Jaime Teevan, and David R. Karger. Tackling the poor assumptions of naive Bayes text classifiers. In Proc. of ICML, pages 616-623, Washington, DC, 2003.
[38] Ozgur D. Sahin, Aziz Gulbeden, Fatih Emekci, Divyakant Agrawal, and Amr El Abbadi. Prism: Indexing multi-dimensional data in P2P networks using reference vectors. In Proc. of ACM Multimedia, pages 946-955, Singapore, 2005.
[39] Gerard Salton and Chris Buckley. Term-weighting approaches in automatic text retrieval. Inf. Process. Manage., 24(5):513-523, 1988.
[40] I. S. Shiganov. Refinement of the upper bound of the constant in the central limit theorem. Journal of Mathematical Sciences, 35(3):2545-2550, 1986.
[41] Chunqiang Tang, Sandhya Dwarkadas, and Zhichen Xu. On scaling latent semantic indexing for large peer-to-peer systems. In Proc. of SIGIR, pages 112-121, Sheffield, UK, 2004.
[42] Santosh Vempala. Random projection: A new approach to VLSI layout. In Proc. of FOCS, pages 389-395, Palo Alto, CA, 1998.
[43] Santosh Vempala. The Random Projection Method. American Mathematical Society, Providence, RI, 2004.
[44] William N. Venables and Brian D. Ripley. Modern Applied Statistics with S. Springer-Verlag, New York, NY, fourth edition, 2002.
[45] Clement T. Yu, K. Lam, and Gerard Salton. Term weighting in information retrieval using the term precision model. Journal of ACM, 29(1):152-170, 1982.

APPENDIX
A. PROOFS

Let {u_i}_{i=1}^n denote the rows of the data matrix A ∈ R^{n×D}. A projection matrix R ∈ R^{D×k} consists of i.i.d. entries r_{ji} with

    Pr(r_{ji} = √s) = Pr(r_{ji} = −√s) = 1/(2s),    Pr(r_{ji} = 0) = 1 − 1/s,
    E(r_{ji}) = 0,   E(r²_{ji}) = 1,   E(r⁴_{ji}) = s,   E(|r_{ji}|³) = √s,
    E(r_{ji} r_{j'i'}) = 0,   E(r²_{ji} r_{j'i'}) = 0,   whenever i ≠ i' or j ≠ j'.

We denote the projected data vectors by v_i = (1/√k) Rᵀ u_i. For convenience, we denote

    m₁ = ‖u₁‖² = Σ_{j=1}^D u²_{1,j},    m₂ = ‖u₂‖² = Σ_{j=1}^D u²_{2,j},
    a = u₁ᵀu₂ = Σ_{j=1}^D u_{1,j} u_{2,j},    d = ‖u₁ − u₂‖² = m₁ + m₂ − 2a.

We will always assume

    s = o(D),   E(u⁴_{1,j}) < ∞,   E(u⁴_{2,j}) < ∞,   ( E(u²_{1,j} u²_{2,j}) < ∞ ).

By the strong law of large numbers,

    (1/D) Σ_{j=1}^D u^I_{1,j} → E(u^I_{1,j}),
    (1/D) Σ_{j=1}^D (u_{1,j} − u_{2,j})^I → E( (u_{1,j} − u_{2,j})^I ),
    (1/D) Σ_{j=1}^D (u_{1,j} u_{2,j})^J → E( (u_{1,j} u_{2,j})^J ),   a.s.,   I = 2, 4,  J = 1, 2.

A.1 Moments
The following expansions are useful for proving the next three lemmas:

    m₁ m₂ = ( Σ_{j=1}^D u²_{1,j} ) ( Σ_{j=1}^D u²_{2,j} ) = Σ_{j=1}^D u²_{1,j} u²_{2,j} + Σ_{j≠j'} u²_{1,j} u²_{2,j'},

    m₁² = ( Σ_{j=1}^D u²_{1,j} )² = Σ_{j=1}^D u⁴_{1,j} + 2 Σ_{j<j'} u²_{1,j} u²_{1,j'}.

In particular,

    E(v²_{1,i} v²_{2,i}) = (1/k²) ( s Σ_{j=1}^D u²_{1,j} u²_{2,j} + 4 Σ_{j<j'} u_{1,j} u_{2,j} u_{1,j'} u_{2,j'} + Σ_{j≠j'} u²_{1,j} u²_{2,j'} ).

For the asymptotic normality of v_{1,i} (Lemma 4), let z_j = u_{1,j} r_{ji}/√k, so that v_{1,i} = Σ_{j=1}^D z_j, and let s²_D = Σ_{j=1}^D Var(z_j) = Σ_{j=1}^D u²_{1,j}/k = m₁/k. Assume the Lindeberg condition

    (1/s²_D) Σ_{j=1}^D E( z²_j ; |z_j| ≥ ε s_D ) → 0,   for any ε > 0.

Then

    Σ_{j=1}^D z_j / s_D = v_{1,i}/√(m₁/k) ⇒ N(0, 1)   (convergence in distribution),

which immediately leads to

    v²_{1,i}/(m₁/k) ⇒ χ²₁,    ‖v₁‖²/(m₁/k) = Σ_{i=1}^k v²_{1,i}/(m₁/k) ⇒ χ²_k.

We need to go back and check the Lindeberg condition:

    (1/s²_D) Σ_{j=1}^D E( z²_j ; |z_j| ≥ ε s_D )
    ≤ (1/s²_D) Σ_{j=1}^D E( |z_j|^{2+δ} ) / (ε s_D)^δ
    = (s/D)^{δ/2} (1/ε^δ) [ Σ_{j=1}^D |u_{1,j}|^{2+δ}/D ] / [ Σ_{j=1}^D u²_{1,j}/D ]^{(2+δ)/2}
    ≈ (o(D)/D)^{δ/2} (1/ε^δ) E|u_{1,j}|^{2+δ} / ( E(u²_{1,j}) )^{(2+δ)/2} → 0,

provided E|u_{1,j}|^{2+δ} < ∞ for some δ > 0, which is much weaker than our assumption that E(u⁴_{1,j}) < ∞.

It remains to show the rate of convergence using the Berry-Esseen theorem⁶. Let Δ_D = Σ_{j=1}^D E|z_j|³ = (s^{1/2}/k^{3/2}) Σ_{j=1}^D |u_{1,j}|³. Then

    |F_{v_{1,i}}(y) − Φ(y)| ≤ 0.8 Δ_D / s³_D = 0.8 √s Σ_{j=1}^D |u_{1,j}|³ / m₁^{3/2} ≈ 0.8 √(s/D) E|u_{1,j}|³ / ( E(u²_{1,j}) )^{3/2} → 0.

⁶ The best Berry-Esseen constant, 0.7915 (≈ 0.8), is from [40].
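A quick simulation illustrating the two limits above, v_{1,i}/√(m₁/k) ≈ N(0, 1) and ‖v₁‖²/(m₁/k) ≈ χ²_k. This is a sketch assuming NumPy; the vector, the sparsity s, and the sample sizes are arbitrary choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(7)
D, k, trials = 2000, 10, 2000
s = np.sqrt(D)
u1 = rng.standard_normal(D)
m1 = np.sum(u1**2)

z, chisq = [], []
for _ in range(trials):
    R = rng.choice([np.sqrt(s), 0.0, -np.sqrt(s)], size=(D, k),
                   p=[1 / (2 * s), 1 - 1 / s, 1 / (2 * s)])
    v1 = R.T @ u1 / np.sqrt(k)
    z.append(v1[0] / np.sqrt(m1 / k))        # should look like N(0, 1)
    chisq.append(np.sum(v1**2) / (m1 / k))   # should look like chi-square with k dof

print("z     : mean ~ 0, var ~ 1  ->", np.mean(z), np.var(z))
print("chisq : mean ~ k, var ~ 2k ->", np.mean(chisq), np.var(chisq))
```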

Lemma 5. As D → ∞,

    (v_{1,i} − v_{2,i}) / √(d/k) ⇒ N(0, 1),    ‖v₁ − v₂‖² / (d/k) ⇒ χ²_k,

with the rate of convergence

    |F_{v_{1,i} − v_{2,i}}(y) − Φ(y)| ≤ 0.8 √s Σ_{j=1}^D |u_{1,j} − u_{2,j}|³ / d^{3/2} ≈ 0.8 √(s/D) E|u_{1,j} − u_{2,j}|³ / ( E(u_{1,j} − u_{2,j})² )^{3/2} → 0.

Proof of Lemma 5. The proof is analogous to the proof of Lemma 4.

    The next lemma concerns the joint distribution of (v1,i, v2,i).

Lemma 6. As D → ∞,

    Σ^{-1/2} (v_{1,i}, v_{2,i})ᵀ ⇒ N( (0, 0)ᵀ, [ 1 0 ; 0 1 ] ),    where  Σ = (1/k) [ m₁ a ; a m₂ ],

and

    Pr( sign(v_{1,i}) = sign(v_{2,i}) ) → 1 − θ/π,    θ = cos⁻¹( a / √(m₁ m₂) ).

Proof of Lemma 6. We have seen that Var(v_{1,i}) = m₁/k, Var(v_{2,i}) = m₂/k, and E(v_{1,i} v_{2,i}) = a/k, i.e.,

    Cov( (v_{1,i}, v_{2,i})ᵀ ) = (1/k) [ m₁ a ; a m₂ ] = Σ.

The Lindeberg multivariate central limit theorem [18] says

    Σ^{-1/2} (v_{1,i}, v_{2,i})ᵀ ⇒ N( (0, 0)ᵀ, [ 1 0 ; 0 1 ] ).

The multivariate Lindeberg condition is automatically satisfied by assuming bounded third moments of u_{1,j} and u_{2,j}. A trivial consequence of the asymptotic normality yields

    Pr( sign(v_{1,i}) = sign(v_{2,i}) ) → 1 − θ/π.

Strictly speaking, we should write θ = cos⁻¹( E(u_{1,j} u_{2,j}) / √( E(u²_{1,j}) E(u²_{2,j}) ) ).

A.3 An Asymptotic MLE Using Margins

Lemma 7. Assuming that the margins m₁ and m₂ are known, and using the asymptotic normality of (v_{1,i}, v_{2,i}), we can derive an asymptotic maximum likelihood estimator (MLE), which is the solution to a cubic equation:

    a³ − a² (v₁ᵀv₂) + a ( −m₁m₂ + m₁‖v₂‖² + m₂‖v₁‖² ) − m₁m₂ (v₁ᵀv₂) = 0.

Denoting this solution by â_MLE, the asymptotic variance of the estimator is

    Var(â_MLE) = (1/k) (m₁m₂ − a²)² / (m₁m₂ + a²).

Proof of Lemma 7. For notational convenience, we treat (v_{1,i}, v_{2,i}) as exactly normally distributed so that we do not need to keep track of the convergence notation.

The likelihood function of {v_{1,i}, v_{2,i}}_{i=1}^k is then

    lik( {v_{1,i}, v_{2,i}}_{i=1}^k ) = (2π)^{-k} |Σ|^{-k/2} exp( −(1/2) Σ_{i=1}^k (v_{1,i}, v_{2,i}) Σ^{-1} (v_{1,i}, v_{2,i})ᵀ ),

where Σ = (1/k) [ m₁ a ; a m₂ ].

We can then express the log-likelihood function l(a), up to additive constants, as

    l(a) = −(k/2) log( m₁m₂ − a² ) − (k/2) (1/(m₁m₂ − a²)) Σ_{i=1}^k ( v²_{1,i} m₂ − 2 v_{1,i} v_{2,i} a + v²_{2,i} m₁ ).

The MLE is the solution to l'(a) = 0, which is

    a³ − a² (v₁ᵀv₂) + a ( −m₁m₂ + m₁‖v₂‖² + m₂‖v₁‖² ) − m₁m₂ (v₁ᵀv₂) = 0.

Large-sample theory [24, Theorem 6.3.10] says that â_MLE is asymptotically unbiased and converges in distribution to a normal random variable N(a, 1/I(a)), where I(a), the expected Fisher information, is

    I(a) = −E( l''(a) ) = k (m₁m₂ + a²) / (m₁m₂ − a²)²,

after some algebra. Therefore, the asymptotic variance of â_MLE is

    Var(â_MLE) = (1/k) (m₁m₂ − a²)² / (m₁m₂ + a²).    (32)
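A sketch, assuming NumPy with synthetic data, of the margin-assisted estimator from Lemma 7: solve the cubic for â_MLE and compare it with the margin-free estimate v₁ᵀv₂. The root-selection rule below is a simple heuristic introduced for the example, not a rule stated in the paper.

```python
import numpy as np

def a_mle(v1, v2, m1, m2):
    """Solve the cubic of Lemma 7 for the asymptotic MLE of a = u1^T u2,
    given the projected vectors and the known margins m1, m2."""
    c = v1 @ v2
    coeffs = [1.0, -c, -m1 * m2 + m1 * (v2 @ v2) + m2 * (v1 @ v1), -m1 * m2 * c]
    roots = np.roots(coeffs)
    real = roots[np.abs(roots.imag) < 1e-8].real
    # Heuristic (assumption): among the real roots, pick the one closest to the
    # margin-free estimate v1^T v2.
    return real[np.argmin(np.abs(real - c))]

rng = np.random.default_rng(9)
D, k = 10000, 50
u1 = rng.standard_normal(D)
u2 = 0.5 * u1 + rng.standard_normal(D)
m1, m2, a = u1 @ u1, u2 @ u2, u1 @ u2

R = rng.standard_normal((D, k))            # conventional projections, for simplicity
v1, v2 = R.T @ u1 / np.sqrt(k), R.T @ u2 / np.sqrt(k)
print("margin-free:", v1 @ v2, "  MLE:", a_mle(v1, v2, m1, m2), "  exact:", a)
```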

B. HEAVY-TAILED DATA
We illustrate that very sparse random projections are fairly robust against heavy-tailed data, using a Pareto distribution.

The assumption of finite moments has simplified the analysis of convergence a great deal. For example, assuming a finite (δ + 2)-th moment, 0 < δ ≤ 2, and s = o(D), we have

    s^{δ/2} Σ_{j=1}^D |u_{1,j}|^{2+δ} / ( Σ_{j=1}^D u²_{1,j} )^{1+δ/2}
    = (s/D)^{δ/2} [ Σ_{j=1}^D |u_{1,j}|^{2+δ}/D ] / [ Σ_{j=1}^D u²_{1,j}/D ]^{1+δ/2}
    → (s/D)^{δ/2} E( |u_{1,j}|^{2+δ} ) / ( E(u²_{1,j}) )^{1+δ/2} → 0.    (33)

Note that δ = 2 corresponds to the rate of convergence for the variance in Lemma 1, and δ = 1 corresponds to the rate of convergence for asymptotic normality in Lemma 4. From the proof of Lemma 4 in Appendix A, we can see that the convergence of (33) (to zero) with any δ > 0 suffices for achieving asymptotic normality.

For heavy-tailed data, the fourth moment (or even the second moment) may not exist. The most common model for heavy-tailed data is the Pareto distribution, with density function f(x; α) = α/x^{α+1} for x ≥ 1, whose m-th moment is α/(α − m), defined only if α > m. Measurements of α for many types of data are available in [34]. For example, α = 1.2 for word frequency, α = 2.04 for citations to papers, α = 2.51 for copies of books sold in the US, etc.

For simplicity, we assume that 2 < α ≤ 2 + δ ≤ 4. Under this assumption, the asymptotic normality is guaranteed and it remains to show the rate of convergence of moments and distributions. In this case, the second moment E(u²_{1,j}) exists. The sum Σ_{j=1}^D |u_{1,j}|^{2+δ} grows as O( D^{(2+δ)/α} ), as shown in [11, Example 2.7.4]. Thus, we can write

    s^{δ/2} Σ_{j=1}^D |u_{1,j}|^{2+δ} / ( Σ_{j=1}^D u²_{1,j} )^{1+δ/2} = O( s^{δ/2} / D^{1+δ/2−(2+δ)/α} ).