



Information Retrieval and Clustering (pp. { - {)
W. Wu, H. Xiong and S. Shekhar (Eds.)
2002 Kluwer Academic Publishers

Clustering in Metric Spaces with Applications to Information Retrieval

Ricardo Baeza-Yates
Benjamín Bustos
Center for Web Research, Dept. of Computer Science
Universidad de Chile, Blanco Encalada 2120, Santiago, Chile
E-mail: {rbaeza,bebustos}@dcc.uchile.cl

Edgar Chávez
Universidad Michoacana, Morelia, México
E-mail: elchavez@fismat.umich.mx

Norma Herrera
Univ. Nacional de San Luis, San Luis, Argentina
E-mail: nherrera@unsl.edu.ar

Gonzalo Navarro
Center for Web Research, Dept. of Computer Science
Universidad de Chile, Blanco Encalada 2120, Santiago, Chile
E-mail: gnavarro@dcc.uchile.cl



Contents

1 Introduction
2 Our Clustering Method
  2.1 Clustering in Metric Spaces
  2.2 Mutual k-Nearest Neighbor Graph Clustering Algorithm
    2.2.1 The Clustering Algorithm
    2.2.2 Connectivity Properties
  2.3 The Range r Graph
    2.3.1 Outliers, Equivalence Classes and Stability
  2.4 Radius vs. Neighbors
  2.5 The Connectivity Parameters
  2.6 Intrinsic Dimension
3 Morphological Stemming and the Holomorphic Distance
  3.1 Motivation
  3.2 The Holomorphic Transformation
  3.3 A Morphological Stemmer Using Clustering
4 Clustering for Approximate Proximity Search
  4.1 The Vector Model for Information Retrieval
  4.2 Techniques for Approximate Proximity Searching
  4.3 Experimental Results
5 Clustering for Metric Index Boosting
  5.1 GNATs
  5.2 Experimental Analysis
References



1 Introduction

The concept of cluster is elusive. The direct implication of this is a large diversity of clustering techniques, each serving a particular definition. Most techniques focus on a global optimization function. The general procedure is to propose a clustering (using a suitable algorithm), then to measure the quality and number of clusters, and to repeat the procedure (proposing a new clustering structure, using for example new parameters) until satisfied.

This setup is satisfactory for many applications. Traditional clustering has been used in Information Retrieval (IR) for many different purposes, such as query expansion, document categorization and classification, relevance feedback, visualization of results, and so on. However, there are a number of applications in IR where new clustering methods would be useful.

In this chapter we propose a non-traditional clustering method, aimed at clustering data from a general metric space. This means that all the information we can use for discovering the clustering structure is the matrix of distances between all the data pairs in the set. Moreover, we cannot create new data points such as centroids. This is in contrast with traditional clustering methods (such as the k-means technique), which operate on coordinate spaces and may require creating new data points. As a result, our technique is more general and can be used in applications where traditional clustering methods cannot be applied.

Our approach is aimed at clustering multimedia data. By this we mean data with no coordinates or dimensional information whatsoever. Several clustering algorithms exist for this scenario, for example graph theoretical methods such as Zahn's algorithm [Zha71]. However, in several IR applications we need to go further. First, we may need to obtain the cluster of a given element (which will be regarded as its "equivalence class") without having to compute the clusters for all the data. In applications such as stemming and conflation we just want to obtain the class of a given word, which does not depend on the surrounding data. Even more difficult, we may want to compute the class of a word not in the data set (think, for example, of a misspelled word). A second requirement is the need to compute a cluster hierarchy by scaling a parameter. This hierarchy must start with the individual elements and end (possibly) with a single equivalence class covering the entire sample.

A third and very important requirement is that of having a robust clustering method. Data samples are often contaminated with spurious data objects, that is, objects not following the same distribution as the rest. These are called outliers.


Deciding that an object is an outlier is rather risky¹. We require a robust method that assigns an equivalence class to each element. Querying the system with any object will produce its equivalence class, and the result must be the same for any object of the class used as a query. If the equivalence class of an object is the object itself, then we will call it an outlier.

¹ There are a number of horror stories about automatic data cleaning (outlier removal) showing that, under an unknown distribution, it is better to be cautious. An example is the ozone layer measurements, where the computer filtered out large holes as outliers, but the holes were actually there.

In the next section we present our clustering technique for metric spaces. Then we show three different applications of clustering in IR, all of them under the metric space model.

In Section 3 we apply the clustering method to the problem of conflating words that share a common root, which is a relevant task in IR. The standard solution for stemming is a rule-based approach inherited from Porter's algorithm. This type of solution closely follows the possible variations of a single stem into all its morphological instances. The technique suffers from several drawbacks, in particular its inability to cope with noisy data. For example, when a word is misspelled, finding its correct stem by following rules is at least risky. A single edit error in a word may trigger the wrong rule and cause a mistake in the conflation. We propose a more robust solution based on clustering in metric spaces.

In Section 4 we use a simplification of the clustering technique to build a data structure (index) on a space of text documents. This data structure permits us to search for documents similar to a given one or to a text query, under the classical cosine similarity measure used in IR. The search technique exploits the clustering properties of the index structure by scanning the most promising clusters first. The approach is an approximation: with a very low probability of missing relevant documents, we can answer queries by making only a fraction of the comparisons (distance computations) needed by a deterministic algorithm.

Finally, Section 5 resorts to a more general form of clustering: the data is partitioned according to an arbitrary property, not only spatial closeness. This is used to partition a data set into subsets that behave differently from each other in terms of their distances to the rest. The goal is again to build a faster index using information on the cluster structure of the data. Here the aim is to improve a deterministic algorithm by fine-tuning it for each discovered cluster. In the example presented, different index parameters are used on each partition, each best suited to the local properties of its group.


In a general perspective these indexes may have different local parameterizations of a single scheme, or even be completely different approaches, such as an approximate approach on one partition and an exact approach on another. We demonstrate the effectiveness of the approach on an application that finds the correct versions of misspelled words.

2 Our Clustering Method

Cluster and outlier detection are aimed at producing a model of the data. The primary objective of an application is to partition the data into a meaningful collection of sets. A partition, in the mathematical sense, is a collection of non-intersecting subsets whose union is the whole set. If the clustering defines a partition, then we have a crisp clustering. If the subsets have non-empty intersections, then we have a fuzzy clustering. In this chapter we are interested in crisp clusters.

The data model obtained from cluster detection may be very sensitive to outliers. The outliers can be considered as "rare" data, not following the overall tendencies of the group. In order to design a robust clustering algorithm we need to properly define what a cluster is. This definition should follow some intuition and be computable in a reasonable amount of time. In addition, the clustering should be able to give the cluster of a given element without clustering all the data, and it should be able to detect outliers. Two additional restrictions are important: the clustering should be carried out using only the distances between objects, and no new objects should be created. These restrictions are a fair way to mask the application level. All the domain knowledge will be encapsulated in the distance computation. The resulting clustering technique will be suitable for document retrieval, for browsing images, or for handling multimedia data in general.

Cluster and outlier detection is a classic problem of non-parametric statistics. Cluster analysis is the detection of groups that reduce the inter-group similarity and increase the intra-group similarity. In this section we explore several graph-based approaches to cluster analysis. The strategy is to define the clustering property and then to identify a random graph capturing it. This generic algorithm will satisfy our restrictions and hence will be suitable for metric spaces.

The ultimate goal of any clustering algorithm is to postulate a model of the data. Different applications define different models and hence different objective functions, each of which will be optimized by a different clustering. This optimization goal function may be implicit or explicit.


For exact indexing purposes (such as the application of Section 5), the implicit goal function is the number of distance computations performed to solve a proximity query. A good clustering will produce a partition that minimizes the number of distance computations performed at search time. For approximate searching (as described in Section 4), the goal is to optimize the tradeoff between accuracy of the result and number of distance evaluations. For classification purposes (described in Section 3) the goal is just the accuracy of the classification.

Most clustering methods considered in the statistical literature and in statistical software, namely partitioning methods such as k-means and k-medians [JD88], and agglomerative methods such as dendrograms [Har75], share a "two-stage" nature. First, by assigning different values to some parameter (for instance, k in k-means clustering), the algorithm produces a family of clustering structures, that is, a family of partitions of the original data into relatively homogeneous disjoint blocks. Second, the user must resort to some validation criterion, such as those studied in [MC85], or to some exploratory analysis of the results, in order to decide which is the "right" clustering structure. It is desirable to have clustering procedures that are "automatic", in the sense of providing the user at once with the number and identity of the clusters present in the data. Such an automatic procedure will be appealing only if we find a way to set apart those objects not belonging to any group. This feature is crucial for data cleaning applications.

On the other hand, most classical clustering algorithms have quadratic or even higher complexity, because they must measure global properties involving every distance pair. Clustering is an inverse problem, just like interpolation. This means that it is not well defined in the mathematical sense unless we add additional constraints, called "regularity assumptions", which distinguish between good and bad solutions. A classic example of regularity conditions is the problem of fitting a curve to a set of discrete points, or of interpolating the data. An infinite number of solutions exist for this problem, but we select a single smooth curve as the solution. In detecting the cluster structure of multivariate data we must make some kind of regularity assumption. Most of the algorithms proposed in the literature are based on heuristic regularity assumptions about the nature of the clusters and their shape. The user iterates several times, accepting some hypotheses and rejecting others, and each assumption leads to a different clustering of the data. A difficult problem remains: naming the clusters, particularly in IR.


2.1 Clustering in Metric Spaces

We will focus on applications where the objects and their similarity define a metric space. This is formalized as a universe of valid objects X and a distance function d that satisfies the triangle inequality and is typically expensive to compute. We manage a finite data set U ⊆ X of size n. The problem of clustering can then be defined as partitioning U into sets such that the intra-cluster distances are "small" and the inter-cluster distances are "large". In several applications, clustering is used as a tool for answering proximity queries of the form (q, r), where q ∈ X and r ∈ R+. This query defines a query ball (q, r) = {x ∈ X, d(q, x) ≤ r}, and the goal is to retrieve the elements of the database that are inside the query ball, that is, (q, r) ∩ U. Other types of queries are possible, but we will focus on this simplest case. The aim of metric space searching is to preprocess the database U so as to build an index that answers queries (q, r) performing as few distance computations as possible [CNBYM01]. We will show that clustering U yields large improvements on the searching performance of the indexes.

In cluster analysis one often assumes that the data set comes from a d-dimensional real vector space R^d. Some algorithms, like k-means, make heavy use of the coordinate information, assuming, for example, that new centroids can be created. If neither the coordinates of the data nor explicit object creation is used, then the algorithm is likely to be extensible to metric data sets. Further care must be taken if one wishes to extend a clustering algorithm to massive data sets, as the algorithm must scale in performance as the input data size grows. It is also important to note that the distance function can be very expensive to compute, and therefore the complexity analysis must account for this leading complexity term.

Recent work has been done on clustering metric spaces by first mapping the original metric space into a low dimensional vector space and then clustering the data using a traditional technique or an ad-hoc variation. This approach was tested in [GRG+99] using a generalization of the well known BIRCH clustering procedure [ZRL96], originally designed for vector spaces. The strategy is to map the metric space into a vector space using FastMap [FL95], and then to use BIRCH in the mapped space. To obtain a good clustering the mapping must be accurate. Unfortunately, FastMap is neither contractive nor expansive if the original space is not R^d with the Euclidean distance, hence the distortion in the distances is unbounded. The algorithm proposed in [GRG+99] is linear in time. Our clustering approaches make a single pass over the data set to discover each grouping, hence their total complexity is at most quadratic.
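Throughout the chapter, search cost is measured in distance evaluations. As a concrete illustration of the query ball (q, r) and of this cost measure (a minimal sketch of ours, not part of the original text; the edit distance metric, the wrapper class and the toy word set are illustrative choices), a brute-force range query can be written as:

    def edit_distance(a: str, b: str) -> int:
        """Levenshtein distance, an example of an expensive-to-compute metric d."""
        if not a:
            return len(b)
        if not b:
            return len(a)
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                      # deletion
                               cur[j - 1] + 1,                   # insertion
                               prev[j - 1] + (ca != cb)))        # substitution
            prev = cur
        return prev[-1]

    class CountingMetric:
        """Wraps d so the number of distance evaluations spent by a query can be reported."""
        def __init__(self, d):
            self.d, self.calls = d, 0
        def __call__(self, x, y):
            self.calls += 1
            return self.d(x, y)

    def range_query(U, d, q, r):
        """Brute-force evaluation of the query ball (q, r) = {x : d(q, x) <= r} over U."""
        return [x for x in U if d(q, x) <= r]

    if __name__ == "__main__":
        U = ["house", "housing", "houses", "mouse", "horse", "apple"]
        d = CountingMetric(edit_distance)
        print(range_query(U, d, "hause", 2))
        print("distance evaluations:", d.calls)   # n for brute force; an index aims for fewer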


2.2 Mutual k-Nearest Neighbor Graph Clustering Algorithm

The problem of identifying the clustering structure is hard. We may focus instead on the converse problem, identifying the absence of clusters in a sample data set. If we have a small amount of sample data, we cannot decide whether they cluster together, since we have little implicit information about their relative proximity. So we begin by asking for a significant number of sample points. We believe that most readers will agree that if the data set is drawn from a uniform distribution, then the data set is cluster-free. This can be generalized to distributions bounded away from zero and from infinity, that is, almost-uniform distributions. If the data set is cluster-free, we say there is a single cluster formed by all the elements of the sample. If the data have a cluster structure, it is reasonable to expect the data to be almost-uniform inside each cluster. We use this observation to base our approach on testing the null hypothesis "the data is cluster free".

In [BCQY97], a technique is proposed to detect clusters and outliers based on a graph-theoretical approach. This technique has provable cluster-detection power under some regularity conditions: (a) the clusters to be detected correspond to different connected components in the support² of the sample distribution, and these connected components are at strictly positive distance from each other; (b) the sample distribution has a density bounded away from zero and infinity on each component of the support; and (c) the connected components of the support of the distribution are grid compatible³. We call a distribution with these properties a regular distribution.

² The support of a random variable x, with density function f(x), is the set S such that f(x) > 0. We strengthen this condition by assuming that f(x) is bounded away from 0.
³ See [BCQY97] for details; the essential idea is that the support accepts a discretization (it cannot be infinitely thin) and is completely contained in a discrete grid such that each grid cell has neighboring non-empty cells.

We will call clusters the connected components of the support. Since it is assumed that the connected components are at a strictly positive distance from each other, the algorithm can detect "crisp" clusters, as opposed to "fuzzy" clusters with overlapping support.

The technique just described postulates a graph over random points in R^d. The random graph has a controlling parameter τ for its density (e.g., the number of arcs for each node). Above a certain threshold the graph is connected with high probability. The threshold for τ depends in turn on the underlying point distribution, the dimension of the space and the number of points.


In short, we can always find a value of τ such that the graph will be connected with high probability if the sample points are drawn from a distribution that is bounded away from zero and from infinity, and will not be connected if the points come from a distribution vanishing asymptotically in the support. If we have a sample data set coming from a regular distribution, our goal will be to detect the clusters (or, equivalently, to detect the components of the support). Many graphs could be defined for clustering purposes under the above setup. In [BCQY97] the Mutual Nearest Neighbor graph, defined below, is used. In this chapter we also describe the range-r graph.

The controlling parameter τ of the random graph depends on the characteristics of the probability space. The general technique relies on the proper estimation of the connectivity threshold of τ in the random graph. In general, the estimation can be twofold: a theoretical bound and a practical recipe based on a Monte Carlo simulation. To find the threshold, the procedure is to generate particular instances of cluster-free datasets and observe the minimum value of τ such that the random graph is connected. For example, we may choose uniformly distributed points in R^2 and estimate the minimum value of the parameter τ that yields a single connected component. After estimating τ we will be able to detect clusters (i.e., connected components) in arbitrary data, provided the number of points and the dimension of the sample data are both similar to the estimated values.

The technique described above can be applied to data sets with the same topological characteristics. For example, the parameter τ for R^2 and n = 1,000 sample points may not be applied to data sampled from R^23 with n = 50,000 points. Hence the controlling parameter τ will be a function of the space topology.

After this digression, the central question is: how do we apply the procedure to general metric spaces? In this case we will need to find an invariant of the sampled data and, for clustering, the parameter τ must be applied to data sets sampled from a probability space with the same invariant. We will use as invariant the intrinsic dimension of the data, defined as the minimum d such that the data can be embedded in R^d without distortion. This approach can be used as long as the random graph can be computed using only the distances among the sample data. Note, however, that in most cases computing the distances between sample elements is very expensive; hence the algorithm performance is measured in the number of distance computations. Moreover, if the data is massive (gigabytes to terabytes of data) it is advisable to use a technique that can work in secondary memory, because disk accesses can be far more expensive than distance computations.


Since the above embedding may be hard to find (as hard as finding the clustering structure), one may resort to hierarchical clustering, moving the parameter τ to find the proper threshold. In this section we describe clustering algorithms based on the Mutual Nearest Neighbor graph and on the Range r graph, respectively. This clustering procedure is able to discover clusters at different resolutions, and has subquadratic complexity measured in terms of the number of distance computations. To achieve this complexity the algorithm relies on the use of an indexing algorithm. This indexing can be implemented using any indexing data structure. A better indexing algorithm will lead to faster clustering, without impact on the quality of the clusters obtained.

2.2.1 The Clustering Algorithm

From an algorithmic point of view the procedure is to find the k nearest neighbors of a seed point s, the set kNN(s) = {s1, s2, ..., sk}, and to include in the cluster of s those points s' ∈ kNN(s) having the reciprocal property s ∈ kNN(s'), and then to proceed iteratively until no more points can be added to the current cluster, restarting with a new seed point until no more seed points are left to visit. This algorithm is shown in Figure 1, with S being the set of points and k the number of neighbors considered. The output is a partition of the sample data, and each partition element is either a cluster or an outlier.

    MkNNCluster (k, s, S)
        let S' = { s' ∈ kNN(s) : s ∈ kNN(s') } = {s'_1, ..., s'_k'}
        return {s} ∪ MkNNCluster(k, s'_1, S) ∪ ... ∪ MkNNCluster(k, s'_k', S)

    Partition (k, S)
        let S = {s1, ..., sn}
        while |S| > 0
            C ← MkNNCluster(k, pick(S), S)
            output C
            S ← S − C
        endwhile

Figure 1: The MkNN clustering algorithm; k is the number of neighbors to consider.
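Below is a minimal Python sketch (ours, not part of the chapter) of the partitioning of Figure 1: it builds the mutual k-NN graph by brute force and outputs its connected components. The chapter assumes an index is used to find the k nearest neighbors, which is what makes the procedure subquadratic; the brute-force helper here is only for illustration.

    import math
    from typing import Callable, Hashable, List, Sequence, Set

    def mknn_partition(S: Sequence[Hashable], d: Callable, k: int) -> List[Set]:
        """Partition S into the connected components of the mutual k-NN graph."""
        # k nearest neighbors of each point (excluding the point itself), brute force.
        knn = {
            s: set(sorted((x for x in S if x is not s), key=lambda x: d(s, x))[:k])
            for s in S
        }
        # Mutual k-NN adjacency: edge (s, s') iff each is among the other's k nearest neighbors.
        adj = {s: {t for t in knn[s] if s in knn[t]} for s in S}

        remaining, clusters = set(S), []
        while remaining:
            seed = next(iter(remaining))          # pick(S)
            cluster, frontier = set(), [seed]     # grow one equivalence class
            while frontier:
                x = frontier.pop()
                if x in cluster:
                    continue
                cluster.add(x)
                frontier.extend(adj[x] - cluster)
            clusters.append(cluster)              # a cluster, or an outlier if very small
            remaining -= cluster
        return clusters

    if __name__ == "__main__":
        pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (50, 50)]
        print(mknn_partition(pts, math.dist, k=2))
        # three components: a cluster of 3 points, a cluster of 2, and an isolated outlier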


Calling the procedure Partition of Figure 1 will identify all the connected components of a graph defined implicitly, called the mutual k-nearest-neighbor graph. A technical drawback of this approach is finding the correct value for k. It is clear that, as k increases, the number of connected components decreases; this gives a natural way to define a hierarchical structure. Using brute force, finding the k nearest neighbors of each site in the data set takes O(n²) distance computations. In the next subsection we discuss in some detail how the above algorithm, and a family of algorithms based on similar assumptions, can be converted into a cluster detection procedure.

2.2.2 Connectivity Properties

The algorithm just described depends heavily on the parameter k, the number of neighbors to use. For k = 0 each partition element consists of a single point. If k = n − 1 then we surely have a single set in the partition. For any number of neighbors k in between we will have a finer or a coarser partition. We are interested in finding the "right" number of clusters, and we have as a parameter the number of neighbors k to use. If we cannot postulate any a priori knowledge about either the type or the shape of the support of the distribution, we can always build a hierarchy of partitions.

With a large enough k we can compute the k-nearest neighbor graph of the data set, and for each k' ≤ k we need no additional distance computations to find the k'-nearest neighbors. Each partition in the hierarchy is a coarsification of the preceding partition. In other words, if k1 ≤ k2 then each equivalence class in the partition induced by k1 is completely contained in one equivalence class of the partition induced by k2. Once the hierarchy is postulated, the "right" number of clusters can be found using well known intra-cluster/inter-cluster stress measures.

If our goal is to produce an "automatic" clustering procedure, we need to find a priori the value of the parameter k, and to bound the search to a particular class of distributions inside the connected components of the support. In particular, we can restrict ourselves to vector spaces, or to metric spaces that can be embedded into a low dimensional vector space. It is not necessary to explicitly find the mapping; instead we can postulate a null hypothesis.

In [BCQY97] both analytical and experimental bounds for k are proved, depending on both the size of the sample and the dimension of the embedding vector space, provided the distribution is regular enough.


The bounds proved are not tight, but they postulate a compact support and a uniform distribution inside the clusters. Under this setup, the connectivity constant can be estimated using a Monte Carlo approach. The idea is to estimate the minimum k such that there is only one cluster for a given sample size, obtained from a uniform distribution. The data obtained in the Monte Carlo simulation is fitted to a model of the same family as the theoretical bound, of the form k_d = a_d + b_d log(n), with different results for each intrinsic dimension d.

2.3 The Range r Graph

We propose an alternative procedure to detect clusters and outliers, based on the same foundations as the MkNN graph. The central idea is to use range queries instead of k-nearest neighbor queries. Each site s shares an edge with s' in the graph G_r if s' ∈ (s, r)_d, with (s, r)_d = {s' : d(s, s') ≤ r}.

With the above definition of G_r we can use the algorithm of Figure 1 with essentially no modifications to partition the data into clusters and outliers. The graph G_r is simpler than MkNN, and not having to check for symmetry speeds up the construction. In both approaches the use of an index to answer either k-nearest neighbor queries or range queries is advised in order to achieve subquadratic complexity.

    RangeCluster (r, s, S)
        let S' = { s' ∈ (s, r)_d } = {s'_1, ..., s'_j'}
        return {s} ∪ RangeCluster(r, s'_1, S) ∪ ... ∪ RangeCluster(r, s'_j', S)

    Partition (r, S)
        let S = {s1, ..., sn}
        while |S| > 0
            C ← RangeCluster(r, pick(S), S)
            output C
            S ← S − C
        endwhile

Figure 2: The range clustering algorithm; r is the ball radius.

In the algorithm RangeCluster presented in Figure 2, we can see that the procedure is almost the same for our clustering strategy as for the one obtained using the MkNN graph.
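A corresponding sketch (again ours, with brute-force range queries standing in for an index) of the RangeCluster partitioning of Figure 2:

    import math
    from typing import Callable, Hashable, List, Sequence, Set

    def range_partition(S: Sequence[Hashable], d: Callable, r: float) -> List[Set]:
        """Connected components of the range-r graph G_r (edge iff d(s, s') <= r)."""
        remaining, clusters = set(S), []
        while remaining:
            seed = next(iter(remaining))
            cluster, frontier = set(), [seed]
            while frontier:
                x = frontier.pop()
                if x in cluster:
                    continue
                cluster.add(x)
                # range query (x, r); an index would avoid scanning every remaining object
                frontier.extend(y for y in remaining if y not in cluster and d(x, y) <= r)
            clusters.append(cluster)
            remaining -= cluster
        return clusters

    if __name__ == "__main__":
        pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (50, 50)]
        print(range_partition(pts, math.dist, r=1.5))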


We proceed to obtain an adequate r according to our hypothesis that inside clusters the data is distributed almost uniformly. The procedure is to run a Monte Carlo simulation of the algorithm to obtain statistics on the minimum r such that there is only one cluster in the sample.

This algorithm performs on average O(n^(1+α)) distance computations (the operation of leading complexity), with 0 ≤ α ≤ 1 a constant depending on the intrinsic dimension of the sample data. The algorithm can detect outliers in the sample data and, if desired, it can produce a hierarchical structure (a dendrogram) pointing to clusters at different resolutions.

2.3.1 Outliers, Equivalence Classes and Stability

The random graphs discussed here are uniquely defined once the parameter τ (k and r, respectively) and the point set are fixed. The two examples we present in Section 2.2.1 have additional advantages. One nice feature is that the random graph can be computed incrementally. In other words, we start with a single point and keep adding edges and vertices until no more vertices appear. We may think of clusters as equivalence classes, and the incremental procedure ensures obtaining the same class for any representative. Hence, the method is stable and deterministic.

Certain points will be isolated in the MkNN or the Range r graphs, forming a connected component of low cardinality. If the equivalence class of a point is the point itself (or a very small number of elements, compared with the complete data set), this equivalence class will be considered an outlier of the sample. The discussion about the validity of this assumption is beyond the scope of this book. A common sense argument can be used, however. A small equivalence class corresponds (according to our hypothesis) to a very small component of the support of the distribution. This implies either a very small probability of having points in that particular location, or having points where the support vanishes. In either case, the presence of these points does not follow the same general rule as the rest of the sample.

A final remark on the outlier issue: if most of the equivalence classes detected have a small cardinality, this naturally implies a small (underestimated) value of the parameter τ, and not the presence of "many" outliers.

2.4 Radius vs. Neighbors

The controlling parameters of both clustering approaches (Range and MkNN graphs), the number of neighbors k and the radius r, must be estimated beforehand.


We face here a different problem, because we do not have a theoretical model for the searching radius; moreover, we do not expect to obtain a fixed radius that works for every data set. The searching radius must scale with the density of the points.

The goal of our Monte Carlo analysis is to find an invariant giving us a rule to estimate the radius of the RangeCluster procedure. We will have at hand only the sampled data, with no good information about the intrinsic dimension of the data. We can use only inter-point statistics to estimate the parameter r. In the particular case of spatial data (sampled from a Euclidean d-dimensional space) we can use intrinsic dimensionality estimators, but this does not apply to metric data from an arbitrary space.

To guide the search we ran the clustering algorithm on uniformly distributed data in the d-dimensional unit cube. The goal is to estimate the minimum radius that yields a single cluster. The variables of the experiment are the number of points and the dimension. We had to take into account that as the number of points increases the minimum inter-point distance decreases, since the cube becomes denser. Note also that the number of points inside a ball grows rapidly with the ball radius.

The invariant found was the average number of points inside the (s, r)_d ball. For a fixed dimension, independently of the number of points (the data density), we found that when the average cardinality of the (s, r)_d ball is above a certain threshold there is always a single cluster in our setup. The radius of the (s, r)_d ball was variable, revealing small changes in the local density even in the uniformly distributed data.

The rule is then simply stated as searching with a radius such that the ball captures at least n_d points. If the data is very concentrated (with high local density) then the searching radius is small, and conversely. This implies that we need inter-point statistics to estimate the radius of the (s, r)_d ball for the algorithm. Fortunately, we can use statistics from the index construction to propose the appropriate searching radius, since the construction of the index uses inter-point distances. In other words, this does not imply an overhead in the number of distance computations.

2.5 The Connectivity Parameters

Below we include the results of the Monte Carlo simulation for the connectivity of the MkNN graph from [BCQY97], for low dimensional data. The R² column is the fitness value; higher values are better. The sampled standard deviation σ̂_d is also shown. Once the intrinsic dimension of the data is estimated, we can use Table 1 to compute the connected components of the MkNN graph, as described.


The user will judge whether the obtained equivalence classes correspond to clusters or outliers, based on the cardinality and the relative position of the data.

    dim   adjusted model        R²      σ̂_d
    2      4.341 + 0.430 ln n   0.979   1.313
    3      3.422 + 0.646 ln n   0.984   1.744
    4      2.651 + 0.924 ln n   0.983   2.266
    5      1.859 + 1.219 ln n   0.976   2.674
    6      0.268 + 1.687 ln n   0.994   3.092
    7     -0.770 + 2.019 ln n   0.989   3.364
    8     -2.453 + 2.474 ln n   0.994   3.666

Table 1: Adjusted models for k (as described) in the MkNN random graph (uniform data in [0, 1]^d).

    Dimension   2      3      4      5      6      7      8      9      10
    Sites       4.37   5.07   5.51   6.05   6.22   8.03   7.78   8.59   10.32

Table 2: The minimum number of points to have a single cluster, in different dimensions.

For the Range r graph a similar table can be obtained after the corresponding Monte Carlo simulation. A threshold parameter must be found depending on the number of points and the (intrinsic) dimension of the data. In this case the threshold parameter cannot be easily stated in absolute numbers, since the inter-point distance is variable. If we fix the number of points in the sample and regard the "minimum radius such that m points are captured", we will obtain a table similar to Table 2. We can also fix the radius as a percentage of the minimum inter-point distance, the maximum inter-point distance, etc. The proper invariant should be selected according to the user's needs.
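As an illustration of how these tables can be used (a sketch of ours, assuming the straightforward reading of Tables 1 and 2 given in the text), the adjusted models give an estimate of the connectivity threshold k as a function of the sample size n, and Table 2 gives the ball-cardinality threshold for the Range r graph:

    import math

    # Coefficients (a_d, b_d) of the adjusted models k = a_d + b_d ln n of Table 1
    # (MkNN graph, uniform data in [0, 1]^d).
    TABLE_1 = {2: (4.341, 0.430), 3: (3.422, 0.646), 4: (2.651, 0.924),
               5: (1.859, 1.219), 6: (0.268, 1.687), 7: (-0.770, 2.019),
               8: (-2.453, 2.474)}

    # Table 2: average ball cardinality needed to obtain a single cluster (Range r graph).
    TABLE_2 = {2: 4.37, 3: 5.07, 4: 5.51, 5: 6.05, 6: 6.22,
               7: 8.03, 8: 7.78, 9: 8.59, 10: 10.32}

    def connectivity_k(dim: int, n: int) -> int:
        """Estimated minimum k for which n almost-uniform points of intrinsic
        dimension `dim` form a single MkNN cluster."""
        a, b = TABLE_1[dim]
        return math.ceil(a + b * math.log(n))

    def ball_threshold(dim: int) -> float:
        """Minimum average number of points the (s, r)_d ball should capture."""
        return TABLE_2[dim]

    if __name__ == "__main__":
        print(connectivity_k(4, 1000))   # ceil(2.651 + 0.924 ln 1000) = 10
        print(ball_threshold(4))         # 5.51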


2.6 Intrinsic Dimension

The intrinsic dimension of the data is an important invariant of metric space samples. Since one cannot directly assign coordinates or a notion of dimension to the data, we have to resort to vector spaces, where the notion of dimension is properly defined. If the intrinsic dimension of the data is known, the clustering structure can be detected. Conversely, if the cluster structure is known, the intrinsic dimension can be found.

Both arguments are valid. When discovering the clustering structure of a sample we are postulating a particular distribution of the data. This implies that we have an inverse problem: since the intrinsic dimension and the distribution are both unknown, we have to fix one of them. The clustering procedure described can thus also be used as an intrinsic dimensionality estimator. The Range r graph (or unit distance graph) and the MkNN graph can both serve this purpose. The procedure is the converse of the clustering procedure: we discover the clustering structure using a hierarchical approach, and then we obtain (from the appropriate table) the intrinsic dimension corresponding to the value of the parameter and the number of points used.

3 Morphological Stemming and the Holomorphic Distance

3.1 Motivation

The problem of conflating words sharing a common root is an important task in information retrieval. The standard solution for stemming is a rule-based approach inherited from Porter's algorithm. This type of solution closely follows the possible variations of a single stem into all its morphological instances. The technique suffers from several drawbacks, the main problem being the presence of noise in the pattern, i.e., spelling errors. When a word is misspelled, finding its correct stem by following rules is at least risky. A single edit error in a word may trigger the wrong rule, which will surely cause a mistake in the conflation.

In this section we describe a different alternative for stemming and conflation. This technique has been proposed in [Cha02]. The idea is to use the so-called holomorphic distance to detect words with similar stems, and then to use a clustering algorithm to isolate the clusters. The input is a dictionary of words and the output is a partition of the dictionary into words sharing a stem.

Traditional sequence comparison is done by carefully aligning the formant symbols of a sequence to obtain the minimum number of atomic operations (insert, delete, substitute) that convert one sequence into another. The de-facto choice for comparing sequences has been the Levenshtein or edit distance, and variations obtained by adding or inhibiting atomic operations. This naturally implies that the distance takes into account only local information.


Moreover, this kind of sequence comparison does not use the context of the strings, as the expert knowledge in some very focused fields cannot be meaningfully mapped into insertions/deletions/substitutions or other operations, mainly because the similarity perceived by the expert is not local.

The holomorphic distance for sequences is not based on the edit distance between words. The idea is to extract a number of features of the sequence (in this case a word in a document), and to use the features to build a vector, in the same fashion as is done for documents in IR. Below we give a more detailed description of the distance.

For information retrieval it is often desirable to conflate all the variations of a word into a single stem. This operation conflates, for example, house, housing, houses into the single stem hous. In this way, when the user queries the system with any of the variations of the word, all the documents containing the same stem will be retrieved. In a regular language (like English) there is no problem in building a stemmer using Porter's algorithm. In a Semitic or romance language the stem may not be a prefix of the word, and the construction of a stemmer using Porter's approach may not lead to an efficient solution. Another problem often faced in either kind of language is querying or indexing misspelled words. In this case even a perfect stemmer would fail to conflate the word to the proper stem. A morphological stemmer, on the other hand, makes use of the word structure to guide the conflation process. To this end it is necessary to assign a small distance to all the words with the same structure and a large distance to words with a different structure.

3.2 The Holomorphic Transformation

The holomorphic transformation generalizes the notion of similarity developed for documents. The general idea is to decompose the original sequence into a number of sub-sequences using a number of fixed rules. These rules act as feature extractors on the sequences: each rule obtains sub-sequences when a feature extractor is applied to a sequence. If Σ is the alphabet and s ∈ Σ* is a sequence, then a function φ : 2^Σ* → 2^Σ* is a feature extractor. We put no constraints on the type of constructions we allow for extracting features; any function will fit our needs. The only limits imposed (if any) are the computational complexity of obtaining the transformation, and the a priori knowledge of the designer.

The holomorphic transformation is a function H : Σ† → R^m, with m the size of the context [Σ†, S]. This transformation depends heavily on the set of feature extractors. The context is the vehicle to introduce domain knowledge into the problem of sequence comparison.


Note that once a sequence is transformed using the holomorphic transformation, we can use the cosine distance to compare the corresponding vectors.

The context is the generalization of the vocabulary of a document collection. Under this point of view, the distance between a given pair of sequences depends on the context; the distance is no longer absolute. Note also that positional information may be lost after the holomorphic transformation, just as it is when transforming documents into vectors. If we want to preserve positional information we need to be very careful with the design of the feature extraction functions. In the following subsection we discuss some examples of this.

3.3 A Morphological Stemmer Using Clustering

We can use the holomorphic distance to build a morphological stemmer. For a romance language the stem of a word is almost surely a certain prefix, and the syntactical category is a certain suffix. If we use q-grams, prefixes and swaps as feature extractors, and carefully select the relative weights of the individual features, we are in a position to measure the morphological closeness of word pairs. Tables 3 and 4 are examples of the closeness obtained using this approach. The idea of the distance between words is to find the maximum prefix shared by two words, allowing misspellings. This maximum prefix is weighted by the presence of similar prefixes in the corpus.

The complete algorithm for stemming using the holomorphic transformation has two stages. First, we find the clusters of the corpus using the holomorphic distance. These clusters coincide with the words conflated for sharing the same stem; this defines equivalence classes by morphological closeness. In the second stage we query the corpus and find the closest equivalence class, to report the same representative for each equivalence class member. This representative may be some normal form of the stem (e.g., the verb in infinitive form) or the largest common stem in the equivalence class.

A related problem in information retrieval and computational linguistics is syntactical tagging. Here the problem consists in assigning a label to each word, indicating whether it is a verb, adverb, noun, etc. A scheme similar to the one described for stemming may be followed to tag words.


         H. non-weighted     H. weighted      Edit distance
     1   apetitosa           apetitosa        apetitosa
     2   apetito             apetitoso        apetitoso
     3   apetitoso           apetito          aceitosa
     4   apetite             apetite          apestosa
     5   apetitiva           apetitiva        apetitiva
     6   apetitivo           apetitivo        apetito
     7   apetible            apetible         aceitoso
     8   apetecer            apetecer         acetoso
     9   apetecedor          apea             alentosa
    10   apetencia           apetencia        aparatosa

Table 3: The 10 nearest neighbors of the Spanish word apetitosa with the non-weighted and weighted holomorphic distances, and the edit distance. The edit distance reports words with different meanings (3-4, 7-10), while the holomorphic distance provides very strong closeness (except 9 in the weighted case).

         H. non-weighted     H. weighted      Edit distance
     1   dettagli            dettagli         dettagli
     2   dettaglio           dettaglio        dettaglio
     3   dettagliato         dettagliata      dettai
     4   dettagliati         dettagliate      dentali
     5   dettagliate         dettagliati      deragli
     6   dettagliare         dettagliato      dettami
     7   dettagliata         dettagliare      dettati
     8   dettagliante        dettagliante     dettavi
     9   dettaglianti        dettaglianti     rettali
    10   dettagliatamente    detta            battagli

Table 4: The 10 nearest neighbors of the Italian word dettagli with the non-weighted and weighted holomorphic distances, and the edit distance, with results similar to the Spanish case.
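The holomorphic transformation itself is defined in [Cha02] and is not reproduced here. As a rough, purely illustrative sketch of the general idea (the feature extractors, their weights and the example words below are our own choices, not the ones used to produce Tables 3 and 4), words can be mapped to sparse feature vectors built from character q-grams and prefixes and compared by the angle between the vectors:

    import math
    from collections import Counter

    def features(word: str, q: int = 2, max_prefix: int = 5) -> Counter:
        """Illustrative feature extractors: character q-grams and prefixes."""
        feats = Counter()
        for i in range(len(word) - q + 1):
            feats["gram:" + word[i:i + q]] += 1.0
        for j in range(1, min(max_prefix, len(word)) + 1):
            feats["pref:" + word[:j]] += 2.0      # weight prefixes higher (arbitrary choice)
        return feats

    def angle(u: Counter, v: Counter) -> float:
        """Angle between two sparse feature vectors (arccos of the cosine similarity)."""
        dot = sum(u[f] * v[f] for f in u.keys() & v.keys())
        nu = math.sqrt(sum(x * x for x in u.values()))
        nv = math.sqrt(sum(x * x for x in v.values()))
        if nu == 0.0 or nv == 0.0:
            return math.pi / 2
        return math.acos(min(1.0, dot / (nu * nv)))

    if __name__ == "__main__":
        for w in ["apetitoso", "apetecer", "aceitosa"]:
            print(w, round(angle(features("apetitosa"), features(w)), 3))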


4 Clustering for Approximate Proximity Search

In this section we describe an approximate proximity search algorithm based on a clustering data structure. The clustering method we use is in the same spirit as that described in Section 2.2, although simplified.

The List of Clusters data structure [CN00] is a list of "zones". Each zone has a center c and a covering radius cr(c), defined as the maximum distance between c and any element of the zone. A parameter of the structure is the number m of elements inside each zone (counting the center).

To build the structure, a first center c ∈ U is chosen at random, and its m − 1 nearest neighbors are found. A zone is made up of c and these m − 1 neighbors. Its covering radius cr(c) is defined as the distance to the (m − 1)-th neighbor. Once this first zone, I, is determined, we remove its elements from the set and build the rest of the list over E = U − I. The next center is chosen as the element that maximizes the sum of distances to the previous centers.

The construction process returns a list of triples (c_i, r_i, I_i) (center, radius, zone elements). Figure 3 shows the pseudocode of the construction algorithm. Note that this data structure is asymmetric, because the first center chosen has preference over the next centers in case of overlapping balls. The brute force algorithm for constructing the list takes O(n²/m) distance computations, but it can be improved using auxiliary data structures to find the neighbors.

    BuildList (Set of objects U)
        if U = ∅ then return an empty list
        select a center c ∈ U
        I ← the m elements of U closest to c
        cr(c) ← max_{u ∈ I} d(c, u)
        E ← U − I
        return (c, cr(c), I) : BuildList(E)

Figure 3: Construction algorithm of the List of Clusters. Note that I includes c. The operator ":" is list concatenation. (The original figure also includes a graphical example with three zones (c1, r1), (c2, r2), (c3, r3).)


To solve a query (q, r), d(q, c) is computed, reporting c if it is within the query ball. Then, we search exhaustively inside I only if d(q, c) − cr(c) ≤ r, because otherwise the zone with center c cannot intersect the query ball. Also, given the asymmetry of the data structure, we can prune the search in the other direction: if the query ball is totally contained in (c, cr(c)), that is, if cr(c) − d(q, c) ≥ r, then we do not need to consider E, because all the elements inside the query ball have already been inserted into I. Figure 4 depicts the search algorithm.

    SearchList (List L, Query q, Radius r)
        if L = ∅ then return                    /* empty list */
        let L = (c, cr(c), I) : E
        evaluate d(c, q)
        if d(c, q) ≤ r then report c
        if d(c, q) ≤ cr(c) + r then search I exhaustively
        if d(c, q) > cr(c) − r then SearchList(E, q, r)

Figure 4: Search algorithm on the List of Clusters.

The search cost of this data structure has a form close to O(n^α) for some 0.5 < α < 1.0 [CN00].
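The following is a minimal in-memory Python sketch (ours, not part of the chapter) of the construction of Figure 3 and the exact range search of Figure 4. The first center is simply the first element (the text picks it at random), and the m closest elements are found by brute force where the text suggests auxiliary structures.

    import math
    from typing import Callable, List, Sequence, Tuple

    def build_list(U: Sequence, d: Callable, m: int) -> List[Tuple]:
        """List of Clusters construction (Figure 3), brute force."""
        rest = list(U)
        zones, sum_to_centers = [], {id(x): 0.0 for x in rest}
        center = rest[0] if rest else None
        while rest:
            # zone = the m elements of `rest` closest to the center (the center included)
            by_dist = sorted(rest, key=lambda x: d(center, x))
            I = by_dist[:m]
            cr = d(center, I[-1])                       # covering radius
            zones.append((center, cr, I))
            rest = by_dist[m:]                          # E = U - I
            if not rest:
                break
            for x in rest:                              # next center maximizes the sum of
                sum_to_centers[id(x)] += d(center, x)   # distances to previous centers
            center = max(rest, key=lambda x: sum_to_centers[id(x)])
        return zones

    def search_list(zones: List[Tuple], d: Callable, q, r: float) -> List:
        """Exact range search over the List of Clusters (Figure 4)."""
        result = []
        for c, cr, I in zones:
            dqc = d(q, c)
            if dqc <= cr + r:                           # zone may intersect the query ball
                result.extend(x for x in I if d(q, x) <= r)
            if dqc <= cr - r:                           # query ball fully inside the zone
                break                                   # the rest of the list can be skipped
        return result

    if __name__ == "__main__":
        pts = [(i % 7, i // 7) for i in range(40)]
        zones = build_list(pts, math.dist, m=5)
        print(search_list(zones, math.dist, q=(3.2, 2.1), r=1.0))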


4.1 The Vector Model for Information Retrieval

In IR [BYRN99] a document is defined as the retrieval unit. This can be a paragraph, an article, a Web page, etc. The classical IR models consider that every document is described by a representative set of keywords called terms. A term is a word whose semantics help describe the principal topics of a document.

The most popular of these models, the vector model, considers a document as a t-dimensional vector, where t is the total number of terms in the system. Each vector coordinate i is associated to a term of the document, and its value is a positive weight w_ij if that term belongs to the document, or zero if not. If D is the set of documents and d_j is the j-th document of D, then d_j = (w_1j, w_2j, ..., w_tj).

In the vector model, a query object q can be a list of a few terms or even a whole document. The similarity between a document d_j and a query q is taken as the similarity between the vectors d⃗_j and q⃗, quantified as the cosine of the angle between the two vectors:

    sim(d_j, q) = (d⃗_j · q⃗) / (|d⃗_j| |q⃗|)
                = Σ_{i=1..t} w_ij w_iq / sqrt( (Σ_{i=1..t} w_ij²) (Σ_{i=1..t} w_iq²) )        (1)

where w_iq is the weight of the i-th term of the query q. The weights of the terms can be calculated using tf-idf schemes:

    w_ij = f_ij × log(N / n_i),    f_ij = freq_ij / max_{ℓ=1..t} freq_ℓj

where N is the total number of documents, n_i is the number of documents where the i-th term appears, and f_ij is the normalized frequency of the i-th term: freq_ij is the frequency of the i-th term in d_j.

If we consider the documents to be points in a metric space, then the problem of searching for documents similar to a given query reduces to a proximity search in the metric space. Since sim(d_j, q) is only a similarity function, we use the angle between vectors d⃗_j and q⃗, d(d_j, q) = arccos(sim(d_j, q)), as our distance metric, so (D, d) is our metric space.

Despite this clear link, metric space techniques have seldom been used for this purpose. One reason is that the metric space of documents has a very high dimension, which makes any exact search approach unaffordable. In fact, there is no exact algorithm that can avoid an almost exhaustive scan of the database to answer proximity queries in this particular metric space.

The standard practice is to use an inverted index to find the documents that share terms with the query and then to search this set of candidate documents by brute force. This technique behaves well on queries of a few terms, but it becomes impractical when the queries are whole documents, as required, for example, in relevance feedback scenarios.

In the latter case, it is customary to use heuristics, because the definition of relevance is already fuzzy. In most cases, finding some good answers is as good as finding them all, particularly when the search cost is drastically reduced. This is a case where metric space searching with approximation algorithms is of great value, as it is much better to use an approximation whose quality and degree of exactness is well understood than a heuristic that resists analysis.
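For concreteness, a small sketch (ours, with a toy document collection) of the tf-idf weighting and the angle distance d(d_j, q) = arccos(sim(d_j, q)) defined above:

    import math
    from collections import Counter
    from typing import Dict, List

    def tf_idf_vectors(docs: List[List[str]]) -> List[Dict[str, float]]:
        """Weights w_ij = f_ij * log(N / n_i), with f_ij the frequency of term i in
        document j normalized by the most frequent term of that document."""
        N = len(docs)
        n_i = Counter(term for doc in docs for term in set(doc))   # document frequency
        vectors = []
        for doc in docs:
            freq = Counter(doc)
            max_freq = max(freq.values()) if freq else 1
            vectors.append({t: (f / max_freq) * math.log(N / n_i[t]) for t, f in freq.items()})
        return vectors

    def angle_distance(u: Dict[str, float], v: Dict[str, float]) -> float:
        """d(d_j, q) = arccos(sim(d_j, q)), with sim the cosine of equation (1)."""
        dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
        norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
        if norm == 0.0:
            return math.pi / 2
        return math.acos(max(-1.0, min(1.0, dot / norm)))

    if __name__ == "__main__":
        docs = [["metric", "space", "search"], ["space", "shuttle"], ["text", "search", "search"]]
        vecs = tf_idf_vectors(docs)
        query_vec = vecs[0]                      # a whole document used as the query
        print([round(angle_distance(v, query_vec), 3) for v in vecs])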


4.2 Techniques for Approximate Proximity Searching

We describe a search technique where the user can tune between efficiency and quality of the answer. The maximum number of distance computations we can perform is fixed and denoted by quota. Once quota has been reached, no more elements can be considered. The search algorithm is approximate in the sense that some relevant elements can be missed because we could not find them before exhausting the quota. Hence, it is crucial to use the allotted quota efficiently, that is, to find as many elements of the result as possible, as soon as possible.

The technique described in this section is called ranking of zones [BN02]. The idea is to sort the zones of the List of Clusters so as to favor the most promising ones, and then to traverse the list in that order. The sorting criterion must aim at quickly finding elements that are close to the query object. As the space is partitioned into zones, we must sort these zones using the information given by the index data structure.

We compute d(q, c) for every center, and estimate how promising a zone is using only d(q, c) and cr(c) (which is precomputed). One would not only like to search first the zones closer to the query, but also to search first the zones that are more compact, that is, the zones with smaller covering radii (since all the zones have the same number of elements). Some zone ranking criteria are (all in increasing order, see Figure 5):

- d(q, c): the distance from q to each zone center.
- cr(c): the covering radius of each zone.
- d(q, c) + cr(c): an upper bound on the distance from q to the farthest element of the zone.
- d(q, c) − cr(c): a lower bound on the distance from q to the closest element of the zone.
- β(d(q, c) − cr(c)): what we call dynamic beta.

The first two are the simplest ranking criteria. The third criterion aims to search first those zones that are close to q and also compact. The fourth criterion uses the lower bound, given by the index structure, on the distance between the query object and any element of the zone. If the factor β is fixed, then the last criterion is equivalent to the criterion d(q, c) − cr(c), because the ordering is the same. However, instead of using a constant factor β ∈ [0, 1], we use a dynamic factor of the form β = 1/(1.0 − cr(c)/mcr), where mcr is the maximum covering radius over all zones. This implies that we reduce the search radii more in zones with larger covering radii.
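A sketch (ours) of the quota-bounded approximate search over a prebuilt List of Clusters, ranking zones by the criteria listed above; the handling of the degenerate case cr(c) = mcr in the dynamic beta factor is our own choice:

    import math
    from typing import Callable, List, Tuple

    def ranked_search(zones: List[Tuple], d: Callable, q, r: float,
                      quota: int, criterion: str = "dynamic_beta") -> List:
        """Visit zones in 'most promising first' order; stop after `quota` distance evaluations.

        `zones` is the (center, covering radius, elements) list of the List of Clusters.
        """
        mcr = max(cr for _, cr, _ in zones) or 1.0
        keys = {
            "d(q,c)":       lambda dqc, cr: dqc,
            "cr(c)":        lambda dqc, cr: cr,
            "d(q,c)+cr(c)": lambda dqc, cr: dqc + cr,
            "d(q,c)-cr(c)": lambda dqc, cr: dqc - cr,
            "dynamic_beta": lambda dqc, cr: (dqc - cr) / max(1e-9, 1.0 - cr / mcr),
        }
        key = keys[criterion]

        spent, ranked = 0, []
        for c, cr, I in zones:                      # one distance evaluation per center
            dqc = d(q, c)
            spent += 1
            ranked.append((key(dqc, cr), dqc, cr, I))
        ranked.sort(key=lambda t: t[0])             # most promising zones first

        result = []
        for _, dqc, cr, I in ranked:
            if dqc - cr > r:                        # zone cannot intersect the query ball
                continue
            for x in I:
                if spent >= quota:
                    return result                   # quota exhausted: approximate answer
                spent += 1
                if d(q, x) <= r:
                    result.append(x)
        return result

    if __name__ == "__main__":
        # three toy zones: (center, covering radius, elements)
        zones = [((0.0, 0.0), 1.5, [(0.0, 0.0), (1.0, 0.5), (0.5, 1.0)]),
                 ((5.0, 5.0), 1.0, [(5.0, 5.0), (5.5, 5.5)]),
                 ((9.0, 0.0), 2.0, [(9.0, 0.0), (8.0, 1.0)])]
        print(ranked_search(zones, math.dist, q=(0.6, 0.6), r=1.0, quota=6))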


Figure 5: Some zone sorting criteria: (a) d(q, c); (b) cr(c); (c) d(q, c) + cr(c); (d) d(q, c) − cr(c).

4.3 Experimental Results

Figure 6 shows the results of experiments on a subset of 25,000 documents from The Wall Street Journal 1987-1989, from TREC-3 [Har95]. We compare the approximate algorithm using different ranking criteria. We used clusters of m = 10 elements and show queries with search radii that return, on average, 9 and 16 documents from the set. For example, on the left we see that, using the criterion d(q, c) + cr(c), we can retrieve 90% of the results using 10,000 distance computations, that is, examining 40% of the space. We recall that all the exact algorithms require examining almost 100% of this space.

The results show that the approximate algorithms can handle this space well, and that the best criteria were cr(c) and dynamic beta. We could retrieve more than 99% of the relevant objects while traversing merely 17% of the database. This is the first feasible metric space approach to this long-standing problem.


Figure 6: Comparison among different criteria in a document space (distance evaluations vs. fraction of the result actually retrieved), retrieving on average 9 elements (top) and 16 elements (bottom).

To show to what extent the concept of clustering has been essential for this good result, we consider a tempting declustering idea.


To show to what extent the concept of clustering has been essential for this good result, we consider a tempting declustering idea. Instead of sorting the clusters according to how promising their centers look, let us sort all the elements separately, according to the extra information given by their distance to their center. For example, the lower bound criterion groups all the elements of a zone of center c under the value d(q,c) - cr(c). If we knew that an element u in this zone is at distance d(c,u) from its center, we could refine its estimate to d(q,c) - d(c,u). This, however, performs worse than the original method. We conjecture that the reason is that we lose valuable clustering information when we rank each element separately.

Finally, we note that this data structure is reminiscent of the clustering techniques based on nearest neighbors covered in previous subsections, but it is rougher and simpler. It is likely that better results can be obtained with a more sophisticated clustering algorithm.

5 Clustering for Metric Index Boosting

In this section we explore a different alternative for clustering. Instead of directly detecting the data groupings by distance, we will detect the clusters in a different domain. For indexing purposes we are interested in characterizing the intrinsic difficulty of a given dataset. A step towards this is detecting segments of the dataset which are more difficult to index than others. Here we propose a particular procedure that obtains clusters of data which are not necessarily close to each other in the distance domain. This can be put in the following terms: in classical clustering we are interested in minimizing the intracluster distance d(x,y) while maximizing the intercluster distance; in our version of clustering we are interested in a different property, namely minimizing an intracluster measure f(x,y) and hence maximizing the intercluster measure. The goal is to split the data into difficult-to-index and easy-to-index sets. We begin by proposing a two-way split, but this can be generalized to m-way splits as well.

Most indexing algorithms for proximity searching have tuning parameters. These parameters allow one to balance construction time, memory usage and search time, adapting the performance of the index to the characteristics of the data. The most relevant feature of a metric data set is how the data is distributed. Finding the underlying structure of a data set is very useful to design an indexing algorithm. In particular, knowing how the elements are clustered in the metric space helps us identify the hardest regions to be searched. Once the regions are categorized as easy, medium or hard for searching, we can locally tune the parameters for each region.


Moreover, we can build independent indexes for each region, and search each index separately at search time. This has proven to be more efficient than using global parameters and a single index. Another application of the local parameterization technique could be, for example, to index one part of the database with an exact searching index and another part with an approximate index that gives good answers almost all the time.

In this section we use the data distribution to segment the database into just two parts: the hardest to be searched and the rest. This can be generalized to a finer partition, but we content ourselves with illustrating the technique. Notice that this no longer corresponds to traditional clustering, where we aim at grouping data that are spatially close. Rather, we group data that share some common properties. This can be seen as grouping data that are close after applying a change of domain.

One way of visualizing the data distribution is by using distance histograms. Given a metric space (X,d) and an element p in X, the local histogram with respect to the reference point p is the distribution of the distances from p to all the elements x in X. The local histogram can be very different from the global histogram of all the distance pairs in X. However, if several local histograms are similar, then we can predict the behavior of the global histogram of a data set U ⊆ X.

One of the main difficulties in metric space searching is the so-called curse of dimensionality. Some metric spaces (called "high dimensional") have a very concentrated histogram, with small variance and typically large mean. This means that random pairs of distances are very similar to each other, or alternatively, that from the point of view of a given element p, all the others are more or less at the same distance. All indexing methods are based on precomputing some distances and inferring lower bounds on other distances [CNBYM01]. For example, if the index has precomputed d(p,u) and, when searching for (q,r), we compute d(p,q), then we know by the triangle inequality that d(q,u) ≥ |d(p,u) - d(p,q)|, so we can discard u without ever computing d(q,u) if it turns out that |d(p,u) - d(p,q)| > r. However, this (and any other attempt to avoid computing d(q,u) for every u) becomes useless if the space is high dimensional, since in this case d(p,u) will be very close to d(p,q). Still, for those elements u that lie far away from the central region of the local histogram of p, the reference point p is a good tool to discard them.
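The following minimal sketch (illustrative names, not code from the chapter) applies exactly this discarding rule with a single reference point p:

    def pivot_filter(pivot_dists, d_pq, r):
        """Split candidates into those discarded by the triangle inequality and
        those that still must be compared against the query.

        pivot_dists: dict mapping each element u to the precomputed d(p,u).
        d_pq:        the single distance d(p,q) computed at query time.
        r:           the radius of the range query (q,r)."""
        survivors, discarded = [], []
        for u, d_pu in pivot_dists.items():
            if abs(d_pu - d_pq) > r:      # lower bound on d(q,u) already exceeds r
                discarded.append(u)
            else:
                survivors.append(u)       # d(q,u) must be computed explicitly
        return survivors, discarded

In a high-dimensional space most values d(p,u) concentrate around d(p,q), so |d(p,u) - d(p,q)| rarely exceeds r and almost every element survives the filter; these are exactly the elements that the hard kernel, defined next, collects.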


If a group of elements is at the same time in the central region of the histograms of several reference points, then those elements represent a subset where searching is inherently difficult. We call this group of elements the hard kernel of the space and denote it hk(X,d). The remaining elements then belong to a soft kernel denoted sk(X,d). The idea is then to index and search separately the hard and soft kernels. That is:

- Partition the data set U into hk(U,d) and sk(U,d).
- Index hk(U,d) and sk(U,d) separately.
- Solve (q,r) in U by searching hk(U,d) and sk(U,d) separately.

Detecting hk(U,d) is simple: we just intersect the central regions of the local histograms of several different reference points p. After finding hk(U,d), sk(U,d) is its complement.

Figure 7 describes the detection process of the hard kernel of a data set U. The parameter s is the fraction of elements that should belong to the hard kernel. The parameter r is the cutting radius used to delimit the central region in the local histogram of the reference point p. The idea is to take the elements surrounding the median of the histogram.

    Compute_hk(Set of objects U, Fraction s, Radius r):
        hk(U,d) <- U
        choose a point p in U
        while |hk(U,d)| > s * |U| do
            m <- median{ d(p,u) : u in U }
            hk(U,d) <- hk(U,d) ∩ { x in U : d(x,p) in [m - r, m + r] }
            if U \ hk(U,d) != ∅ then
                choose a point p in U \ hk(U,d)
            else
                choose a point p in U
        return hk(U,d)

Figure 7: Algorithm that finds hk(U,d).
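For concreteness, a small executable sketch of Figure 7 could look as follows (the function name, the random choice of reference points and the set representation are illustrative assumptions of ours):

    import random
    from statistics import median

    def compute_hk(U, dist, s, r, rng=random.Random(0)):
        """Sketch of Figure 7: shrink the hard kernel hk(U,d) by intersecting it
        with the central region [m - r, m + r] of the local histogram of one
        reference point after another, until at most a fraction s of U remains.
        U is a list of hashable objects and dist is the metric d."""
        hk = set(U)
        p = rng.choice(U)
        while len(hk) > s * len(U):
            d = {u: dist(p, u) for u in U}            # local histogram of p
            m = median(d.values())
            hk &= {x for x in U if m - r <= d[x] <= m + r}
            outside = [x for x in U if x not in hk]
            p = rng.choice(outside) if outside else rng.choice(U)
        return hk                                      # sk(U,d) is set(U) - hk

In the experiments below, hk(U,d) and its complement sk(U,d) are each indexed with their own, individually tuned, index.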


To test this algorithm we used a data structure for metric spaces called GNAT, explained next.

5.1 GNATs

GNATs (Geometric Near-neighbor Access Trees [Bri95]) are m-ary trees built as follows. We select, for the root node, m centers c1 ... cm, and define Ui = {u in U : d(ci,u) < d(cj,u), for all j != i}. That is, Ui contains the elements closer to ci than to any other cj. From the root, m children numbered i = 1 ... m are built recursively as GNATs for the sets Ui. Figure 8 shows a simple example of the first level of a GNAT.

Figure 8: Example of the first level of a GNAT with m = 4. The elements u1 ... u15 are split among the centers u2, u5, u3 and u9; each child of the root holds the elements closest to its center.

The GNAT stores at each node an O(m^2)-size table range_ij = [min_{u in Uj} d(ci,u), max_{u in Uj} d(ci,u)], with the minimum and maximum distances from each center to each class. The tree requires O(nm^2) space and is built in close to O(nm log_m n) time.

At search time, the query q is compared against some center ci and we discard any other center cj such that the interval [d(q,ci) - r, d(q,ci) + r] does not intersect range_ij, since in that case the whole subtree Uj can be discarded by the triangle inequality. The process is repeated with other (randomly chosen) centers until no more can be discarded. The search then enters recursively into each non-discarded subtree. In the process, any center close enough to q is reported.

The performance of the GNAT depends heavily on the arity of the tree. The best arity is different for each metric space, and even for different subsets of the data set. In particular, the hard kernel happens to require a large arity, while the soft kernel is searched better with a smaller-arity tree. Hence, we illustrate our technique by choosing different arities for hk and sk.
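The following sketch illustrates the first level of the construction and the center-discarding rule (it is ours, not the chapter's: only the root level is built, centers are chosen at random, and GNATNode is an illustrative name):

    import random

    class GNATNode:
        """First level of a GNAT: m centers, one bucket per center, and the
        range table with min/max distances from each center to each class."""
        def __init__(self, U, dist, m, rng=random.Random(0)):
            self.centers = rng.sample(U, min(m, len(U)))
            self.buckets = {c: [] for c in self.centers}
            for u in U:
                if u in self.centers:
                    continue
                nearest = min(self.centers, key=lambda c: dist(c, u))
                self.buckets[nearest].append(u)
            # range_ij = [min, max] distance from center ci to the elements of class Uj
            self.range = {ci: {cj: (min((dist(ci, u) for u in self.buckets[cj]), default=0.0),
                                    max((dist(ci, u) for u in self.buckets[cj]), default=0.0))
                               for cj in self.centers}
                          for ci in self.centers}

        def surviving_classes(self, q, r, dist):
            """Classes that cannot be discarded for the range query (q, r)."""
            alive = set(self.centers)
            for ci in self.centers:                    # the chapter picks ci at random
                if ci not in alive:
                    continue
                dqc = dist(q, ci)
                for cj in list(alive):
                    lo, hi = self.range[ci][cj]
                    if dqc + r < lo or dqc - r > hi:   # [d(q,ci)-r, d(q,ci)+r] misses range_ij
                        alive.discard(cj)
            return alive

A full implementation would also report every center ci with d(q,ci) <= r and recurse into the GNATs built for the surviving classes.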


5.2 Experimental Analysis

We experimented with a metric space of strings under the edit distance (also called "Levenshtein distance"). This function is discrete and computes the minimum number of characters that we have to insert, change and/or delete in one word to obtain the other. This distance has applications in information retrieval, signal processing and computational biology.

We used a Spanish dictionary of 86,061 words, and experimented with different values for s and r. The experimental setup consists in finding the best arity for the whole dictionary, and then splitting the dictionary in several ways to find a good proportion of soft/hard kernel, tuning the individual arities of the two parts. Each combination is compared against the best arity for the whole dictionary. This ensures a fair comparison, since we compete with the best possible tuning of the GNAT for the whole dictionary against the individual tuning for the hard/soft kernels.

We chose 500 random words from the dictionary and searched for them using distance radii r in 1 ... 4. For each search we computed the ratio of the search cost using separate indexes versus one global index. A value of 1 indicates the same performance in either approach, while a smaller value implies that the cluster-based approach is better than the standard one. Each point in the graphs corresponds to the average over the 500 queries, hence we expect a low variance for this measure.

Figure 9 shows the tuning of the GNATs. We observe that an arity of 256 is the best we can do for low-selectivity queries, while an arity of 128 is the best for high-selectivity queries. We must choose a single arity if we will use only one index for all types of queries; we chose an arity of 256 to compare with the clustered approach.
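For reference, the edit distance used in these experiments can be computed with the standard dynamic-programming recurrence; the following sketch is a generic implementation, not the one used in the experiments:

    def edit_distance(a: str, b: str) -> int:
        """Minimum number of character insertions, deletions and substitutions
        needed to turn string a into string b (Levenshtein distance)."""
        prev = list(range(len(b) + 1))      # distances from a[:0] to every prefix of b
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                 # delete ca
                                curr[j - 1] + 1,             # insert cb
                                prev[j - 1] + (ca != cb)))   # substitute (or keep)
            prev = curr
        return prev[-1]

    # Example: edit_distance("clustering", "clusters") == 3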

Figure 9: Tuning the GNAT. We try to find the best arity for the GNAT and find that it is not the same for low- and high-selectivity queries; GNAT-256 is the best for most of the radii. The plot shows distance computations versus the search radius (1-4) on the whole dictionary, with one curve per arity (GNAT-16, GNAT-32, GNAT-64, GNAT-128, GNAT-256).


For each cutting radius and each kernel proportion we tested all the combinations of arities. In Figure 10 we show an example of such an experiment, for a cutting radius of 2 and a proportion of 0.4 in the hard kernel and 0.6 in the soft kernel, with a fixed arity of 128 in the hard kernel, in order to find the best arity for the soft kernel. This experiment was exhaustive over all the possible combinations.

In Figure 11 we show how the different proportions compare to the original dictionary. We observe a systematic improvement for high-selectivity queries, while the improvement is smaller for low-selectivity queries. This is a natural consequence of the non-monotonicity of the tuning of the GNAT, but it is not explained solely by this, since for metric range queries of radius 2 both arities 128 and 256 in Figure 9 have the same performance, and the clustered index is better even in this case. We also observe that a balanced partition is the best choice, using an arity of 128 in each segment.

Figure 10: For each cutting radius and kernel proportion we selected the best combination of arities. The plot shows distance computations versus the search radius (1-4) for the kernel split with cr = 2 and s = 0.40, GNAT-128 in the hard kernel, and one curve per soft-kernel arity (GNAT-16, GNAT-32, GNAT-64, GNAT-128, GNAT-256).

The example we have presented can be improved in a number of ways, for example by partitioning the data into more than two clusters, or by building a cluster hierarchy. More improvements can be expected by using different clustering strategies: the use of local histograms is a fast technique, but more costly techniques may produce a better segmentation.


Figure 11: Once the best combination for each kernel proportion is selected, we compare it with the best-tuned GNAT. A balanced partition with arity 128 seems to be the best choice. The plot shows, for each radius (1-4), the search-cost ratio of the kernel split (cr = 2, proportions s = 0.1 to 0.9, each with its best hard/soft arity combination) relative to the whole dictionary indexed with GNAT-256.

Acknowledgements

We acknowledge the support of CYTED Project VII.19 RIBIDI. The first and last authors also acknowledge the support of the Center for Web Research, Millennium Research Initiative, Mideplan, Chile.

References

[BCQY97] M. R. Brito, E. L. Chávez, A. J. Quiroz, and J. E. Yukich. Connectivity of the mutual k-nearest-neighbor graph in clustering and outlier detection. Statistics & Probability Letters, 35:33-42, 1997.

[BN02] B. Bustos and G. Navarro. Probabilistic proximity searching based on compact partitions. In Proc. 9th International Symposium on String Processing and Information Retrieval (SPIRE'02), LNCS, Springer-Verlag, 2002.

[Bri95] S. Brin. Near neighbor search in large metric spaces. In Proc. 21st Conference on Very Large Databases (VLDB'95), pages 574-584, 1995.


[BYRN99] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley, 1999.

[Cha02] E. Chávez. Knowledge based distances for sequence comparison. In Sistemi Evoluti per Basi di Dati (SEBD), pages 34-47, 2002.

[CN00] E. Chávez and G. Navarro. An effective clustering algorithm to index high dimensional metric spaces. In Proc. 7th International Symposium on String Processing and Information Retrieval (SPIRE'00), pages 75-86. IEEE CS Press, 2000.

[CNBYM01] E. Chávez, G. Navarro, R. Baeza-Yates, and J. Marroquín. Proximity searching in metric spaces. ACM Computing Surveys, 33(3):273-321, 2001.

[FL95] C. Faloutsos and K. Lin. FastMap: a fast algorithm for indexing, data mining and visualization of traditional and multimedia datasets. ACM SIGMOD Record, 24(2):163-174, 1995.

[GRG+99] V. Ganti, R. Ramakrishnan, J. Gehrke, A. L. Powell, and J. C. French. Clustering large datasets in arbitrary metric spaces. In International Conference on Data Engineering, pages 502-511, 1999.

[Har75] J. Hartigan. Clustering Algorithms. John Wiley & Sons, New York, 1975.

[Har95] D. Harman. Overview of the Third Text REtrieval Conference. In Proc. Third Text REtrieval Conference (TREC-3), pages 1-19, 1995. NIST Special Publication 500-207.

[JD88] A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall Advanced Reference Series, 1988.

[MC85] G. W. Milligan and M. C. Cooper. An examination of procedures for determining the number of clusters in a data set. Psychometrika, 50:159-179, 1985.

[Zha71] C. T. Zahn. Graph theoretical methods for detecting and describing gestalt clusters. IEEE Transactions on Computers, 20:68-86, 1971.


[ZRL96] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: an efficient data clustering method for very large databases. In ACM SIGMOD International Conference on Management of Data, pages 103-114, 1996.
