Intelligent Clustering Engine

  • 7/27/2019 Intelligent Clustering Engine

    1/10

    ICE - Intelligent Clustering Engine: A clustering gadget for Google Desktop

    Lando M. di Carlantonio a,*, Bruno A. Osiek a, Geraldo B. Xexéo a, Rosa Maria E.M. da Costa b

    a Universidade Federal do Rio de Janeiro - UFRJ, COPPE - Programa de Engenharia de Sistemas e Computação, Cidade Universitária, CT, H319, CEP 21941-972 Rio de Janeiro, RJ, Brazil

    b Universidade do Estado do Rio de Janeiro - UERJ, IME - Departamento de Informática e Ciência da Computação, Rua São Francisco Xavier, 524, B-6o, CEP 20550-013 Rio de Janeiro, RJ, Brazil

    Article info

    Keywords:

    Information retrieval
    Text mining
    Document clustering
    Genetic Algorithms

    Abstract

    In light of the increased capacity and lower prices of computer hard drives, a new universe to be explored emerges: the microcosm of personal files. Although search and information retrieval techniques are already widely used on the Internet, their application on personal computers is still incipient. This paper describes a new tool for document clustering on the desktop, whose effectiveness in obtaining groups of similar documents is evidenced by the experimental results.

    © 2012 Elsevier Ltd. All rights reserved.

    1. Introduction

    Despite the increasing amount of information available on the Internet, storing files on personal computers is a common habit among Internet users, which is essentially justified by three reasons:

    • availability is not always permanent: a shortcut in the favorites folder that points to a document that no longer exists is useless;

    • although the information is probably available on the Internet at more than one site, the user avoids having to locate it again;

    • obtaining information is not always immediate: the time involved depends on the file size and connection speed.

    But this habit creates a new problem for users when the storage space available on their machines becomes abundant: how can the desired information be found in a simple, fast and efficient way?

    Even users who do not have this habit, when they need to find something among the scores of shortcuts saved in their favorites folder, are faced with the question: which one leads to the page where the desired information can be found?

    In fact, the information on the Internet does not have a rigorous organization. The impossibility of maintaining order in a vast and diversified structure is not considered an obstacle to its global acceptance as a useful knowledge repository. Nevertheless, spending little time and effort on the search for specific information is a very valuable aspect (Liu, Wu, & Liu, 2011). Tools that strive for simplicity and agility in information retrieval have been prominent among those offered by the Internet. Google (Google Inc., 2011g) and Gmail (Google Inc., 2011f) are great examples.

    The philosophy adopted by Gmail (Google Inc., 2010) poses certain questions whose answers can corroborate the relevance of the topic covered in this work:

    • Why waste time deciding which files can be discarded, relinquishing files that can be useful in the future, if disk space is no problem?

    • Why spend time sorting documents, if we can retrieve them quickly, whenever we need, through a simple search?

    • Why adopt a rigid structure for classifying documents, if they can be perceived as similar by criteria other than those imposed by the single hierarchy of a directory architecture?

    Without the use of efficient techniques for search and information retrieval, a great deal of time is consumed in organizing and obtaining the information needed.

    On the Internet, the use of such techniques is now widespread (Song, Choi, Park, & Ding, 2011), but on personal computers the tools are quite limited. The objective of this paper is to present a new tool, based on a system created by Carlantonio and Costa (2009): a clustering gadget to be used for file searches on desktop computers, called Intelligent Clustering Engine (ICE).

    In comparison to the Carlantonio and Costa system, we highlight the main contributions of the ICE system:

    • a new approach to a desktop indexer;
    • a new weight for sorting;
    • a new compact interface;
    • a new visualization;
    • comparative tests with other software;
    • test results obtained with a public domain database.

    0957-4174/$ - see front matter © 2012 Elsevier Ltd. All rights reserved.
    doi:10.1016/j.eswa.2012.02.101

    * Corresponding author.

    E-mail addresses: [email protected] (L.M. Carlantonio), [email protected] (B.A. Osiek), [email protected] (G.B. Xexéo), [email protected] (R.M.E.M. Costa).

    Expert Systems with Applications 39 (2012) 9524-9533

    Contents lists available at SciVerse ScienceDirect

    Expert Systems with Applications

    journal homepage: www.elsevier.com/locate/eswa

    This paper is organized as follows: in Section 2 we present an overview of some interesting tools available for desktops. In Section 3 we describe the ICE gadget, and in Section 4 we discuss the experimental results, followed by the conclusions and future work in Section 5.

    2. Desktop tools

    Among the few tools available for desktop search, four stand out: Aduna AutoFocus (Aduna, 2009a) (2.1), Google Desktop (Google Inc., 2011h) (2.2), Carrot2 (Carrot Search, 2011a) (2.3) and Ergo (Invu Services Ltd, 2009) (2.4). Among them, only Ergo is not a free tool. Next we present some information about them.

    2.1. Aduna AutoFocus

    Aduna AutoFocus is a desktop search application that uses a guided exploration strategy (Aduna, 2009a).

    There are versions available for Windows, Linux, Mac OS, and other platforms with Java support.

    Basically, the user selects the sources to be indexed, submits the query and then proceeds to explore the results, breaking them down through the selection of new terms and features.

    After a source is selected, which can be a directory, a network drive, IMAP, HTTP or HTTPS, the program indexes the files, identifying the 10 most significant terms of each document.

    If a document is among those returned by the search, its significant terms are offered to the user (the column of significant terms in the detail table) during the exploration of results, allowing the results to be filtered by these terms.

    Exploration of results can also be done by selecting facets, among which the keyword suggestions stand out. In general, the program offers up to 50 words considered the most relevant.

    The visualization of results is done through diagrams, called Cluster Maps, which are very similar to Venn or Euler diagrams, and whose main objective is to show whether and how the groups formed during the exploration overlap (Aduna, 2009b).

    This software supports various types of files, for example, MS Office, OpenOffice, TXT, HTML, PDF and XML. The search can be refined by choosing particular fields, such as text and title/subject, among others (Aduna, 2011).

    The program offers several operators that allow complex queries to be created, such as the fuzzy operator (~) and the proximity operator (~number). These two operators also exist in Lucene (The Apache Software Foundation, 2011b), with which Aduna AutoFocus has a certain similarity in terms of operators and query syntax.

    2.2. Google Desktop

    Google Desktop (Google Inc., 2011h) is a desktop search application that provides a sidebar similar to that of Windows 7 and Windows Vista, where gadgets can be included. There are versions for Windows 7/Vista/XP/2000, Linux, and Mac.

    Gadgets, in the software industry, are small programs that can be aggregated into a larger system (Wikimedia Foundation, 2011c). In addition to Google Desktop, gadgets are available in Windows 7/Vista, Mac OS X, KDE, Gnome, and iGoogle (Google Inc., 2011l).

    Google Desktop indexes many types of text files, besides music and video files (Google Inc., 2011i). We can also add other plugins (Google Inc., 2011j) that are specific to the source code of programming languages.

    Google Desktop has some specific operators, other than those of Google's site, such as the Web history operator for a specific site (site:), the operator to limit the search to a particular folder or directory and its subdirectories (under:), and the operator to limit the search to a specific computer (machine:) (Google Inc., 2011b). Google Desktop also provides a history of all files and Web sites accessed, sorted by date and time, through the "Timeline" item.

    2.3. Carrot2

    Carrot2 (Carrot Search, 2011a) is an open source framework for building clustering engines that group, into thematic categories, the results provided by search sites and programs.

    In the context of text mining, the clustering technique has this goal, i.e., to automatically cluster texts (or documents) on the same subject and separate texts on different subjects (Manning, Raghavan, & Schütze, 2008; Wikimedia Foundation, 2011a).

    As a formal definition of the problem, we have: given a set of $n$ documents, $X = \{X_1, X_2, \ldots, X_n\}$, where each $X_i \in \mathbb{R}^p$ is a vector with $p$ dimensions that measures the attributes of the document, the documents must be grouped so that the groups $C = \{C_1, C_2, \ldots, C_k\}$ are disjoint, where $k$ is a priori an unknown value and represents the number of groups (adapted from Hruschka & Ebecken, 2003).

    The following conditions must hold:

    (a) $C_1 \cup C_2 \cup \cdots \cup C_k = X$;
    (b) $C_i \neq \emptyset$, $\forall i$, $1 \le i \le k$;
    (c) $C_i \cap C_j = \emptyset$, $\forall i \neq j$, $1 \le i \le k$ and $1 \le j \le k$.
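
    As an illustration only, conditions (a) to (c) can be checked mechanically for a candidate grouping. The sketch below assumes that documents are identified by hashable ids rather than by their attribute vectors; the function name is hypothetical and is not part of ICE or SAGH.

```python
from typing import Hashable, Iterable, List, Set


def is_valid_partition(documents: Iterable[Hashable],
                       groups: List[Set[Hashable]]) -> bool:
    """Check conditions (a)-(c) for a hard clustering of `documents`."""
    universe = set(documents)
    # (b) every group must be non-empty
    if any(len(group) == 0 for group in groups):
        return False
    # (a) the union of the groups must be exactly the document set
    union = set().union(*groups) if groups else set()
    if union != universe:
        return False
    # (c) pairwise disjoint groups: the group sizes then add up to the
    # size of the union, with no document counted twice
    return sum(len(group) for group in groups) == len(union)
```

    A grouping in which one document appears in two groups fails condition (c), which is the situation allowed by the overlapping definitions cited in the next paragraph.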

    By definition, a document can only belong to one group (Yin, Hu, Yang, Li, & Gu, 2011), but there are also definitions in the literature that allow an object to belong to more than one group (Yi-Ouyang, Yun-Ling, & AnDing-Zhu, 2007).

    The clustering problem is often considered an optimization problem, where, through measures such as entropy or silhouette, one seeks to determine the point of the search space that maximizes the differences between groups and the similarities within groups (Agustín-Blas et al., 2012; Madylova et al., 2009; Xu & Wunsch, 2005). The ideal number of groups into which the documents should be divided is one of the challenges of the problem. There are some proposed solutions that do not require initial values for its determination (Chang, Zhao, Zheng, & Zhang, 2012; Cura, 2012; Xiao, Yan, Zhang, & Tang, 2010).

    The clustering technique allows the user to find document groups of interest instead of individual documents. This reduces result overhead and also favors semantic gains, since the context (the other words included in the document) in which a word appears influences the inclusion of the document in one group or another. This contributes to distinguishing documents that contain the word jaguar (car) from those that contain the word jaguar (animal).

    On the Internet we currently find many sites that offer the clustering technique, for instance: the official website of the US government (USA.gov, 2009), whose search was developed by Vivisimo (Vivisimo Inc., 2010); Allplus (WebLib, 2011); Grokker (Groxis, 2009); KartOO (Kartoo, 2009); Yippy (Yippy Inc, 2010); and Carrot2 itself, which also offers a search site (Carrot Search, 2011b).

    Carrot2 is implemented in Java and has components for Google (Google Inc., 2011g), MSN (Microsoft Corporation, 2011a), Yahoo! (Yahoo! Inc, 2011b), Google Desktop (Google Inc., 2011h), Solr (The Apache Software Foundation, 2011d) and Lucene (The Apache Software Foundation, 2011b).

    Carrot2 is not a search engine, nor does it have crawlers or indexers. For these roles, it suggests using Nutch (The Apache Software Foundation, 2011c) for the first, and Lucene or Solr for the second. A relevant aspect involving Lucene is the fact that it, reciprocally, suggests Carrot2 for the clustering function (The Apache Software Foundation, 2011a).

    Carrot2 provides several algorithms for clustering: Lingo, STC (Suffix Tree Clustering), Rough K-means and Fuzzy Ants. The first two are especially designed for clustering search results.

    Carrot2 offers a library and a set of support applications. The library, besides the clustering functions themselves, provides tokenizers, stemmers and lists of stop words for several languages. The application suite contains the following options (Carrot Search, 2011c):

    • Carrot2 Document Clustering Workbench: a desktop application that allows quick experiments, besides being useful in identifying appropriate values for the various parameters of each available algorithm;

    • Carrot2 Document Clustering Server: a Web server that offers the clustering function as a REST (Representational State Transfer) Web service (Wikimedia Foundation, 2011g);

    • Carrot2 Web Application: a Web application for end users, identical to the one available online (Carrot Search, 2011b).

    In this work, we use the Carrot2 Document Clustering Workbench application to perform comparative tests between the ICE gadget and the Carrot2 framework. This application/framework was chosen because it is able to interact with Google Desktop, the platform chosen for developing the ICE gadget.

    Carrot2 has two types of visualization: a circle-shaped one (Circles Visualization), built with Adobe Flash Player; and another one using the Aduna Cluster Map, in an older version than that used by Aduna AutoFocus.

    Applications can be developed with the Carrot2 framework in two ways:

    • software development in Java: JAR (Java Archive) (Wikimedia Foundation, 2011f) files are used and calls are made to the Carrot2 API;

    • software development in other languages: the Carrot2 Document Clustering Server is installed and configured, and then calls are made to the server using the REST protocol.

    2.4. Ergo

    Ergo (Invu Services Ltd, 2009) is a software package for search results clustering that can work with search programs or sites, similar to Carrot2. Unlike the three previous tools, Ergo is proprietary software. Until very recently, it was possible to download an evaluation copy of this software. But in September 2010, the software was patented under the name Wagumo, available on a new website (http://www.wagumo.com).

    The program is written in J# (J Sharp) and requires several additional programs for its operation, such as the .NET Framework and SQL Server Compact, besides the installation of virtual printers. Ergo runs on Windows XP and Vista.

    Like Carrot2, Ergo can use several search sources: Google (Google Inc., 2011g), Yahoo! (Yahoo! Inc, 2011b), Flickr (Yahoo! Inc, 2011a), YouTube (Google Inc., 2011q), Wikipedia (Wikimedia Foundation, 2011i), among others. For document clustering on the desktop, we must also install Windows Desktop Search. Perhaps Windows Search, its successor, can also be used.

    Ergo has a strong visual appeal, especially in result navigation, in particular when using Flickr (Yahoo! Inc, 2011a) as the data source, when only the photos are displayed. The program uses the Windows Presentation Foundation, the graphical subsystem of the .NET Framework 3.0 (Microsoft Corporation, 2011b; Wikimedia Foundation, 2011j). There are several options to display the groups formed, some with 3D effects. Besides the search and navigation functions, the program offers an annotation feature, with a menu similar to that of Microsoft Office 2007, from which text, tables, images, etc. can be inserted and the result exported as an XPS (XML Paper Specification) file (Microsoft Corporation, 2011d).

    As for clustering, it is important to highlight that the program does not create groups with unique content: the same document can belong to several groups.

    3. ICE gadget

    The ICE gadget was created following a structure similar to that of SAGH (Genetic Analytical System of Grouping Hypertexts), a system created by Carlantonio and Costa. Fig. 1 (from Carlantonio & Costa, 2009) shows the seven modules of the SAGH system, as well as their input and output files.

    Among the peculiarities of SAGH, we highlight:

    • expanding the concept of stop words to empty stems, where any word whose stem matches one of those obtained by stemming the list of stop words is dropped;

    • super-powered population: a resource that aims to increase the quality of the clusters found, where the clustering algorithm is run several times in order to obtain a set of evolved individuals, which is used as the initial population for the last run of the algorithm;

    • creation of a differentiated p-dimensional space, where each document is entitled to supply its most frequent term, according to the sorting type chosen (tf, term frequency; idf, inverse document frequency; or tf-idf; or, in the case of the ICE gadget, tf-tf or tf-idf), for the composition of this space, discarding the repeated terms.
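
    The construction of the differentiated p-dimensional space can be pictured with the sketch below. It is a minimal reading of the description above, not the SAGH code: it assumes documents are already tokenized, uses raw term frequency (tf) as the sorting type, and breaks ties alphabetically. The function name and the tie rule are assumptions.

```python
from collections import Counter
from typing import Dict, List


def build_term_space(docs: Dict[str, List[str]]) -> List[str]:
    """Collect the top term of each document, discarding repeated terms."""
    space: List[str] = []
    for doc_id in sorted(docs):
        counts = Counter(docs[doc_id])
        if not counts:
            continue  # a document with no terms contributes nothing
        # Highest tf wins; ties are broken alphabetically (an assumption).
        top_term = min(counts, key=lambda term: (-counts[term], term))
        if top_term not in space:
            space.append(top_term)
    return space
```

    With this reading, the dimension p is at most the number of documents, and it shrinks whenever two documents supply the same term.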

    As for the sorting criterion idf (inverse document frequency), it is calculated by Eq. (1).

    $$\mathrm{idf} = \log\frac{n}{\mathit{df}} \qquad (1)$$

    where $n$ is the number of documents to be grouped and $\mathit{df}$ is the number of documents that contain the term.
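
    A minimal sketch of Eq. (1) follows. The base of the logarithm is not stated in the text, so the natural logarithm is assumed here; the function name and the tokenized input format are also assumptions.

```python
import math
from typing import Dict, List


def inverse_document_frequency(docs: List[List[str]]) -> Dict[str, float]:
    """Return idf = log(n / df) for every term, as in Eq. (1)."""
    n = len(docs)
    df: Dict[str, int] = {}
    for terms in docs:
        for term in set(terms):  # count each document at most once per term
            df[term] = df.get(term, 0) + 1
    # Natural logarithm assumed; a term present in all documents gets idf 0.
    return {term: math.log(n / count) for term, count in df.items()}
```

    A term that occurs in every document receives idf = 0, which is consistent with treating the query terms as stop words, since they appear in all documents returned by the search.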

    The clustering module (based on the technique proposed by Hruschka & Ebecken (2003)) uses Genetic Algorithms, an Artificial Intelligence technique that aims to find exact or approximate solutions to optimization problems through mechanisms inspired by evolutionary biology (Jain, Murty, & Flynn, 1999; Song, Wang, & Li, 2009; Wikimedia Foundation, 2011d), which is why we chose the name ICE - Intelligent Clustering Engine - for the gadget.

    As characteristics of the clustering algorithm, we can highlight:

    • partitioning method;
    • chromosomes of constant size during execution (see Fig. 2, from Carlantonio & Costa, 2009);
    • fitness function based on the silhouette (Wikimedia Foundation, 2011h);
    • cosine similarity (Wikimedia Foundation, 2011b);
    • stop criterion based on the number of generations;
    • use of elitism;
    • roulette-wheel selection;
    • crossover and mutation operators oriented to groups;
    • random initial population;
    • does not require any input parameter;
    • provides the number of groups and their contents.
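
    Two of the characteristics listed above, the constant-size chromosome and roulette-wheel selection, can be sketched as follows. The encoding follows the caption of Fig. 2 (one group label per document plus the number of distinct groups). Shifting the fitness by +1 so that silhouette values in [-1, 1] become non-negative weights is an assumption of this sketch, and both function names are hypothetical.

```python
import random
from typing import List, Sequence


def make_chromosome(labels: Sequence[int]) -> List[int]:
    """Encode a partition as n group labels plus the number of distinct groups."""
    return list(labels) + [len(set(labels))]


def roulette_wheel_select(population: Sequence[List[int]],
                          fitness: Sequence[float],
                          rng: random.Random) -> List[int]:
    """Pick one individual with probability proportional to shifted fitness."""
    # Silhouette-based fitness lies in [-1, 1]; adding 1 makes every weight
    # non-negative. This shift is an assumption, not taken from the paper.
    weights = [value + 1.0 for value in fitness]
    if sum(weights) == 0:
        return list(rng.choice(population))  # all weights zero: pick uniformly
    return list(rng.choices(population, weights=weights, k=1)[0])
```

    Because the chromosome always has n + 1 genes for n documents, its size does not change when groups are merged or split; only the last gene changes.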

    Regarding the fitness function, we have the following equations:

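    $$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}} \qquad (2)$$

    Which can be rewritten as:

    $$s(i) = \begin{cases}
    1 - \dfrac{a(i)}{b(i)}, & \text{if } a(i) < b(i) \\
    0, & \text{if } a(i) = b(i) \\
    \dfrac{b(i)}{a(i)} - 1, & \text{if } a(i) > b(i)
    \end{cases} \qquad (3)$$

    From the above definition, we have:

    $$-1 \le s(i) \le 1 \qquad (4)$$

    and finally:

    $$\mathrm{OF} = \frac{\sum_{i=1}^{n} s(i)}{n} \qquad (5)$$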

    where $a(i)$ is the average distance from document $i \in$ cluster $A$ to the other documents of $A$; $b(i)$ is the minimum of $d(i, C)$ over all clusters $C \neq A$, where $d(i, C)$ is the average distance from document $i \in$ cluster $A$ to the documents of $C$; and $s(i) = 0$ if the cluster has only one document. The objective function (OF) is the arithmetic mean of $s(i)$, where $n$ is the total number of documents.
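
    The objective function can be sketched directly from these definitions. The code below assumes a precomputed, symmetric distance matrix and integer group labels; returning s(i) = 0 when only one cluster exists is an added convention of this sketch, and the function name is hypothetical.

```python
from typing import Dict, List, Sequence


def silhouette_objective(dist: Sequence[Sequence[float]],
                         labels: Sequence[int]) -> float:
    """Mean silhouette s(i) over all documents, following Eqs. (2) and (5)."""
    n = len(labels)
    if n == 0:
        return 0.0
    clusters: Dict[int, List[int]] = {}
    for i, label in enumerate(labels):
        clusters.setdefault(label, []).append(i)

    def avg_dist(i: int, members: List[int]) -> float:
        return sum(dist[i][j] for j in members) / len(members)

    total = 0.0
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        others = [m for other, m in clusters.items() if other != label]
        if not own or not others:
            continue  # s(i) = 0 for singletons (and, by assumption, one cluster)
        a = avg_dist(i, own)                          # a(i)
        b = min(avg_dist(i, m) for m in others)       # b(i)
        denom = max(a, b)
        total += (b - a) / denom if denom > 0 else 0.0
    return total / n
```

    A well separated grouping yields a value close to 1, while a grouping that mixes distant documents yields a negative value, which is what the genetic algorithm exploits as fitness.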

    The cosine similarity is calculated by Eq. (6).

    $$\cos\theta = \frac{A \cdot B}{\|A\| \, \|B\|} \qquad (6)$$

    where $A$ and $B$ are the vectors that represent the documents whose similarity we want to evaluate.
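
    Eq. (6) translates into a few lines of code. In the sketch below, returning 0 for an all-zero vector and converting similarity into a distance as 1 - cos(theta) are assumptions made for illustration; the text does not state how the similarity is turned into the distances used by the silhouette.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """cos(theta) = (A . B) / (||A|| ||B||), as in Eq. (6)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for documents with no terms
    return dot / (norm_a * norm_b)


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """One common way to derive a distance from Eq. (6); an assumption here."""
    return 1.0 - cosine_similarity(a, b)
```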

    The ICE gadget is designed to run on the platform offered by Google Desktop (Google Inc., 2011h). In this first version, the system can sort and group HTML documents.

    Fig. 1. The SAGH system.

    Fig. 2. A chromosome: partitioning of the documents + number of distinct groups.

    Among some existing visualization techniques, Hyperbolic Tree (Bouthier, 2011; Bou, 2011a), Jung (O'Madadhain, Fisher, Nelson, White, & Boey, 2011), Guess (Adar, 2011), and Network Workbench (NWB Team, 2011), we adopted the hyperbolic tree technique, also called hypertree, through an implementation called Treebolic (Bou, 2011a), to show the groups found by the ICE gadget.

    This technique offers a graph visualization that is based on hyperbolic geometry. It considerably reduces the space necessary to display a tree, because it highlights the nodes that are in focus, while the others have their size compressed toward the borders (Wikimedia Foundation, 2011e). This compact form of view is crucial in the case of gadgets, because we are dealing with applets with reduced screen areas.

    The following sections give an overview of the main themes related to this work: we describe the structure of a Google gadget (3.1) and the Treebolic suite (3.2) and, finally, we discuss the details of the ICE gadget (3.3).

    3.1. Structure and creation of a gadget

    A Google Desktop gadget consists of JavaScript code, XML files, and objects and functions provided by the Google Gadget API (Google Inc., 2011c).

    The default file extension of the gadget is GG, i.e., Google gadget. This file type is, in fact, a zipped file that contains the following elements:

    (a) an XML file called gadget.gmanifest, which contains meta-information about the gadget (name, version, author, APIs used, etc.);

    (b) another XML file called main.xml, which defines the main view with the user (interface objects, their appearance properties, and function names to be called during the occurrence of certain events);

    (c) a JavaScript file called main.js, where the functions mentioned in the previous item are encoded;

    (d) images for the various states of interface objects and of icons of the gadget (formats: BMP, JPG, PNG, and GIF);

    (e) and, finally, another XML file called strings.xml, with information that will be displayed in the "about" dialog box of the gadget.
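
    Since a GG file is an ordinary zip archive, its layout can be inspected with standard tools. The sketch below uses Python's zipfile module to report which of the files named in items (a), (b), (c) and (e) are missing from a package. Matching on base names only (so that files stored in subfolders still count) is an assumption, and both helper names are hypothetical.

```python
import io
import posixpath
import zipfile
from typing import BinaryIO, Dict, List, Union

# File names taken from items (a), (b), (c) and (e) above.
REQUIRED_FILES = ("gadget.gmanifest", "main.xml", "main.js", "strings.xml")


def build_package(files: Dict[str, str]) -> io.BytesIO:
    """Zip an {archive name: text content} mapping into an in-memory package."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, content in files.items():
            archive.writestr(name, content)
    buffer.seek(0)
    return buffer


def missing_gadget_files(package: Union[str, BinaryIO]) -> List[str]:
    """Return the required file names that are absent from a .gg archive."""
    with zipfile.ZipFile(package) as archive:
        # Compare base names only, so files kept in subfolders still count.
        present = {posixpath.basename(name) for name in archive.namelist()}
    return [name for name in REQUIRED_FILES if name not in present]
```

    The same function accepts a path to a GG file on disk, because zipfile.ZipFile takes either a file name or a file-like object.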

    Other gadgets, such as those for Windows 7/Vista, have similar structures (Lee, 2008).

    In addition to those basic components, we can include other XML files to specify an options view (Avram, 2007) or a details view (Stucki, 2007), as well as JavaScript or VBScript (Visual Basic Scripting Edition) files to define the functionality of these interfaces or to organize the code (Filimon, 2008).

    A sophisticated and original visual interface can be created, as we can set images for each state of the interface objects, in addition to easily defining their transparency and rotation effects (Filimon, 2007; Schirmer, 2007; Thangaraj, 2007).

    One can also use a dynamic-link library in the gadgets, encapsulating native code inside ActiveX automation objects, creating the so-called hybrid gadgets (Olczyk, 2007), whose functionality is limited only by what the operating system defines.

    Owing to this simple yet powerful structure, the number of gadgets available (Google, Microsoft Windows, Yahoo!) is quite considerable. In the Windows Live Gallery (Microsoft Corporation, 2011c) alone, we found 5502 gadgets in English. Realizing the potential of these applets, big companies have created gadgets to promote their online content (Amazon.com, Inc, 2011; Infoglobo Comunicação e Participações, 2011).

    To create gadgets, Google Desktop provides RAD (Rapid Application Development) software called Gadget Designer (Google Inc., 2001e; Google Inc., 2011o). It creates the basic files needed and allows debugging and viewing the gadget while it runs.

    After the interfaces are created and the functions coded, the program can generate the GG file through the "build package" option in the gadget menu.

    To learn how to build gadgets more quickly, nothing is better than examples, and there are several in the Gadget Designer download package. Another possibility is the large number of gadgets offered on the Google Desktop site (Google Inc., 2011k), as well as the tutorials (Google Inc., 2011n), articles (Google Inc., 2011a) and documentation (Google Inc., 2011p), in particular the references to the gadgets API (Google Inc., 2011d) and to the query API (Google Inc., 2011m).

    3.2. The Treebolic suite

    Treebolic is a Java suite that implements hyperbolic trees (Bou, 2001b). It offers several features, as well as very practical navigation. To use it, we embed the Java applet in an HTML file and describe the groups using the XML format defined by the project.

    Two programs stand out among those in the installation package for understanding the use of Treebolic: the Treebolic Demo, which helps the user become familiar with the functionality offered; and the Treebolic Generator, which shows how the features are stored in XML files, which is fundamental for the creation of trees at runtime.

    The specification of the tree includes several items: status bar, toolbar, pop-up menu, nodes, etc. The tree can be split into several separate files, which makes it possible to mount or unmount the subtrees during the visualization.

    In the pop-up menu, we can include several options, among which we highlight the option to search for a node with the specified text, by the criteria: starts with, includes, or equals.

    In relation to the nodes, besides the label and the content, we can also set colors, images and links to sites or local files.

    3.3. The ICE gadget

    The ICE gadget is a document clustering tool for desktop computers. It interacts with Google Desktop, grouping the results returned by this indexer. Its compact and rich visualization interface provides much information about the clusters formed and their contents.

    Besides the operators offered by Google Desktop, we can choose between the tf-tf and tf-idf weights, and apply the super-powered population to improve the results.

    Figs. 3 and 4 show the ICE gadget interface and its visualization in a search example.

    The concept of visualization is based on the overview-detail idea, where files can be loaded easily and are displayed in separate windows, leaving the visualization of the tree unchanged, so that the user does not lose the context in which the document is placed. The nodes of the tree can be put into focus by a simple click, or automatically by enabling the option "hovering triggers focus" and then briefly positioning the mouse cursor on the node. Cluster nodes can be mounted and unmounted on demand, enabling a compact display targeted to the user's interests, making navigation easier and the identification of relevant files faster.

    Fig. 3. ICE gadget.

    In its visualization, the ICE gadget offers, on a color scale, a measure of relative distance for the documents, which takes into account two extremes: the centroid of the group and the document farthest from the centroid. This information is also available in the tooltip through the parameter RCU (ratio centroid to ultimate document), which ranges from 0 to 1. The visualization also presents the principal terms of the cluster and the number of documents within it.

    For documents, we also have the title, the snippet that Google Desktop has provided for the file, its principal terms and its location (link). The terms used in the search appear in the Google Desktop logo (in this example, cocoa, in the center); they are incorporated into the list of stop words at runtime, because they become irrelevant to the clustering process, since they will be in all documents returned by Google Desktop.
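
    One plausible reading of the RCU parameter described above is the distance from a document to the centroid of its group, divided by the distance from the farthest document of that group to the centroid. The sketch below follows this reading; the exact formula, the use of Euclidean distance, and the function names are assumptions, since the text only states the two extremes and the 0 to 1 range.

```python
import math
from typing import List, Sequence


def centroid(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Component-wise mean of the document vectors of one group."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def rcu_values(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Distance to the centroid, scaled so the farthest document gets 1."""
    center = centroid(vectors)
    # Euclidean distance is an assumption; any distance could be plugged in.
    distances = [math.dist(vector, center) for vector in vectors]
    farthest = max(distances)
    if farthest == 0.0:
        return [0.0 for _ in distances]  # all documents coincide with the centroid
    return [d / farthest for d in distances]
```

    Under this reading, a value near 0 marks a document that is typical of its group and a value of 1 marks the document at the edge of the group, which maps naturally onto a color scale.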

    It is possible to search for a node (cluster or document) containing certain text in the label or in the content (principal terms, snippet, RCU and link, in the case of documents; and principal terms, in the case of clusters). The status bar and the toolbar can be repositioned on the interface or even detached from the window, extending the usable area available to the tree in the visualization window.

    We decided not to show in the visualization the groups that have only one document, nor to include them in the group "others". The activity diagram of the ICE gadget can be seen in Fig. 5.

    The ICE gadget is more demanding in terms of processor than in terms of memory. The testing machine used was a Core 2 Quad Q6600 with 2 GB of RAM, which offered a very good run time; of course, this depends on the number of files returned by the query.

    The most sensitive stages of the process are:

    • creation of the term vectors, when the number of files is large and/or when they have many words;

    • genetic analysis of clusters, when the number of files is large.

    A possible improvement of the gadget would be to create the term vectors of all HTML files available on the computer during the intervals when the processor is idle, similar to what Google Desktop does with regard to indexing. That would eliminate much of the sensitivity of this step.

    Of course, it would be necessary to adapt this module so that the dictionary file is created only after the query is submitted, because to use the tf-idf weight this file must be created taking into account just the words (and their occurrences) of the documents retrieved by the query.

    The clustering algorithm has more difficulty in separating the documents if they have many terms in common, so the weights that generate a greater distance between the documents are more suitable.

    During some tests, we found that using a non-trivial weight, the tf-tf weight, usually provides interesting results. What happens in practice when using this weight is that the terms that occur only once in the documents have their relevance reduced even more.

    This weight was chosen as the default due to the good results achieved, its simplicity and its lower computational cost when compared to tf-idf. When the tf-tf weight does not separate the documents, one can try using the tf-idf weight or the super-powered population feature with either weight.

    One feature that differentiates these weights is that the tf-tf weight tends to separate the documents more, providing a greater number of groups.

    4. Experimental results

    For the tests, we used a subset of the Reuters database, Reuters-21578 (Jiang, Pang, Wu, & Kuang, 2012; Lewis, 1997).

    As the Reuters-21578 data set is very large, with 21,578 documents in 135 categories (topic field), we made the following cuts:

    • from the data set, we calculated the average number of characters in the body of the documents (body field) and selected those documents with a number of characters greater than the average (835.5719) in this field, yielding 10,369 documents;

    • then, we excluded the documents that do not have the topic field filled out, reducing the number to 3263 documents. With this procedure, the number of distinct categories changed from 135 to 73;

    • next, we eliminated those categories that have fewer than about 30 documents and those with more than about 100 documents;

    • from this subset, we divided the documents into three parts, trying to compose one set with the categories that have about 30 documents, another with about 50, and a third with about 100 documents.
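
    The first three cuts above can be sketched as a small filtering pipeline. The record layout (dictionaries with "body" and "topic" keys), the single topic per document, and the exact thresholds standing in for "about 30" and "about 100" are assumptions of this sketch; the function name is hypothetical.

```python
from collections import Counter
from typing import Dict, List


def select_documents(docs: List[Dict[str, str]],
                     min_per_topic: int = 30,
                     max_per_topic: int = 100) -> List[Dict[str, str]]:
    """Apply the length, topic and category-size cuts in sequence."""
    if not docs:
        return []
    # Cut 1: keep documents whose body is longer than the average body length.
    mean_length = sum(len(d["body"]) for d in docs) / len(docs)
    kept = [d for d in docs if len(d["body"]) > mean_length]
    # Cut 2: drop documents whose topic field is empty.
    kept = [d for d in kept if d["topic"]]
    # Cut 3: keep only categories whose size lies within the given bounds.
    sizes = Counter(d["topic"] for d in kept)
    return [d for d in kept
            if min_per_topic <= sizes[d["topic"]] <= max_per_topic]
```

    The final split into sets of about 30, 50 and 100 documents per category was done by choosing categories, as described next, so it is not automated here.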

    With this, we chose the following categories, limited to three per set, to form the test data, depending on the number of documents in the categories (in brackets), as shown in Table 1.

    The reasoning involved in choosing the categories that compose the varied data set was based on the fact that cocoa and coffee are exportable agricultural products and the ship category is closely related to export. Since the categories bear some similarities, this data set is possibly more difficult to cluster.

    After identifying these subsets, we created HTML files, one for each document, containing their body and title fields. We waited for Google Desktop to index the four folders that defined the data sets and then we started the testing.
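    The creation of one HTML file per document can be sketched as below. The file naming scheme, the HTML layout and the dictionary keys "title" and "body" are assumptions made for this illustration; they are not taken from the tool used in the experiments.

```python
import html
from pathlib import Path


def write_html_files(docs, folder):
    """Write one HTML file per document (title and body) into `folder`."""
    out = Path(folder)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, doc in enumerate(docs, start=1):
        # Escape the text so characters such as '&' or '<' stay valid HTML.
        title = html.escape(doc["title"])
        body = html.escape(doc["body"])
        page = (
            f"<html><head><title>{title}</title></head>"
            f"<body><h1>{title}</h1><p>{body}</p></body></html>"
        )
        path = out / f"doc_{i:04d}.html"  # hypothetical naming scheme
        path.write_text(page, encoding="utf-8")
        paths.append(path)
    return paths
```

    Calling the function once per data set, each time with a different folder, would give the four folders that the desktop search tool then indexes.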

    Early on, we encountered the following question: which term, or terms, to choose as the query? We realized that the choice that would least influence the results would be, precisely, to not choose any term, but to allow the programs to evaluate the documents in their entirety, ignoring, of course, the stop words (Cutting, Karger, Pedersen, & Tukey, 1992). So, the searches were made in the format: under:path.

    Fig. 4. ICE Visualization.

    L. M. Carlantonio et al. / Expert Systems with Applications 39 (2012) 9524-9533 9529

    Because of the difference in the type of clustering conducted by the programs, where the ICE gadget creates groups with unique content and Carrot2 does not (its results overlap), we adopted the following methodology:

    - we decided that a cluster belongs to a category according to the predominance of documents;

    - in case of a draw, the cluster is ignored;

    - we accepted the division of a category into more than one group, according to the previous items;

    - groups that have only one element are not considered;

    - the group labeled "other topics" in Carrot2 is also not considered;

    - we calculated the percentage of accuracy in the largest cluster of the category and the percentage of accuracy involving all groups of the category, i.e., the percentage of documents of the X category that is in the largest group assigned to it and the percentage of documents of the X category that is in all the groups that were assigned to it.
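    The rules above can be sketched in a few lines. The sketch reads each percentage as the share of a group's documents that belong to the assigned category, which appears consistent with the reported tables (for instance, 85% (17) would correspond to 17 documents of the category in a group of 20); this reading, the function name and the data layout are assumptions made for illustration.

```python
from collections import Counter


def evaluate(clusters, labels, ignored=("other topics",)):
    """clusters: name -> list of doc ids; labels: doc id -> true category."""
    assigned = {}  # category -> list of (docs of the category, group size)
    for name, docs in clusters.items():
        if name in ignored or len(docs) < 2:
            continue  # skip "other topics" and single-element groups
        counts = Counter(labels[d] for d in docs).most_common()
        if len(counts) > 1 and counts[0][1] == counts[1][1]:
            continue  # draw between categories: the cluster is ignored
        category, hits = counts[0]
        assigned.setdefault(category, []).append((hits, len(docs)))

    report = {}
    for category, groups in assigned.items():
        lg_hits, lg_size = max(groups)  # group holding most docs of category
        total_hits = sum(h for h, _ in groups)
        total_size = sum(s for _, s in groups)
        report[category] = {
            "lg_docs": lg_hits,
            "lg_pct": 100.0 * lg_hits / lg_size,
            "total_pct": 100.0 * total_hits / total_size,
        }
    return report
```

    Because a category may be split over several groups, the "total" figure pools every group assigned to it, while the "LG" figure looks only at the group that holds the most documents of that category.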

    We used the default values of Carrot2. As for ICE, we used the default weight, tf-tf, together with the super-powered population feature.

    4.1. Small data set

    For the data set containing the small categories, the ICE gadget found five groups, grouping 89 of the 95 existing documents. Carrot2 found 28 groups (not including the group "other topics"). Of the 28 groups, seven had only two documents.

    The counting of the grouped documents cannot be done easily in Carrot2, because its clustering can be overlapping. This feature makes the comparison of results difficult.

    Ignoring the documents placed in the group "other topics" (13), the number of documents that are inside other groups is 145, although there are only 95 documents in the data set.
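    The gap between the sum of the group sizes and the number of documents is what an overlapping clustering produces. A minimal sketch of the count, with a hypothetical function name and data layout:

```python
def membership_stats(clusters, ignored=("other topics",)):
    """Return (sum of group sizes, number of distinct grouped documents)."""
    kept = [docs for name, docs in clusters.items() if name not in ignored]
    memberships = sum(len(docs) for docs in kept)
    distinct = len(set().union(*kept)) if kept else 0
    return memberships, distinct


# Toy example: documents 2 and 3 appear in two groups each.
print(membership_stats({"a": [1, 2, 3], "b": [2, 3, 4], "other topics": [5]}))
```

    With non-overlapping groups, as produced by ICE, both numbers coincide; with overlapping groups the first can exceed the size of the data set, as in the 145 against 95 reported above.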

    Table 2 shows the results, according to the adopted methodology, where the numbers in brackets represent:

    - in column headers: number of documents in each category;

    - others: the greater number of documents, of a given category, that were put together, i.e., its largest group (LG).

    Fig. 5. ICE Activity diagram. [Figure not reproduced. Swimlanes: User, Google Desktop, ICE. Activities shown: Specifies the Query; Submits the Query; Searches Results; Reports that there are no Results; Records the Snippets; Records the File List; Creates Vectors of Terms; Classifies the Vectors; Creates the Dimension; Creates the Matrix; Normalizes the Matrix; Cluster; Generates the Visualization; Visualizes the Results.]

    Table 1

    Test data created.

    Small (4.1)    Medium (4.2)   Large (4.3)      Varied (4.4)
    Oilseed (28)   Gold (50)      Gnp (92)         Cocoa (35)
    Bop (32)       Coffee (67)    Ship (92)        Coffee (67)
    Cocoa (35)     Sugar (68)     Interest (115)   Ship (92)
    Total: 95      Total: 185     Total: 299       Total: 194

    In comparison, Carrot2 performed better than the ICE gadget in only one of six items, besides the great difference in the number of clusters found, 28 against 5 for ICE. As for the largest group, the ICE gadget grouped 70% more documents than Carrot2 (bop category).
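    The "more documents" figures used in this section follow from the largest-group counts in the result tables. As a small worked check, using a hypothetical helper name:

```python
def percent_more(ice_docs, carrot_docs):
    """Relative gain of the ICE largest group over the Carrot2 one, in %."""
    return 100.0 * (ice_docs - carrot_docs) / carrot_docs


# Bop category, small data set: 17 documents (ICE) against 10 (Carrot2).
print(percent_more(17, 10))
```

    For the bop category this gives (17 - 10) / 10 = 70%, the value quoted above.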

    4.2. Medium data set

    In the data set containing the medium categories, the ICE gadget created seven groups, clustering 181 of the 185 existing documents.

    Carrot2 created 40 groups, whose sum provides 168 documents (not including the group "other topics"). Of the 40 groups, 17 had only two documents. The group "other topics" had 60 documents. Table 3 summarizes the results of this data set.

    Comparing the results, the ICE gadget was worse than Carrot2 in only three of six items. In relation to the largest group, it joined 120% more documents than Carrot2 in the coffee category, and 464% more in the sugar category.

    Again, the difference in the number of clusters found was significant, seven for ICE against 40 for Carrot2.

    One issue (not shown) that caught our attention was the fact that Carrot2 grouped only 26 (or fewer, because of repetition) of the 50 documents in the gold category.

    4.3. Large data set

    The ICE gadget created exactly three groups, clustering 297 of the existing 299 documents, in the test involving the large categories.

    Carrot2 generated 50 groups (not including the group "other topics"), whose sum was 288 documents. Thirteen of the 50 groups had only two documents. The group "other topics" aggregated 103 documents.

    Table 4 shows the results for this data set.

    Analyzing the results, it was noticed that the ICE gadget had a better result when compared with Carrot2 in only one of the six items. But the difference in the number of documents grouped for the largest group was significant, with ICE obtaining 300% more documents than Carrot2 for the gnp category, 988% more in the ship category, and 339% more in the interest category.

    We emphasize the difference in the number of groups found, which was the largest of all, three for ICE against 50 for Carrot2.

    4.4. Varied data set

    In the test case involving the data set containing categories with varied sizes, ICE produced five groups, using 190 of the 194 existing documents.

    As for Carrot2, it provided 40 clusters (not including the group "other topics"), summing 192 documents. In the group "other topics" 61 documents were placed. Seventeen of the 40 groups had only two documents.

    The results are presented in Table 5.

    In this last test, Carrot2 performed better than ICE in only two of six items, but again with large differences in relation to the documents included in the largest groups. The ICE gadget grouped 121% more documents in the case of the coffee category, and 988% more in the case of the ship category.

    Again, we noted that Carrot2 grouped few documents of one category, the ship category, where only 56 (or fewer, because of repetition) of the 92 documents were taken into account (not shown).

    The number of groups found also deserves to be highlighted, five for ICE against 40 for Carrot2, the second largest difference that was obtained.

    5. Conclusions

    In this work, we proposed, presented and evaluated a clustering gadget for Google Desktop, called ICE (Intelligent Clustering Engine).

    The main contribution of this work was to develop a new tool to improve the quality of results offered by Google Desktop, by using the clustering technique.

    Comparing the results of ICE with those offered by Carrot2, it was shown that the ICE gadget can find a number of groups much closer to the reality of the bases tested, not spreading the documents among many small groups, promoting understanding of the relationship between the groups more clearly than Carrot2, and speeding up the retrieval of the desired information. Our assessment shows that in the experimental results the ICE gadget is able to group similar documents.

    The tf-tf weight, embedded in the gadget, proved to be very useful to obtain large similar groups.

    The fact that Carrot2 only considers the text snippets returned by Google Desktop, although fundamental in targeted searches to sites in the Internet, is a disadvantage when it comes to desktop search because, as the files are easily accessible, a clustering that takes into account the entire contents of the file tends to provide far more accurate results.

    Table 2

    Small data set.

                    Oilseed (28)   Bop (32)     Cocoa (35)
    ICE LG          100% (26)      85% (17)     100% (32)
    Carrot2 LG      80% (8)        100% (10)    100% (17)
    ICE Total       100%           90.32%       100%
    Carrot2 Total   98.90%         88.88%       100%

    Table 3

    Medium data set.

                    Gold (50)      Coffee (67)    Sugar (68)
    ICE LG          100% (44)      98.46% (64)    98.41% (62)
    Carrot2 LG      100% (7)       100% (29)      100% (11)
    ICE Total       97.96%         98.51%         98.46%
    Carrot2 Total   92.86%         91.67%         100%

    Table 4

    Large data set.

                    Gnp (92)       Ship (92)      Interest (115)
    ICE LG          82.61% (76)    98.86% (87)    86.32% (101)
    Carrot2 LG      100% (19)      100% (8)       100% (23)
    ICE Total       82.61%         98.86%         86.32%
    Carrot2 Total   89.58%         90.16%         88.24%

    Table 5

    Varied data set.

                    Cocoa (35)     Coffee (67)    Ship (92)
    ICE LG          100% (30)      96.97% (64)    96.66% (87)
    Carrot2 LG      100% (18)      100% (29)      100% (8)
    ICE Total       100%           97.06%         96.73%
    Carrot2 Total   97.36%         95.35%         96.55%


    Another disadvantage of Carrot2 is that the technique employed does not generate clusters with unique elements, allowing the same document to belong to more than one cluster. The same happens with Ergo, and this compares poorly with what Aduna AutoFocus proposes to do.

    Suggestions for future work involve extending the system to the clustering of other types of files, as well as other languages. Another possibility would be to obtain access to the indexing archives of Google Desktop, which would allow clustering any type of indexed document with this application (Broder, Glassman, Manasse, & Zweig, 1997).

    References

    Adar, E. (2011). GUESS: The graph exploration system. Visited September 2011.

    Aduna (2009a). Aduna AutoFocus. Visited July 2009.

    Aduna (2009b). Aduna Cluster Map Library. Visited July 2009.

    Aduna (2011). Search Aduna open source wiki. Visited September 2011.

    Agustín-Blas, L. E., Salcedo-Sanz, S., Jiménez-Fernández, S., Carro-Calvo, L., Del Ser, J., & Portilla-Figueras, J. A. (2012). A new grouping genetic algorithm for clustering problems. Expert Systems with Applications, 39, 9695-9703.

    Amazon.com, Inc. (2011). Amazon.com Associates Central Widgets. Visited September 2011.

    Avram, C. (2007). Using the options dialog - Google Desktop APIs - Google Code. Visited September 2011.

    Bou, B. (2011a). Treebolic. Visited September 2011.

    Bou, B. (2011b). treebolic | Download treebolic software for free at SourceForge.net. Visited September 2011.

    Bouthier, C. (2011). Hypertree Java Library. Visited September 2011.

    Broder, A. Z., Glassman, S. C., Manasse, M. S., & Zweig, G. (1997). Syntactic clustering of the web. Computer Networks and ISDN Systems, 29, 1157-1166.

    Carlantonio, L. M., & Costa, R. M. E. M. (2009). Exploring a genetic algorithm for hypertext documents clustering. In N. Nedjah, L. de Macedo Mourelle, J. Kacprzyk, F. M. G. França, & A. F. de Souza (Eds.), Intelligent text categorization and clustering. Studies in computational intelligence (Vol. 164, pp. 95-117). Berlin/Heidelberg: Springer.

    Carrot Search (2011a). Carrot2 open source search results clustering engine. Visited September 2011.

    Carrot Search (2011b). Carrot2 clustering engine. Visited September 2011.

    Carrot Search (2011c). Carrot2 user and developer manual for version 3.6.0-dev. Visited September 2011.

    Chang, D., Zhao, Y., Zheng, C., & Zhang, X. (2012). A genetic clustering algorithm using a message-based similarity measure. Expert Systems with Applications, 39, 2194-2202.

    Cura, T. (2012). A particle swarm optimization approach to clustering. Expert Systems with Applications, 39, 1582-1588.

    Cutting, D. R., Karger, D. R., Pedersen, J. O., & Tukey, J. W. (1992). Scatter/gather: A cluster-based approach to browsing large document collections. In Proceedings of the 15th annual international ACM SIGIR conference on research and development in information retrieval (pp. 318-329). New York, NY, USA: ACM.

    Filimon, T. (2007). Desktop gadgets: Rotating objects - Google Desktop APIs - Google Code. Visited September 2011.

    Filimon, T. (2008). Using parameters in desktop gadget programming - Google Desktop APIs - Google code. Visited September 2011.

    Google Inc. (2010). Ten ways Gmail makes email easy and efficient. And maybe even fun. Visited September 2010.

    Google Inc. (2011a). Articles - Google Desktop APIs - Google Code. Visited September 2011.

    Google Inc. (2011b). Basics: search operators - desktop for windows help. <http://desktop.google.com/support/bin/answer.py?hl=en&answer=10111>. Visited September 2011.

    Google Inc. (2011c). Creating a gadget - Google Desktop APIs - Google code. Visited September 2011.

    Google Inc. (2011d). Gadget API reference - Google Desktop APIs - Google code. Visited September 2011.

    Google Inc. (2011e). Gadget designer - Google Desktop APIs - Google code. Visited September 2011.

    Google Inc. (2011f). Gmail: Email from Google. Visited September 2011.

    Google Inc. (2011g). Google. Visited September 2011.

    Google Inc. (2011h). Google Desktop. Visited September 2011.

    Google Inc. (2011i). Google Desktop features. Visited September 2011.

    Google Inc. (2011j). Google Desktop gadgets. Visited September 2011.

    Google Inc. (2011k). Google Desktop gadgets. Visited September 2011.

    Google Inc. (2011l). iGoogle. Visited September 2011.

    Google Inc. (2011m). Query API developer guide - Google Desktop APIs - Google code. Visited September 2011.

    Google Inc. (2011n). Tutorials - Google Desktop APIs - Google Code. Visited September 2011.

    Google Inc. (2011o). Using gadget designer - Google Desktop APIs - Google code. Visited September 2011.

    Google Inc. (2011p). Welcome - Google Desktop APIs - Google code. Visited September 2011.

    Google Inc. (2011q). YouTube - broadcast yourself. Visited September 2011.

    Groxis (2009). Grokker enterprise search management and content integration. Visited July 2009.

    Hruschka, E. R., & Ebecken, N. F. F. (2003). A genetic algorithm for cluster analysis. Intelligent Data Analysis, 7, 15-25.

    Infoglobo Comunicação e Participações S.A. (2011). O site O Globo:: Widgets. Visited September 2011 (in Portuguese).

    Invu Services Ltd. (2009). Ergo download. Visited July 2009.

    Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data clustering: A review. ACM Computing Surveys, 31, 264-323.

    Jiang, S., Pang, G., Wu, M., & Kuang, L. (2012). An improved K-nearest-neighbor algorithm for text categorization. Expert Systems with Applications, 39, 1503-1509.

    Kartoo S. A. (2009). KartOO: The first interface mapping metasearch engine. Visited July 2009.

    Lee, W.-M. (2008). Professional Windows Vista gadgets programming. Indianapolis, Indiana, USA: Wiley Publishing, Inc.

    Lewis, D. D. (1997). Reuters-21578 text categorization test collection distribution 1.0. Visited September 2011.

    Liu, Y. C., Wu, C., & Liu, M. (2011). Research of fast SOM clustering for text information. Expert Systems with Applications, 38, 9325-9333.

    Madylova, A., & Öğüdücü, S. G. (2009). A taxonomy based semantic similarity of documents using the cosine measure. In Proceedings of the 24th international symposium on computer and information sciences (pp. 129-134). Washington, DC, USA: IEEE.

    Manning, C. D., Raghavan, P., & Schütze, H. (2008). Flat clustering. Introduction to Information Retrieval. New York, NY, USA: Cambridge University Press. chapter 16.

    Microsoft Corporation (2011a). MSN.com. Visited September 2011.

    Microsoft Corporation (2011b). The official Microsoft WPF and Windows forms site. Visited September 2011.

    Microsoft Corporation (2011c). Windows live gallery. Visited September 2011.

    Microsoft Corporation (2011d). XML paper specification: Overview. Visited September 2011.

    NWB Team (2011). Network workbench | welcome. Visited September 2011.

    Olczyk, K. (2007). Going beyond script: Developing hybrid desktop gadgets - Google Desktop APIs - Google code. Visited September 2011.

    O'Madadhain, J., Fisher, D., Nelson, T., White, S., & Boey, Y.-B. (2011). JUNG - Java Universal Network/Graph Framework. Visited September 2011.

    Schirmer, B. (2007). Let the user choose your gadget's opacity - Google Desktop APIs - Google code. Visited September 2011.

    Song, W., Wang, S. T., & Li, C. H. (2009). Parametric and nonparametric evolutionary computing with a content-based feature selection approach for parallel categorization. Expert Systems with Applications, 36, 11934-11943.

    Song, W., Choi, L. C., Park, S. C., & Ding, X. F. (2011). Fuzzy evolutionary optimization modeling and its applications to unsupervised categorization and extractive summarization. Expert Systems with Applications, 38, 9112-9121.

    Stucki, Y. (2007). Details views and YouTube videos in desktop gadgets - Google Desktop APIs - Google code. Visited September 2011.

    Thangaraj, B. (2007). Animation: Add life to your desktop gadget - Google Desktop APIs - Google code. Visited September 2011.


    The Apache Software Foundation (2011a). LuceneFAQ Lucene-java Wiki. Visited September 2011.

    The Apache Software Foundation (2011b). Welcome to Apache Lucene! Visited September 2011.

    The Apache Software Foundation (2011c). Welcome to Apache Nutch. Visited September 2011.

    The Apache Software Foundation (2011d). Welcome to Solr. Visited September 2011.

    USA.gov (2009). USA.gov: The US Government's official web portal. Visited July 2009.

    Vivisimo Inc. (2010). Enterprise search provider - Federated search, social search, clustering | Vivisimo, Inc. Visited September 2010.

    WebLib (2011). AllPlus Universal meta search and discovery engine. Visited September 2011.

    Wikimedia Foundation (2011a). Clustering - Wikipédia, a enciclopédia livre. Visited September 2011 (in Portuguese).

    Wikimedia Foundation (2011b). Cosine similarity - Wikipedia, the free encyclopedia. Visited September 2011.

    Wikimedia Foundation (2011c). Gadget - Wikipédia, a enciclopédia livre. Visited September 2011 (in Portuguese).

    Wikimedia Foundation (2011d). Genetic algorithm - Wikipedia, the free encyclopedia. Visited September 2011.

    Wikimedia Foundation (2011e). Hyperbolic tree - Wikipedia, the free encyclopedia. Visited September 2011.

    Wikimedia Foundation (2011f). Java archive - Wikipédia, a enciclopédia livre. Visited September 2011 (in Portuguese).

    Wikimedia Foundation (2011g). REST - Wikipédia, a enciclopédia livre. Visited September 2011 (in Portuguese).

    Wikimedia Foundation (2011h). Silhouette (clustering) - Wikipedia, the free encyclopedia. Visited September 2011.

    Wikimedia Foundation (2011i). Wikipedia. Visited September 2011.

    Wikimedia Foundation (2011j). Windows presentation foundation - Wikipedia, the free encyclopedia. Visited September 2011.

    Xiao, J., Yan, Y., Zhang, J., & Tang, Y. (2010). A quantum-inspired genetic algorithm for k-means clustering. Expert Systems with Applications, 37, 4966-4973.

    Xu, R., & Wunsch, D. (2005). Survey of clustering algorithms. IEEE Transactions on Neural Networks, 16, 645-678.

    Yahoo! Inc. (2011a). Welcome to Flickr Photo sharing. Visited September 2011.

    Yahoo! Inc. (2011b). Yahoo! Visited September 2011.

    Yin, M., Hu, Y., Yang, F., Li, X., & Gu, W. (2011). A novel hybrid K-harmonic means and gravitational search algorithm approach for clustering. Expert Systems with Applications, 38, 9319-9324.

    Yi-Ouyang, Y.-O., Yun-Ling, Y.-L., & AnDing-Zhu, A.-Z. (2007). EHM-based web pages fuzzy clustering algorithm. In Proceedings of the 2007 international conference on multimedia and ubiquitous engineering (pp. 561-566). Washington, DC, USA: IEEE.

    Yippy Inc. (2010). Yippy. Visited September 2010.
