Plateforme de Recommandation Distribuée et Diversifiée pour les ...

6
HAL Id: lirmm-01088733 https://hal-lirmm.ccsd.cnrs.fr/lirmm-01088733 Submitted on 28 Nov 2014 HAL is a multi-disciplinary open access archive for the deposit and dissemination of sci- entific research documents, whether they are pub- lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers. L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés. Plateforme de Recommandation Distribuée et Diversifiée pour les Sciences Citoyennes Maximilien Servajean, Esther Pacitti, Miguel Liroz-Gistau, Alexis Joly, Julien Champ To cite this version: Maximilien Servajean, Esther Pacitti, Miguel Liroz-Gistau, Alexis Joly, Julien Champ. Plateforme de Recommandation Distribuée et Diversifiée pour les Sciences Citoyennes. BDA: Bases de Données Avancées, Oct 2014, Grenoble-Autrans, France. 2014, Gestion de Données – Principes, Technologies et Applications. <http://bda2014.imag.fr>. <lirmm-01088733>

Transcript of Plateforme de Recommandation Distribuée et Diversifiée pour les ...

HAL Id: lirmm-01088733https://hal-lirmm.ccsd.cnrs.fr/lirmm-01088733

Submitted on 28 Nov 2014

HAL is a multi-disciplinary open accessarchive for the deposit and dissemination of sci-entific research documents, whether they are pub-lished or not. The documents may come fromteaching and research institutions in France orabroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, estdestinée au dépôt et à la diffusion de documentsscientifiques de niveau recherche, publiés ou non,émanant des établissements d’enseignement et derecherche français ou étrangers, des laboratoirespublics ou privés.

Plateforme de Recommandation Distribuée etDiversifiée pour les Sciences Citoyennes

Maximilien Servajean, Esther Pacitti, Miguel Liroz-Gistau, Alexis Joly, JulienChamp

To cite this version:Maximilien Servajean, Esther Pacitti, Miguel Liroz-Gistau, Alexis Joly, Julien Champ. Plateformede Recommandation Distribuée et Diversifiée pour les Sciences Citoyennes. BDA: Bases de DonnéesAvancées, Oct 2014, Grenoble-Autrans, France. 2014, Gestion de Données – Principes, Technologieset Applications. <http://bda2014.imag.fr>. <lirmm-01088733>

PlantRT : a Distributed and DiversifiedRecommendation Tool for Citizen Sciences∗

Plateforme de Recommandation Distribuee et Diversifiee pour les Sciences Citoyennes

Maximilien Servajean∗ , Esther Pacitti∗, Miguel Liroz-Gistau†, Alexis Joly† and Julien Champ‡∗Inria & Lirmm, University of Montpellier, Email : {servajean, pacitti}@lirmm.fr

†Inria & Lirmm, Email : {miguel.liroz gistau, alexis.joly}@inria.fr‡Inria & Lirmm, Email : [email protected]

Resume— De nombreux domaines scientifiques pro-duisent et consomment des volumes de donnees consi-derables (e.g. biologie, astronomie, physique). Dans cetravail, nous nous interessons au traitement de donneesde types images, produites a grande echelle par desbotanistes dans le cadre de Pl@ntNet 1. L’objectif denotre prototype, PlantRT, est de permettre l’etude del’evolution et des correlations de familles de plantesdiverses. Chaque image, ou observation, est ainsi pro-duite de facon personnelle – par chaque citoyen ou bo-taniste participant au projet –, et peut etre stockee surdifferents types de sites (e.g. ordinateurs personnels,smartphones, serveurs, clouds). Par ailleurs, l’emer-gence des systemes de recommandation distribues fa-vorise le partage, la decouverte et la mise en relationde ces donnees produites par chaque citoyen implique.Cet article presente PlantRT un prototype de systemede recommandation multi-sites, prenant en compte ladiversite des profils des citoyens, ou utilisateurs, etdes donnees, favorisant, par exemple, la decouverte denouvelles especes de plantes d’une meme famille oude la meme zone geographique ainsi que le passage al’echelle. Nous montrons ici, en particulier, plusieurscas d’utilisations ainsi que le deploiement de PlantRTen details.

I. Introduction

Many fields of science are currently massive producersof diverse data items, or items (e.g. images, experimentaldatasets, plant observations). For instance, in botany, theemergence of citizen sciences has fostered the creation oflarge and structured communities of nature observers (e.g.e-bird 2, xeno-canto 3, Tela Botanica 4, Pl@ntNet [3]) whostarted to produce outstanding collections of image records(i.e. items). These images are captured by volunteer natureobservers, and stored in different and heterogeneous sites(e.g. PCs, Smartphones, Clouds, Serveurs). Building anaccurate knowledge of the identity, the geographic dis-tribution and the evolution of living species is essentialfor a sustainable development of humanity as well as

∗Work conducted within the Institut de Biologie Computation-nelle and partially funded by the labex NUMEV and the CNRSproject Mastodons.

1. http://www.plantnet-project.org2. http://ebird.org3. http://www.xeno-canto.org4. http://www.tela-botanica.org

for biodiversity conservation. However, scaling up suchcollaborative approaches to real-world ecological surveil-lance systems involving millions of contributors is stillchallenging.

The emergence of distributed search and recommenda-tion systems favors personalized retrieval of items (e.g.images). In such systems, each user shares a set of itemsin its local node. Then, given a user u and a query q –generally keyword based –, the goal is to recommend itemsin a personalized way. Generally, the recommended itemsare chosen based only on the query initiator’s profile andon the query itself. Diversified search and recommendationis a personalized method [6] that, given a query q and thequery initiator u, recommends items taking into accountthe query, u’s profile, and the diversification of both therelevant items and the profiles of the users sharing them. Inthe context of Pl@ntNet, user profiles are defined based onthe observations made by each user. Diversification enablesthe discovery of items referring to plants from differentspecies within the same family, or concerning the samegeographical area. This is very useful for botanists whowish to understand the plants biodiversity.

In [2] we proposed a distributed peer-to-peer solutionfor search and recommendation with no diversification.Here we propose a generic multi-site approach in the sensethat sites may be heterogeneous (e.g. PCs, Smatphones,Clouds). As presented, each user has her own profile basedon the items she shares. However, for multi-users sites(i.e. Clouds, Servers), we propose to exploit the conceptof virtual nodes. That is, in multi-users sites, in additionto the users’ profiles, we adopt the approach of clusteringthem through k-means [1]. Therefore each multi-users sitewill also be composed of k virtual nodes’ profiles (i.e.means). The set of users that are similar with a specificmeans, defines a virtual node. Virtual nodes are useful todefine the logical overlay topology for distributed searchand recommendation and may be implemented as a simplethread. In addition, virtual nodes enables to index itemsin a personalized and efficient manner [1].

In our demo, we show in details how PlantRT wasdeployed over 10 multi-sites (i.e. Clouds’ virtual machinesand a PC) using virtual nodes. Our demo uses a dataset

ITEMS

Amazon WS Site

Google App Engine SiteLocal Computer Site

VN1

VN2

Windows Azure Site

VN1's Virtual Network

VN2's Virtual Network

Figure 1: Example of a PlantRT multi-site network with 4 sites.

provided by Tela Botanica containing more than 10, 000observations produced by 1, 500 real citizens. We placeditems on different sites and we demonstrate the behaviourof PlantRT with different types of queries, submitted byusers having various profiles, to show that our distributedrecommendation approach provides good results in termsof diversification, recall and performance. For this purpose,we create several scenarios varying the number of users(and virtual nodes), sites and the size of the corpus. Inaddition, we also demonstrate the efficiency of the dis-tributed TF×IDF algorithm, necessary to index the itemsof the distributed corpus using the images meta-data.This indexation is useful to create the inverted lists, thusenabling top-k processing for search and recommendation.To the best of our knowledge, this is the first work thatprovide a real demonstration of a complete distributed anddiversified recommendation tool using real scientific data.

This paper is structured in the following way. First, Sec-tion II presents an overview of PlantRT’s architecture.Followed by Section III, which presents the details of ourdistributed TF × IDF solution. Section IV describes thedemonstration itself, including the use case and detailsabout the prototype’s deployment. Finally, Section V con-cludes the paper.

II. PlantRT Architecture’s Overview

In this section, PlantRT’s architecture is discussed.An important element of the prototype is the conceptof virtual nodes. That is, each site hosts 1 to n virtualnodes, each referring to a set of 1 to m users having sim-ilar profiles, and produced using k-means. Virtual nodeshave two benefits. First, they enable to index items in apersonalized way [1], thus improving top-k performances.Second, they permit to define the logical overlay topology,noted VN-Network, using gossip protocols. Notice thatthe VN-Network is used during query processing to guiderecommendation among sites.

As presented, PlantRT distributed search and rec-ommendation platform is composed of multi-sites (e.g.server, PC, cloud). Each site hosts 1 to n virtual nodes,each referring to a set of 1 to m users having similarprofiles. Each user shares a set of items (i.e. observationsimages) that are associated to meta-data and used toproduce her profile [6]. The meta-data refer to the plant’sfamily, species and genus, the observation’s location anda description. Meta-data and users’ profiles are presentedthrough a TF × IDF vector, the latter being used by k-means to produce the virtual nodes. Notice that TF×IDFis computed in a completely distributed manner, thusavoiding centralization, as explained in Section III.

In our approach, virtual nodes are used both to person-alize indexation and to facilitate the distributed overlayconstruction – users with different profiles will forwardtheir queries through the overlay to different sites.

Thus, each virtual node holds an index (inverted lists)that refers to the items of the related site. Inverted listsassociate for each keyword the set of items that containit. Finally, the site items are ranked in a personalizedway with respect to the users’ profile of the correspondingvirtual node [1]. This enables to improve top-k processingperformance over the inverted lists. Notice that virtualnodes of a same site, have access to all the items of thatsite. Figure 1 presents an example composed of four sites,three being virtual machines in the cloud and one beinga local computer. In this example, we focus on the caseof the Windows Azure’s site where two virtual nodes havebeen produced; also, the site is associated to a databasecontaining all items shared in it.

Additionally, virtual nodes are connected togetherthrough a distributed overlay called VN-network. It en-ables to connect diverse virtual nodes taking into accountprofile diversity. That is, a virtual node has its own profile,corresponding to its center (i.e. average of all profiles in thevirtual node), computed during the clustering process with

k-means. By diversifying the VN-network which is usedto guide recommendation, both results diversification andrecall levels are increased while latency is minimized [7].In Figure 1, each virtual node of the Windows Azure’ssite is connected to two other virtual nodes owned by twodifferent sites. The first one, VN 1 is connected to bothAmazon and Google’s sites while the second one, VN 2, isconnected to both the local computer and Google’s site.

The VN-network overlay is built in two steps:

1) based on random gossiping, each site is aware ofremote virtual nodes available on the network,

2) at each gossip exchange, each local virtual nodeapplies a distributed profile diversification clusteringalgorithm [7], to build its VN-network.

Finally, to retrieve items, users submit queries. A queryis generally keyword based q = k1, ..., kp.Whenever a useru submits a query q, it is forwarded to all nodes in its localvirtual node’s VN-network. Each virtual node receiving thequery repeats the same procedure until the query TTL(i.e. time to live) is reached. Finally, each involved virtualnode, including u’s one, returns its relevant items, selectedwithin its site, with respect to q and u’s profile, by usinga specific similarity measure (e.g. Jaccard).

III. PlantRT Distributed TF × IDF Indexing

As presented in Section II, in PlantRT, items refer toimages that are associated to textual meta-data. Thus, inorder to be indexed in inverted lists, items are representedby a TF × IDF [4] vector.

In this section, we present in details our distributedsolution for TF × IDF computation because it is anoriginal solution for distributed indexation that does notdepend on any centralized authority. Recall that TF×IDFis a score that defines the importance of a term given anitem and the global corpus (i.e. set of all items storedin the whole system), which, in our case, is completelydistributed. The intuition behind our solution is thatwe only need average statistics of the global corpus tocompute the full score.

More precisely, the TF × IDF score of a given term tis obtained by multiplying the term frequency (TF ) byits inverse document (i.e. item) frequency (IDF ) withinthe global corpus I. While the first part can be computedlocally, the latter needs a distributed protocol since theglobal corpus is not available. Thus, to compute IDF ateach site s, we use a gossip-based protocol. Generally, IDFis computed as follows:

IDF (t, I) = log|I|Ft/I

(1)

Where Ft/D is the number of items in the whole corpus Ithat contain the term t.

The goal of our distributed approach is to compute theaverage number of items per site, noted |I|/n, and for eachterm t, the average number of items that contain it per site,noted

Ft/D

n , where n is the total number of sites. Then, we

q = asteraceaeUn-diversified Profile Diversity

Elodie Dujardin Marie Dupont Pierre Durand

Table I: Search and recommendation example withprofile diversity.

simply exploit the property of convergence of gossip-basedprotocols applied to P2P average computing [5]. Based onthese averages, each site will be able to compute the IDFvector locally.

IV. PlantRT Demonstration

In this section, we show characteristics of PlantRT.First in Section IV-A, we present the use case of ourdemonstration. Then in section IV-B, we show details ofthe deployment and of the behavior of the prototype.

A. PlantRT Use Case

Recall that we rely on diversified search and recom-mendation that takes into account the query, the queryinitiator’s profile, and the diversification of both the rele-vant items and the profiles of the users sharing them. Herewe show the behavior of PlantRT over our distributedplatform with different types of queries, submitted byusers having various profiles, to show that our distributedrecommendation approach provides good results in termsof recall and diversification.

Table I presents the results obtained by executing thequery “asteraceae” – which corresponds to a very largefamily of plants, including the well known daisy – overour whole dataset of 10, 000 observations and 1, 500 users.We chose three users to analyze the returned items.

In Table I, each column refers to the results obtainedby a single user. In the first column, corresponding toElodie Dujardin, results are un-diversified. In the twoother columns, corresponding to Marie Dupont and PierreDurand, results have been obtained using profile diver-sification. Figure 2a, 2b and 2c presents the profiles ofrespectively Elodie Dujardin, Marie Dupont and PierreDurand. Recall that the users’ profiles are derived from

ASTERACEAE, Leucanthemum atratum - Montpellier (34)

ASTERACEAE, Leucanthemum adustum - Montpellier (34)

ASTERACEAE, Cichorium intybus - Montpellier (34)...(a) Elodie’s Profile

ASTERACEAE, Cirsium vulgare - Palavas (34)ASTERACEAE, Cichorium intybus - Paris (75)

ASTERACEAE, Crepis vesicaria - Montpellier (34)...(b) Marie’s profile.

RHAMNACEAE, Ramnus alaternus L. - Montpellier (34)

ASTERACEAE, Senecio doronicum - Montpellier (34)

ASTERACEAE, Echinops ritro L. - Montpellier (34)

...(c) Pierre’s profile.

Figure 2: Users’ profiles.

their observations; therefore each line refers to the ob-served plant’s family and species associated to the locationof the observation.

Not diversifying the results leads to the first columnwhich only contains very redundant plants observations. Inthe other hand, diversifying the results enables to retrievea broader spectrum of plants species. For instance, the firstresult of Marie Dupont is a plant which species is Cirsiumvulgare (Savi) Ten, while the second one corresponds to thespecies Crepis vesicaria. Similarly, Pierre’s first results isSenecio doronicum, while the second one is Echinops ritro.These species can be observed in their respective profile inFigure 2.

We now show how our diversified recommendationmethod can be applied in the case where a user wishes toretrieve plants observations made in the geographical areawhere she is. In this case, the query is the geo positionand relevant items are chosen taking into account thegeographical area around q, its initiator’s profile and thediversification of both the relevant items and the profilesof the users sharing them.

For instance, in Figure 3, Pierre Durand went hikingaround Montpellier. He is represented by the blue dotin the center of the map, thus referring to the queryq = 〈latitude :43.74〉, 〈longitude :3.9〉. The results of thequery are represented by the red pins. They refer todiversified observations made in the area. For instance, inthe figure, we zoomed on an observation of a plant fromthe Rahmnaceae’s family.

B. PlantRT Deployment and Robustness

In our demo, we show the diverse steps necessary to de-ploy PlantRT over a multi-site architecture using virtualnodes. In addition, we show that PlantRT is robust andperforms well.

Two main steps are needed to deploy PlantRT. Thefirst step consists in preparing the environment for theprototype; the second one consists in preparing PlantRTto be executed in the previous environment.

First, an environment must be created to hostPlantRT. This environment refers to the machine (e.g.local computer, server or a virtual machine in the cloud)and the set of applications needed to host the prototype.

More precisely, two applications must be installed, onebeing the application server and the other one being adatabase. . The database is used to store all metadatasuch as users’ profiles or items shared. In our case, we useMongoDB 5. Once the first step is complete, PlantRTmust be configured to be run in this specific environment.The configuration consists in just a few informations: theIP of an existing site to connect to, the maximum numberof virtual nodes, the frequency of the gossip protocol andthe frequency of the distributed TF × IDF protocol. Onceconfigured, PlantRT can be uploaded in the applicationserver (i.e. tomcat 6) and will starts automatically.

During the initialization, several operations are per-formed by PlantRT automatically. First, the virtualnodes are built and associated to a set of inverted lists.In our case, each set of inverted lists is an instance ofLucene 7. Then, the user – or administrator in the caseof multi-users site – specified an IP address to whichPlantRT will connect. The IP refers to an existingremote PlantRT’s site sr. Once the connection is estab-lished between the two sites, virtual nodes from sr areretrieved to the local site and the gossip protocol is startedand executed periodically at a system defined frequency.The gossip view (i.e. random view) are implemented assimple arrays. The last protocol to be initialized andstarted is TF × IDF distributed protocol since it relies onthe gossip view to connect to other sites. The initializationconsists in building a TF × IDF vector with the site’slocal corpus. Once initialized, periodical exchanged areestablished to compute the TF × IDF vector. Notice

5. http://www.mongodb.org6. http://tomcat.apache.org7. http://lucene.apache.org

Figure 3: PlantRT’s example of geo recommenda-tion next to Montpellier, France.

that this protocol runs indefinitely to take into accountnew items added to any sites. After this final step, thePlantRT becomes accessible to the users.

In our demo, we show that virtual nodes and a diver-sified VN-Network enable to achieve very good resultsin term of diversification, recall (i.e. the proportion ofresults answering a query retrieved) and performance.More precisely, with different numbers of sites and virtualnodes, we show that PlantRT reaches centralized-likerecall results. Additionally, we have simulated up to 6, 000nodes still reaching recall results of 99, 9 and very goodperformance [7].

The distributed TF × IDF protocol shows very goodconvergence properties. In our demo, we show that only afew exchanges at each site are needed to compute a TF ×IDF vector with respect to the global corpus.

V. Conclusion

This paper presents PlantRT a multi-sites diversifiedsearch and recommendation tool for citizen sciences, andmore precisely, for botanists.

We proposed an original multi-site platform that is builtover virtual nodes. This enables to reduce network costsand to improve performances of top-k processing. Ourdemo is done using real scientific data over a concretemulti-site platform. We show several use cases for diversi-fied search and recommendation. In addition, we show, indetails, how we deployed ou platform.

The still evolving source code is available at thefollowing address: http://www2.lirmm.fr/˜servajean/prototypes/plant-sharing/plant-rt.html.

References

[1] S. Amer-Yahia and M. Benedikt. Efficient network aware searchin collaborative tagging sites. VLDB Endowment ’08, 1(1):710–721, 2008.

[2] Fady Draidi, Esther Pacitti, Didier Parigot, and GuillaumeVerger. P2Prec: a Social-Based P2P Recommendation System.In CIKM, pages 2593–2596, 2011.

[3] A. Joly, H. Goeau, P. Bonnet, B. Vera, J. Barbe, S. Souheil,Y. Itheri, J. Carre, E. Mouysset, J.F. Molino, N. Boujemaa, andD. Barthelemy. Interactive plant identification based on socialimage data. In Ecological Informatics, 2013.

[4] KS Jones. A statistical interpretation of term specificity and itsapplication in retrieval. Journal of documentation, 1972.

[5] W. Kowalczyk, M. Jelasity, and AE. Eiben. Towards data miningin large and fully distributed peer-to-peer overlay networks. InBNAIC, pages 203–210, 2003.

[6] Maximilien Servajean, Esther Pacitti, Sihem Amer-Yahia, andPascal Neveu. Profile Diversity in Search and Recommendation.In WWW Companion, pages 973–980, 2013.

[7] Maximilien Servajean, Esther Pacitti, Miguel Liroz-Gistau, Si-hem Amer-Yahia, and Amr El Abbadi. Exploiting Diversity inGossip-Based Recommendation. In Globe and BDA (submitted),2014.