Large-scale data sharing by exploiting gossiping Esther Pacitti SOPHIA ANTIPOLIS - MÉDITERRANÉE...

Large-scale data sharing by exploiting gossiping

Esther Pacitti

Saphir

SOPHIA ANTIPOLIS - MÉDITERRANÉE

1st Gossple Workshop on Social Networking (December 2010)

Context: P2P Data Sharing
• We consider P2P online communities where participants can be
  – Professionals (researchers, engineers, support staff, etc.) who use web-scale collaboration in their workplace
  – Large scale of users and data (clouds, grids, internet)
• Example of applications:
  – P2P Recommendation Systems
    • Useful for processing scientific workflows among participants’ peers
  – P2P Query Reformulation
    • Clinical case sharing among doctors or physicians
  – P2P CDN
• Projects:
  – ANR DataRing (2009-2012, P2P online communities)
  – Datluge (2010-2012, with UFRJ, Brazil, on P2P scientific workflows)

MOTIVATIONS

[Figure: application domains: Chemistry, Materials Science and Physics; Bioinformatics; Computer Science]

P2PRec: document recommender
• Huge graph: G = (D, U, E, T), where
  – D is the set of shared documents
  – U is the set of users in the system
  – E is the set of edges between the users such that there is an edge e(u,v) if users u and v are friends
  – T is the set of users’ topics of interest.
• Problem: Given a query, recommend the most relevant documents
• Our approach
  – Reduce the search space by identifying relevant users
  – Identify relevant users
    • Users that store/download enough high-quality documents, and become a kind of provider in specific topics
    • Recommended by trusted friends
• P2P Overlay: Semantic-Gossiping
  • Disseminate relevant users and their topics of interest

P2PRec*: document recommender
• Topics of interest
  – With respect to the documents a user stores
  – Extracted automatically
• Friendship network
  – Explicit friendship (may be leveraged with implicit friendship)
  – Expresses users’ trust
  – Implemented in FOAF files (Friend of a Friend, a machine-readable vocabulary serialized in RDF/XML)
• Keyword queries
  – Mapped to topics
  – Mostly related to the user’s topics of interest
• Measures to
  – Check the similarity of users wrt their topics (Dice coefficient)
  – Compute the relevance of a user

*Joint work with F. Draidi, P. Valduriez, B. Kemme, to appear as Inria report
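The Dice coefficient mentioned above can be illustrated with a minimal sketch. It assumes that each user's topics of interest form a plain set of topic labels. The function name and the handling of two empty profiles are illustrative choices, not taken from the slides.

```python
def dice_coefficient(topics_u, topics_v):
    """Dice coefficient between two topic sets: 2|A ∩ B| / (|A| + |B|)."""
    a, b = set(topics_u), set(topics_v)
    if not a and not b:
        return 0.0  # assumption: two empty profiles count as dissimilar
    return 2 * len(a & b) / (len(a) + len(b))


# Topic sets as they appear in the gossiping example of the slides.
u1_topics = {"t1", "t2"}
u5_topics = {"t1", "t2"}
u6_topics = {"t2", "t3"}

print(dice_coefficient(u1_topics, u5_topics))  # 1.0 (identical topics)
print(dice_coefficient(u1_topics, u6_topics))  # 0.5 (one shared topic out of 2 + 2)
```

The value ranges from 0 (no shared topic) to 1 (identical topic sets), which makes it easy to compare against a threshold.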

Semantic-Gossiping

[Figure: users u1, u2, u4, u5 and u6, annotated with their topics of interest (t1, t2, t3), exchange gossip messages. Each local view is a table with the columns "user" and "gossip information", with entries such as (u5: t1, t2), (u6: t2, t3) and (u4: t1). The figure shows the local views of u1 and u5 before and after the gossip exchange.]

• If the distance between u_u and u_v > τ, ask for friendship (Dice coefficient)
• If friendship is accepted, add u_v to the FOAF file
• u1 FOAF: u1 topics: t1, t2; Friends: link to u5 FOAF (u5 topics)
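A gossip exchange of local views can be sketched as follows. This is a hedged illustration, not the protocol of the slides: it assumes a symmetric push-pull exchange in which both peers send their view plus their own entry and keep the union. It also assumes the friendship test compares a Dice similarity against τ. The initial views and the value τ = 0.6 are hypothetical.

```python
def dice(a, b):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) between two topic sets."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


def gossip_exchange(a, b, topics, views):
    """One symmetric exchange between peers a and b (assumed push-pull)."""
    msg_a = {**views[a], a: topics[a]}  # a sends its view plus its own entry
    msg_b = {**views[b], b: topics[b]}
    merged = {**msg_a, **msg_b}
    # Each peer keeps every entry except the one describing itself.
    views[a] = {u: t for u, t in merged.items() if u != a}
    views[b] = {u: t for u, t in merged.items() if u != b}


def friendship_candidates(user, topics, views, tau):
    """Users in the local view whose similarity to `user` exceeds tau."""
    return sorted(v for v, t in views[user].items()
                  if dice(topics[user], t) > tau)


topics = {"u1": {"t1", "t2"}, "u4": {"t1"},
          "u5": {"t1", "t2"}, "u6": {"t2", "t3"}}
views = {"u1": {"u4": topics["u4"]},   # hypothetical view of u1 before gossip
         "u5": {"u6": topics["u6"]}}   # hypothetical view of u5 before gossip

gossip_exchange("u1", "u5", topics, views)
print(sorted(views["u1"]))                                # ['u4', 'u5', 'u6']
print(friendship_candidates("u1", topics, views, 0.6))    # ['u4', 'u5']
```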

Relevant Users
• Users’ topics of interest are automatically extracted using LDA*
  – by inspecting the documents’ topic vectors
• A user u is considered relevant on a topic t ∈ T_u if a percentage of its documents have high quality in topic t
• Each document doc at user u has
  – A rate given to doc: rate_doc
  – A doc topic vector (extracted using LDA)
    • V_doc = {w_doc^{t_1}, ..., w_doc^{t_d}}
• doc is considered of high quality in a topic t, quality_t(doc, u),
  • if w_doc^t * rate_doc > a threshold value
• A user can be relevant in more than one topic

*Latent Dirichlet Allocation (topic classifier)
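The relevance rule above can be sketched in a few lines. The document layout (a dict with "weights" and "rate"), the rating scale, the quality threshold and the required fraction of high-quality documents are all hypothetical values chosen for illustration. The slides only state that a threshold and a percentage exist.

```python
def is_high_quality(doc, topic, quality_threshold):
    """quality_t(doc, u): the topic weight times the rate exceeds the threshold."""
    return doc["weights"].get(topic, 0.0) * doc["rate"] > quality_threshold


def relevant_topics(docs, quality_threshold, min_fraction):
    """Topics in which at least min_fraction of the user's docs are high quality."""
    if not docs:
        return set()
    topics = {t for d in docs for t in d["weights"]}
    relevant = set()
    for t in topics:
        n_good = sum(is_high_quality(d, t, quality_threshold) for d in docs)
        if n_good / len(docs) >= min_fraction:
            relevant.add(t)
    return relevant


# Hypothetical documents of one user: LDA-style topic weights plus a rating.
docs = [
    {"weights": {"t1": 0.8, "t2": 0.2}, "rate": 5},
    {"weights": {"t1": 0.6, "t3": 0.4}, "rate": 4},
    {"weights": {"t2": 0.9, "t1": 0.1}, "rate": 2},
]

print(relevant_topics(docs, quality_threshold=2.0, min_fraction=0.5))  # {'t1'}
```

Here two of the three documents score above 2.0 on t1 (4.0 and about 2.4), so the user is relevant on t1 only.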

Query Processing
• Implements recommendation
• Input: keywords
• Output:
  – Links to a set of good-quality documents. May include links to documents on the topics of interest of a friend (query expansion)
  – Popularity and similarity info
• Example: doctors studying the behavior of a gene X may be glad to learn about the diseases it can cause and check some experimental data sets

Query Processing

[Figure: requester u1 issues query q with q.t = t1 and q.TTL = 2. The query is forwarded among users u1 to u7 (annotated with topics t1, t2, t3) with q.TTL going from 2 to 1 to 0. Contacted users compute sim(doc, q) and return recommended docs together with a summary of the docs’ similarity and classification info. u1’s FOAF file holds u1’s topics of interest and its friends, e.g. a link to u5’s FOAF file with u5’s topics.]

1) Query q is mapped to a topic or topics Tq
2) Select the Top-k friends in the FOAF file wrt the query topics (cosine similarity)
3) Redirect the query
4) Do 2) and 3) recursively until the TTL expires
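The routing steps 2) to 4) can be sketched as below. This is an illustration under assumptions: topic profiles are dicts from topic to weight, the FOAF structure is a dict from peer to its friends' profiles, and the friendship topology is hypothetical. Step 1 (mapping keywords to topics) and the document ranking by sim(doc, q) are left out.

```python
import math


def cosine(p, q):
    """Cosine similarity between two sparse topic vectors (dict topic -> weight)."""
    dot = sum(w * q.get(t, 0.0) for t, w in p.items())
    norm_p = math.sqrt(sum(w * w for w in p.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_p * norm_q)


def top_k_friends(query_topics, friends, k):
    """Step 2: the k friends whose topics are most similar to the query topics."""
    scored = [(cosine(query_topics, t), name) for name, t in friends.items()]
    scored = [(s, n) for s, n in scored if s > 0]
    scored.sort(key=lambda sn: (-sn[0], sn[1]))
    return [n for _, n in scored[:k]]


def route_query(peer, query_topics, ttl, foaf, k, visited=None):
    """Steps 3 and 4: forward the query recursively until the TTL reaches 0."""
    visited = [] if visited is None else visited
    if peer in visited:
        return visited
    visited.append(peer)
    if ttl == 0:
        return visited
    for friend in top_k_friends(query_topics, foaf.get(peer, {}), k):
        route_query(friend, query_topics, ttl - 1, foaf, k, visited)
    return visited


# Hypothetical FOAF data: peer -> {friend: topic vector}.
foaf = {
    "u1": {"u5": {"t1": 1.0, "t2": 1.0}, "u6": {"t2": 1.0, "t3": 1.0},
           "u4": {"t1": 1.0}},
    "u5": {"u4": {"t1": 1.0}, "u1": {"t1": 1.0, "t2": 1.0}},
    "u4": {"u2": {"t1": 1.0}},
}

print(route_query("u1", {"t1": 1.0}, ttl=2, foaf=foaf, k=2))
# ['u1', 'u4', 'u2', 'u5']
```

With q.t = t1 and TTL = 2, u1 prefers u4 and u5 over u6, whose topics do not overlap with the query.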

Conclusions P2PRec
• P2PRec (BDA2010)
  – Find friends (relevant users on similar topics) while gossiping
  – Query processing exploits relevant users wrt the query topics, recursively (FOAF friends)
  – Perf. evaluation
    • Recall x Precision x Response Times
  – Limitation of LDA: needs some centralization for training, but good to validate our general approach
  – However there are other possibilities:
    • Ontology-based automatic annotation
    • This exists for biomedical documents

P2P Query Reformulation*
• P2P Data Management System (PDMS)
• Each peer has:
  – Its own schema (and data)
  – 1 or more mapping acquaintances to/from which at least 1 mapping rule exists
• Goal: Given a query, exploit mapping acquaintances as much as possible to enhance query responses.

[Figure: peers A and B, each with its own schema (Schema A, Schema B) and data, connected by the mapping Mb,a]

*Joint work with A. Bonifati, G. Summa, P. Valduriez, to appear as Inria report

Concepts

Query: ?= Hospital(x, “San Francisco”)

MAPPING RULE Mb,a (between the schemas of peers B and A), made of atoms:
  Hospital(x, y) ⇢ HealthCareInst(x, y, z)
  BODY: Hospital(x, y)
  HEAD: HealthCareInst(x, y, z)

Source schema:
  Hospital [0..*]: name, location
  Grant [0..*]: amount, institution, manager
  Doctor [0..*]: name, salary

Target schema:
  HealthCareInst [0..*]: name, city, id
  Grant [0..*]: amount, scientist

Translation ALONG Mb,a:
  ?= Q:  Hospital($X, “San Francisco”)
  ?= Q’: HealthCareInst($X, “San Francisco”, $Z)
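The translation of Q into Q’ along Mb,a can be reproduced with a small sketch. It assumes atoms are (label, arguments) pairs, mapping variables are lowercase names, and head variables not bound by the query become fresh query variables written with a `$` prefix. It handles only a single body atom without repeated variables, which is a simplification of the translation semantics.

```python
def translate_along(query, body, head):
    """Rewrite `query` along the mapping body -> head, or return None."""
    q_label, q_args = query
    b_label, b_args = body
    if q_label != b_label or len(q_args) != len(b_args):
        return None  # the query atom does not match the body atom
    binding = dict(zip(b_args, q_args))  # mapping variable -> query term
    h_label, h_args = head
    # Unbound head variables (here z) become fresh query variables ($Z).
    new_args = tuple(binding.get(v, "$" + v.upper()) for v in h_args)
    return (h_label, new_args)


q = ("Hospital", ("$X", "San Francisco"))
body = ("Hospital", ("x", "y"))
head = ("HealthCareInst", ("x", "y", "z"))

print(translate_along(q, body, head))
# ('HealthCareInst', ('$X', 'San Francisco', '$Z'))
```

Translating against the mapping direction would swap the roles of body and head in the same function.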

Mapping Relevance
• Each time a query gets translated by exploiting a mapping, we get a relevant rewriting
• The relevance can be Forward (along) or Backward (against), depending on the matched side of the mapping
• Goal:
  – Collect as many rewritings as possible
  – Find the most interesting paths to take (avoid useless paths)

M1 Hospital(x, y) ⇢ HealthCareInst(x, y, z)

M2 Institution(x, y, z) ⇢ Hospital(x, y)


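Problem

[Figure: peers A, B, C, D, Z and further peers (H, L, M, ...) connected by the mappings Mb,a, Mb,d, Mb,z and Mc,a. The query ?= Q at B is rewritten into ?= Q’ and ?= Q’’ ALONG mappings, while Mc,a is marked AGAINST.]

1) How to choose the most relevant paths to undertake in the reformulation task?
2) Are there other peers in the network which can be contacted?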

Acquaintances
• Gossiping acquaintances
  – Potential friends that dynamically appear in the local semantic view (LSV)
• Mapping acquaintance
  – There is at least 1 direct mapping towards it (friend)
  – Established manually
• Social acquaintance (FOAF friend)
  – No direct mapping is needed towards it
  – There are some common interests
  – Established explicitly

Our Approach

• Gossip to disseminate mapping rule information to find friends
• Users’ topics of interest
  – are expressed according to the schema information or past query topics
• Measures to
  – Compute the relevance of a mapping wrt a query
  – Compute similarity between users
• Exploits recursively (to translate a query)
  – Mapping acquaintances
  – Social acquaintances

Gossiping Acquaintances

Social Acquaintances
• Friend
  – Shares common topics of interest
• Interests
  – Formulated by queries
  – Elements of the peer’s schema
• Approach: use the semantic view to discover friends

[Figure: a peer’s schema alongside example queries expressing its interests:
  ?= State(y, z, “California”)
  ?= Doctor(w, k)
  ?= Patology(“heart”, x)
  ...]

Compute Relevance
Goal: Given a query and a mapping rule, determine if the mapping is relevant to the query
Method (Standard Match Semantics)
  – Atom label matching
  – Parameter compatibility

M1 Hospital(x, y) AND State(x, z) ⇢ HealthCareInst(x, y, z)

M2 Hospital(x, y, w) AND State(x, z) ⇢ HealthCareInst(x, y, z)

M3 Ospedale(x, y) AND State(x, z) ⇢ HealthCareInst(x, y, z)
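A rough sketch of the match test on M1, M2 and M3 follows. It reads "parameter compatibility" as "same number of parameters", which is an assumption, and it uses the query Hospital(x, “San Francisco”) from the earlier slides. Under that reading only M1 is relevant: M2 fails on the parameters and M3 fails on the atom label.

```python
def atom_matches(query_atom, mapping_atom):
    """Atom label matching plus parameter compatibility (assumed: same arity)."""
    q_label, q_args = query_atom
    m_label, m_args = mapping_atom
    return q_label == m_label and len(q_args) == len(m_args)


def is_relevant(query_atom, body):
    """A mapping is relevant if at least one body atom matches the query atom."""
    return any(atom_matches(query_atom, atom) for atom in body)


query = ("Hospital", ("$X", "San Francisco"))

# Bodies of the three mappings shown above.
m1_body = [("Hospital", ("x", "y")), ("State", ("x", "z"))]
m2_body = [("Hospital", ("x", "y", "w")), ("State", ("x", "z"))]
m3_body = [("Ospedale", ("x", "y")), ("State", ("x", "z"))]

for name, body in [("M1", m1_body), ("M2", m2_body), ("M3", m3_body)]:
    print(name, is_relevant(query, body))
# M1 True
# M2 False
# M3 False
```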

Compute Relevance
• AF-IMF measure, inspired by TF-IDF*
• AF (Atom Frequency)
  – Local measure, establishing the importance of the query atom in the current mapping
• IMF (Inverse Mapping Frequency)
  – Distributed measure, establishing the overall importance of the query atom
• Relevance of a mapping wrt q is AF * IMF

*term frequency-inverse document frequency

Compute Relevance (AF)
About the applied measure
◦ To increase the effectiveness of the measure we distinguish, again, Forward/Backward relevance

M1 Hospital(x, y) AND State(x, z) ⇢ HealthCareInst(x, y, z)   (body ⇢ head)
  FORWARD MEASURE: AF = 1/2

M2 Institution(x, y, z) ⇢ Hospital(x, y)
  BACKWARD MEASURE: AF = 1

Compute Relevance (IMF)

• IMF requires a way to get a value for
  – The total number of mappings
  – The total number of mappings containing that atom
• To do that, we can inspect the semantic view of the peer
  – Also by sending inquiries to peers in the FOAF
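The AF-IMF measure can be sketched by analogy with TF-IDF. The slides do not give the formulas, so both definitions below are assumptions: AF is taken as the share of atoms on the matched side that carry the query atom's label, and IMF as log(total mappings / mappings containing the atom). With the query atom Hospital, this reading gives AF = 1/2 for M1 (forward, body side) and AF = 1 for M2 (backward, head side). The third mapping is hypothetical and only there to make the IMF non-zero.

```python
import math


def atom_frequency(label, side):
    """AF (assumed): share of atoms on the matched side with the query label."""
    return sum(1 for a in side if a == label) / len(side)


def inverse_mapping_frequency(label, mappings):
    """IMF (assumed, by analogy with IDF): log(N / number of mappings with the atom)."""
    containing = sum(1 for body, head in mappings
                     if label in body or label in head)
    if containing == 0:
        return 0.0
    return math.log(len(mappings) / containing)


def af_imf(label, side, mappings):
    """Relevance of a mapping wrt the query atom: AF * IMF."""
    return atom_frequency(label, side) * inverse_mapping_frequency(label, mappings)


# Mappings as (body labels, head labels); the last one is hypothetical.
m1 = (["Hospital", "State"], ["HealthCareInst"])
m2 = (["Institution"], ["Hospital"])
m_other = (["Doctor"], ["Physician"])
known_mappings = [m1, m2, m_other]

print(atom_frequency("Hospital", m1[0]))  # 0.5 (forward: body of M1)
print(atom_frequency("Hospital", m2[1]))  # 1.0 (backward: head of M2)
print(af_imf("Hospital", m1[0], known_mappings))
```

In a deployment the two counts needed by the IMF would come from the peer's semantic view or from inquiries to FOAF peers rather than from a local list.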

Translate-Query

• Compute relevance on local mappings wrt Q
  – Choose the TopK mappings
  – Apply the translation semantics, along/against the mapping direction
  – Trigger Translate-Query on the mapping acquaintance, recursively (until TTL)
• Select FOAF friends to be contacted
  – By looking at the best mapping summaries wrt Q
  – Trigger Translate-Query on the social acquaintance, recursively (until TTL)
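The recursion over mapping acquaintances can be sketched as follows. This is a deliberately reduced illustration: queries are shortened to a single atom label, relevance scores are given as precomputed numbers, and the selection of FOAF friends (social acquaintances) is left out. The network layout, the scores and the labels Clinic and MedicalCenter are hypothetical.

```python
def translate_query(peer, query, ttl, network, top_k, seen=None):
    """Collect (peer, rewritten query) pairs reachable within ttl hops."""
    seen = set() if seen is None else seen
    if ttl == 0 or (peer, query) in seen:
        return []
    seen.add((peer, query))
    # network[peer]: (relevance, source label, target label, acquaintance)
    candidates = [m for m in network.get(peer, []) if m[1] == query]
    candidates.sort(key=lambda m: -m[0])  # most relevant mappings first
    rewritings = []
    for _, _, target, neighbour in candidates[:top_k]:
        rewritings.append((neighbour, target))
        rewritings.extend(
            translate_query(neighbour, target, ttl - 1, network, top_k, seen))
    return rewritings


network = {
    "B": [(0.9, "Hospital", "HealthCareInst", "A"),
          (0.4, "Hospital", "Clinic", "D")],
    "A": [(0.7, "HealthCareInst", "MedicalCenter", "C")],
}

print(translate_query("B", "Hospital", ttl=2, network=network, top_k=1))
# [('A', 'HealthCareInst'), ('C', 'MedicalCenter')]
```

With top_k = 1 the low-relevance mapping towards D is never followed, which is the pruning effect the TopK selection is meant to achieve.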

Performance Evaluation

• Baseline
  – No gossiping, original query propagated
• Baseline+
  – No gossiping, translated query propagated
• Baseline#
  – No gossiping, translated query propagated, local measure to sort mappings (by using AF only)
• Full-
  – Gossiping, translated query propagated, AF-IMF measure to sort mappings, no FOAF links (only local mappings)
• Full (P2PRec)
  – Gossiping, translated query propagated, AF-IMF measure to sort mappings, FOAF links exploited

[Figure: Recall (0% to 100%) versus TopK Mapping Threshold (TOPKM-1, TOPKM-5, TOPKM-10, TOPKM-15, TOPKM-20, TOPKM-25) for Baseline, Baseline+, Baseline#, Full- and Full]

Effectiveness of AF-IMF, LSV and gossiping

Conclusions
• P2P Query Reformulation
  – Gossiping is used to disseminate mapping rule information
  – Exploits relevant mappings recursively
    • Mapping acquaintances
    • Social acquaintances
  – Initial perf. results:
    • Very good recall results (over 90%)
    • Linear scale-up
    • Trade-off between recall and response times
  – Previous work uses
    • DHTs or a centralized mediation model.

About Montpellier

• Best quality of life in France
• Important laboratories (LIRMM) and research institutes (INRA, CIRAD, etc.)
• University of Montpellier is part of the « opération campus »
• Soon we will have a direct TGV line to Barcelona (1 hour)
