Large-scale data sharing by exploiting gossiping Esther Pacitti SOPHIA ANTIPOLIS - MÉDITERRANÉE...
Large-scale data sharing by exploiting gossiping
Esther Pacitti
Saphir
SOPHIA ANTIPOLIS - MÉDITERRANÉE
1st Gossple Workshop on Social Networking (December 2010)
Context: P2P Data Sharing
• We consider P2P online communities where participants can be
  – Professionals (researchers, engineers, support staff, etc.) who use web-scale collaboration in their workplace
  – Large scale of users and data (clouds, grids, internet)
• Examples of applications:
  – P2P Recommendation Systems
    • Useful for processing scientific workflows among participants’ peers
  – P2P Query Reformulation
    • Clinical case sharing among doctors or physicians
  – P2P CDN
• Projects:
  – ANR DataRing (2009-2012, P2P online communities)
  – Datluge (2010-2012, with UFRJ, Brazil, on P2P scientific workflows)
[Figure labels: Chemistry, Materials Science and Physics; Bioinformatics; Computer Science]
MOTIVATIONS
P2PRec: document recommender
• Huge graph: G = (D, U, E, T), where
  – D is the set of shared documents
  – U is the set of users in the system
  – E is the set of edges between users, with an edge e(u,v) if users u and v are friends
  – T is the set of users’ topics of interest
• Problem: given a query, recommend the most relevant documents
• Our approach
  – Reduce the search space by identifying relevant users
  – Identify relevant users
    • Users who store/download enough high-quality documents and become a kind of provider for specific topics
    • Recommended by trusted friends
• P2P overlay: Semantic-Gossiping
  • Disseminate relevant users and their topics of interest
P2PRec*: document recommender
• Topics of interest
  – With respect to the documents a user stores
  – Extracted automatically
• Friendship network
  – Explicit friendship (may be leveraged with implicit friendship)
  – Expresses users’ trust
  – Implemented in FOAF files (Friend of a Friend files, a machine-readable vocabulary serialized in RDF/XML)
• Keyword queries
  – Mapped to topics
  – Mostly related to the user’s topics of interest
• Measures to
  – Check the similarity of users with respect to their topics (Dice coefficient)
  – Check the relevance of a user
*Joint work with F. Draidi, P. Valduriez, B. Kemme, to appear as Inria report
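The slide names the Dice coefficient as the user-similarity measure but does not give the formula. A minimal sketch, assuming each user's topics of interest are represented as a plain set of topic labels (the slides may well use weighted topic vectors instead); the function name and the handling of empty profiles are illustrative choices, not taken from the talk:

```python
def dice_coefficient(topics_u, topics_v):
    """Dice coefficient of two topic sets: 2 * |A & B| / (|A| + |B|)."""
    a, b = set(topics_u), set(topics_v)
    if not a and not b:
        # Assumption: two empty profiles are treated as dissimilar.
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


# Two users sharing one topic out of two each.
u1_topics = {"t1", "t2"}
u6_topics = {"t2", "t3"}
print(dice_coefficient(u1_topics, u6_topics))  # 0.5
```

The value ranges from 0 (no shared topic) to 1 (identical topic sets).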
Semantic-Gossiping
[Figure: peers u1, u2, u4, u5 and u6, each labelled with its topics (labels shown: t1,t2; t1; t1; t3; t2,t3); u1 and u5 gossip and exchange their local views.]

u1’s local view before gossip
user   Gossip information
u5     t1, t2
u6     t2, t3

u5’s local view before gossip
user   Gossip information
u4     t1

u1’s local view after gossip
user   Gossip information
u5     t1, t2
u6     t2, t3
u4     t1

u5’s local view after gossip
user   Gossip information
u4     t1
u6     t2, t3

• If the distance between u_u and u_v (Dice coefficient) is > τ, ask for friendship
• If the friendship is accepted, add u_v to the FOAF file

[Figure: u1’s FOAF file holds u1’s topics (t1, t2) and its friends, with a link to u5’s FOAF file, which holds u5’s topics.]
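The before/after views in the gossip example can be reproduced with a simple merge. The following is only a sketch under assumptions: a view is modelled as a dictionary from user id to a set of topics, each peer adds every entry it receives except an entry about itself, and view-size limits or entry selection (common in gossip protocols) are left out. The function names are hypothetical.

```python
def merge_views(own_id, own_view, received_view):
    """Return own_view extended with the entries of received_view.

    A view maps a user id to the set of topics gossiped about that user.
    A peer never stores an entry about itself.
    """
    merged = {user: set(topics) for user, topics in own_view.items()}
    for user, topics in received_view.items():
        if user == own_id:
            continue
        merged.setdefault(user, set()).update(topics)
    return merged


def gossip_exchange(id_a, view_a, id_b, view_b):
    """Both peers send their view and merge what they receive."""
    return merge_views(id_a, view_a, view_b), merge_views(id_b, view_b, view_a)


# Local views before gossip, as in the example.
u1_view = {"u5": {"t1", "t2"}, "u6": {"t2", "t3"}}
u5_view = {"u4": {"t1"}}

u1_after, u5_after = gossip_exchange("u1", u1_view, "u5", u5_view)
print(sorted(u1_after))  # ['u4', 'u5', 'u6']
print(sorted(u5_after))  # ['u4', 'u6']
```

After the exchange, u1 has learned about u4 and u5 has learned about u6, which matches the two "after gossip" views.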
Relevant Users
• Users’ topics of interest are automatically extracted using LDA*
  – by inspecting the documents’ topic vectors
• A user u is considered relevant on a topic $t \in T_u$ if a percentage of its documents have high quality in topic t
• Each document doc at user u has
  – A rating given to doc: $rate_{doc}$
  – A topic vector (extracted using LDA): $V_{doc} = \{w_{doc}^{t_1}, \ldots, w_{doc}^{t_d}\}$
• doc is considered of high quality in a topic t, written $quality_t(doc, u)$, if $w_{doc}^{t} \cdot rate_{doc}$ is greater than a threshold value
• A user can be relevant in more than one topic
*Latent Dirichlet Allocation (topic classifier)
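The relevance rule can be sketched as follows. The rating scale, both thresholds, and the choice to measure the "percentage" against all of the user's documents are assumptions made for illustration; the slides do not fix them. The LDA step is not shown: topic weights are given directly.

```python
def is_high_quality(weight, rate, quality_threshold):
    """A document is high quality in a topic if weight * rate exceeds the threshold."""
    return weight * rate > quality_threshold


def relevant_topics(docs, quality_threshold, min_fraction):
    """Return the topics on which a user is relevant.

    docs is a list of (rate, topic_vector) pairs for one user, where
    topic_vector maps a topic to its weight in the document.
    min_fraction is the assumed minimum share of the user's documents
    that must be high quality in a topic.
    """
    if not docs:
        return set()
    counts = {}
    for rate, topic_vector in docs:
        for topic, weight in topic_vector.items():
            if is_high_quality(weight, rate, quality_threshold):
                counts[topic] = counts.get(topic, 0) + 1
    return {t for t, n in counts.items() if n / len(docs) >= min_fraction}


# Hypothetical documents of one user: (rating, topic weights).
docs_u = [
    (5, {"t1": 0.8, "t2": 0.2}),
    (4, {"t1": 0.7, "t2": 0.3}),
    (2, {"t1": 0.1, "t2": 0.9}),
    (5, {"t2": 0.7, "t3": 0.3}),
]
print(sorted(relevant_topics(docs_u, quality_threshold=2.0, min_fraction=0.5)))   # ['t1']
print(sorted(relevant_topics(docs_u, quality_threshold=2.0, min_fraction=0.25)))  # ['t1', 't2']
```

Lowering the required fraction makes the same user relevant on a second topic, which illustrates the last bullet.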
Query Processing
• Implements recommendation
• Input: keywords
• Output:
  – Links to a set of good-quality documents. May include links to documents on the topics of interest of a friend (query expansion)
  – Popularity and similarity info
• Example: doctors studying the behavior of a gene X may be glad to learn about the diseases it can cause and check some experimental data sets
Query Processing
[Figure: a requester issues query q with q.t = t1 and q.TTL = 2; the query is forwarded among peers u1 to u7 (topic labels shown: t1,t2; t1; t1; t3; t2,t3; t2) with q.TTL = 1 and then q.TTL = 0. Each contacted peer computes sim(doc, q), and the recommended docs are returned with a summary of document similarity and classification info. u1’s FOAF file holds u1’s topics of interest and its friends, with a link to u5’s FOAF file and u5’s topics.]
1) Query q is mapped to a topic or topics Tq
2) Select the top-k friends in the FOAF file with respect to the query topics (cosine similarity)
3) Redirect the query
4) Repeat 2) and 3) recursively until the TTL expires
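Steps 2) to 4) can be sketched as a TTL-bounded recursive forwarding. This is an illustration under assumptions: topic profiles are sparse weight vectors, the friendship graph and profiles below are hypothetical, and forwarding is simulated with local function calls rather than network messages. The local sim(doc, q) computation at each peer is not shown.

```python
import math


def cosine(vec_a, vec_b):
    """Cosine similarity of two sparse topic vectors (dict: topic -> weight)."""
    dot = sum(w * vec_b.get(t, 0.0) for t, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def route_query(peer, query_topics, ttl, friends, topics, k, visited=None):
    """Return the peers reached by the query, in visiting order."""
    visited = [] if visited is None else visited
    visited.append(peer)
    if ttl == 0:
        return visited
    candidates = [f for f in friends.get(peer, []) if f not in visited]
    # Rank friends by similarity between the query topics and their profile.
    ranked = sorted(candidates,
                    key=lambda f: cosine(query_topics, topics[f]),
                    reverse=True)
    for friend in ranked[:k]:
        if friend not in visited:
            route_query(friend, query_topics, ttl - 1, friends, topics, k, visited)
    return visited


# Hypothetical topic profiles and FOAF links.
topics = {
    "u1": {"t1": 1.0, "t2": 1.0},
    "u3": {"t3": 1.0},
    "u4": {"t1": 1.0},
    "u5": {"t1": 1.0, "t2": 1.0},
    "u6": {"t2": 1.0, "t3": 1.0},
}
friends = {"u1": ["u5", "u6", "u3"], "u5": ["u4", "u6"]}

print(route_query("u1", {"t1": 1.0}, ttl=2, friends=friends, topics=topics, k=1))
# ['u1', 'u5', 'u4']
```

With k = 1 and TTL = 2 the query on t1 follows the single best friend at each hop and stops when the TTL reaches 0.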
Conclusions P2PRec
• P2PRec (BDA2010)
  – Find friends (relevant users on similar topics) while gossiping
  – Query processing exploits relevant users with respect to the query topics, recursively (FOAF friends)
  – Performance evaluation
    • Recall x Precision x Response Times
  – Limitation of LDA: needs some centralization for training, but good to validate our general approach
  – However, there are other possibilities:
    • Ontology-based automatic annotation
    • This exists for biomedical documents
P2P Query Reformulation*
• P2P Data Management System (PDMS)
• Each peer has:
  – Its own schema (and data)
  – 1 or more mapping acquaintances to/from which at least 1 mapping rule exists
• Goal: given a query, exploit mapping acquaintances as much as possible to enhance query responses.
[Figure: peers A and B, each with its own schema (Schema A, Schema B) and data, connected by the mapping Mb,a.]
*Joint work with A. Bonifati, G. Summa, P. Valduriez, to appear as Inria report
Concepts
[Figure: peers B and A, each with a schema and data, connected by the mapping rule Mb,a.]
Query: ?= Hospital(x, “San Francisco”)
MAPPING RULE Mb,a: Hospital(x, y) ⇢ HealthCareInst(x, y, z), where the left-hand side is the BODY, the right-hand side is the HEAD, and both are made of atoms
Source: Hospital [0..*] (name, location); Grant [0..*] (amount, institution, manager); Doctor [0..*] (name, salary)
Target: HealthCareInst [0..*] (name, city, id); Grant [0..*] (amount, scientist)
Translation ALONG Mb,a: Q ?= Hospital($X, “San Francisco”) becomes Q’ ?= HealthCareInst($X, “San Francisco”, $Z)
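The rewriting of Q into Q’ can be sketched for the simplest case, a mapping with one atom on each side. The representation is an assumption: atoms are (label, args) tuples, mapping arguments are variable names, and query arguments are either variables ("$X") or quoted constants. The function name is hypothetical, and the sketch does not handle repeated variables or clashes between fresh and existing variable names.

```python
def translate(query, matched_side, other_side):
    """Rewrite a query atom through a single-atom mapping.

    Along the mapping: matched_side is the body, other_side the head.
    Against the mapping: matched_side is the head, other_side the body.
    Returns the rewritten atom, or None if the query does not match.
    """
    q_label, q_args = query
    m_label, m_args = matched_side
    if q_label != m_label or len(q_args) != len(m_args):
        return None
    # Bind each mapping variable to the query argument at the same position.
    binding = dict(zip(m_args, q_args))
    o_label, o_args = other_side
    # Variables that only occur on the other side become fresh query variables.
    new_args = tuple(binding.get(v, "$" + v.upper()) for v in o_args)
    return (o_label, new_args)


q = ("Hospital", ("$X", '"San Francisco"'))

# Along Mb,a: Hospital(x, y) -> HealthCareInst(x, y, z)
body = ("Hospital", ("x", "y"))
head = ("HealthCareInst", ("x", "y", "z"))
print(translate(q, body, head))
# ('HealthCareInst', ('$X', '"San Francisco"', '$Z'))

# Against a mapping Institution(x, y, z) -> Hospital(x, y)
print(translate(q, ("Hospital", ("x", "y")), ("Institution", ("x", "y", "z"))))
# ('Institution', ('$X', '"San Francisco"', '$Z'))
```

The same function covers both directions because only the roles of the two sides are swapped.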
Mapping Relevance
• Each time a query gets translated by exploiting a mapping, we get a relevant rewriting
• The relevance can be Forward (along) or Backward (against), depending on the matched side of the mapping
• Goal:
  – Collect as many rewritings as possible
  – Find the most interesting paths to take (avoid useless paths)
M1 Hospital(x, y) ⇢ HealthCareInst(x, y, z)
M2 Institution(x, y, z) ⇢ Hospital(x, y)
?= Hospital(x, “San Francisco”)
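Problem
[Figure: a network of peers (A, B, C, D, Z, …, H, L, M) connected by mappings Mb,a, Mb,d, Mb,z and Mc,a; the query Q is rewritten into Q’ and Q’’ ALONG some mappings and AGAINST another.]
1) How to choose the most relevant paths to undertake in the reformulation task?
2) Are there other peers in the network which can be contacted?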
Acquaintances
• Gossiping acquaintances
  – Potential friends that dynamically appear in the local semantic view (LSV)
• Mapping acquaintance
  – There is at least 1 direct mapping towards it (friend)
  – Established manually
• Social acquaintance (FOAF friend)
  – No direct mapping is needed towards it
  – There are some common interests
  – Established explicitly
Our Approach
• Gossip to disseminate mapping-rule information and to find friends
• Users’ topics of interest
  – are expressed according to the schema information or past queries’ topics
• Measures to
  – Compute the relevance of a mapping with respect to a query
  – Compute similarity between users
• Exploits recursively (to translate a query)
  – Mapping acquaintances
  – Social acquaintances
Gossiping Acquaintances
Social Acquaintances
• Friend
  – Shares common topics of interest
• Interests
  – Formulated by queries
  – Elements of the peer’s schema
• Approach: use the semantic view to discover friends
[Figure: a peer’s schema together with example queries expressing its interests:]
?= Hospital(x, “San Francisco”)
?= State( y, z, “California”)
?= Doctor( w, k)
?= Patology(“heart”, x)
…
Compute Relevance
Goal: given a query and a mapping rule, determine if the mapping is relevant to the query
Method (Standard Match Semantics)
  – Atom label matching
  – Parameter compatibility
M1 Hospital(x, y) AND State (x,z) ⇢ HealthCareInst(x, y, z)
?= Hospital(x, “San Francisco”)
M2 Hospital(x, y,w) AND State (x,z) ⇢ HealthCareInst(x, y, z)
M3 Ospedale(x,y) AND State (x,z) ⇢ HealthCareInst(x, y, z)
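The two matching conditions can be sketched on the three mappings above. The slide does not define "parameter compatibility"; this sketch assumes it means that the query atom and the mapping atom have the same number of arguments, which is a simplification. Only the forward direction (query matched against the body) is shown, and the function names are hypothetical.

```python
def matches(query_atom, mapping_atom):
    """Label match plus assumed parameter compatibility (same arity)."""
    q_label, q_args = query_atom
    m_label, m_args = mapping_atom
    return q_label == m_label and len(q_args) == len(m_args)


def is_relevant_forward(query_atom, body):
    """A mapping is forward-relevant if some body atom matches the query atom."""
    return any(matches(query_atom, atom) for atom in body)


q = ("Hospital", ("x", '"San Francisco"'))

m1_body = [("Hospital", ("x", "y")), ("State", ("x", "z"))]
m2_body = [("Hospital", ("x", "y", "w")), ("State", ("x", "z"))]
m3_body = [("Ospedale", ("x", "y")), ("State", ("x", "z"))]

print(is_relevant_forward(q, m1_body))  # True: same label, same arity
print(is_relevant_forward(q, m2_body))  # False: arity differs
print(is_relevant_forward(q, m3_body))  # False: label differs
```

Under this assumption M1 is the only relevant mapping; a richer compatibility test (for example on constants or types) could change the outcome for M2.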
Compute Relevance
• AF-IMF measure, inspired by TF-IDF*
• AF (Atom Frequency)
  – Local measure, establishing the importance of the query atom in the current mapping
• IMF (Inverse Mapping Frequency)
  – Distributed measure, establishing the overall importance of the query atom
• Relevance of a mapping with respect to q is AF * IMF
*term frequency-inverse document frequency
Compute Relevance (AF)
About the applied measure
◦ To increase the effectiveness of the measure we distinguish, again, Forward/Backward relevance
?= Hospital(x, “San Francisco”)
M1 Hospital(x, y) AND State (x,z) ⇢ HealthCareInst(x, y, z) (body ⇢ head)
FORWARD MEASURE: AF = 1/2
M2 Institution(x, y, z) ⇢ Hospital(x, y)
BACKWARD MEASURE: AF = 1
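The slides give the two AF values but not the formula. One reading that reproduces both values, and that is only an assumption here, is: AF is the number of atoms on the matched side whose label equals the query atom's label, divided by the number of atoms on that side (the body for the forward measure, the head for the backward measure).

```python
from fractions import Fraction


def atom_frequency(query_label, side_atoms):
    """Share of atoms on one side of a mapping that carry the query label."""
    total = len(side_atoms)
    if total == 0:
        return Fraction(0)
    hits = sum(1 for label, _ in side_atoms if label == query_label)
    return Fraction(hits, total)


# M1: Hospital(x, y) AND State(x, z) -> HealthCareInst(x, y, z)
m1_body = [("Hospital", ("x", "y")), ("State", ("x", "z"))]
# M2: Institution(x, y, z) -> Hospital(x, y)
m2_head = [("Hospital", ("x", "y"))]

print(atom_frequency("Hospital", m1_body))  # 1/2  (forward measure on M1)
print(atom_frequency("Hospital", m2_head))  # 1    (backward measure on M2)
```

`Fraction` is used only to print the values in the same form as the slide.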
Compute Relevance (IMF)
• IMF requires a way to get a value for
  – The total number of mappings
  – The total number of mappings containing that atom
• To do that, we can inspect the semantic view of the peer
  – Also by sending inquiries to peers in the FOAF
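The slides state which two counts IMF needs but not how they are combined. By analogy with IDF, a plausible form (an assumption, not the authors' stated formula) is the logarithm of the ratio between the total number of mappings and the number of mappings containing the atom. In the sketch both counts are passed in directly; in the system they would be estimated from the peer's semantic view.

```python
import math


def inverse_mapping_frequency(total_mappings, mappings_with_atom):
    """IDF-style weight: log(N / n). Assumed form, by analogy with TF-IDF."""
    if mappings_with_atom == 0:
        # Assumption: an atom seen in no known mapping gets weight 0.
        return 0.0
    return math.log(total_mappings / mappings_with_atom)


def relevance(af, total_mappings, mappings_with_atom):
    """Relevance of a mapping with respect to a query atom: AF * IMF."""
    return af * inverse_mapping_frequency(total_mappings, mappings_with_atom)


# Hypothetical counts: 100 known mappings, 10 of them contain the query atom.
print(relevance(0.5, 100, 10))
```

An atom that appears in every known mapping gets IMF = 0, so it does not help to discriminate between mappings, as with IDF.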
Translate-Query
• Compute relevance on local mappings with respect to Q
  – Choose the TopK mappings
  – Apply the translation semantics, along/against the mapping direction
  – Trigger Translate-Query on the mapping acquaintance, recursively (until TTL)
• Select FOAF friends to be contacted
  – By looking at the best mapping summaries with respect to Q
  – Trigger Translate-Query on the social acquaintance, recursively (until TTL)
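The first half of Translate-Query (the mapping-acquaintance branch) can be sketched as a TTL-bounded recursion. Everything concrete here is an assumption for illustration: queries are reduced to a single atom label, each mapping is a (body_label, head_label, neighbour) triple, the relevance score is a stand-in that only checks labels instead of AF-IMF, and the peers and labels "Physician" and "Clinic" are hypothetical. The FOAF branch is not shown.

```python
def translate_query(peer, query_label, ttl, mappings, score, top_k, rewritings=None):
    """Collect the (peer, query_label) rewritings reachable within ttl hops."""
    rewritings = [] if rewritings is None else rewritings
    rewritings.append((peer, query_label))
    if ttl == 0:
        return rewritings
    # Keep only relevant local mappings, best first, and cut at top_k.
    local = [m for m in mappings.get(peer, []) if score(query_label, m) > 0]
    local.sort(key=lambda m: score(query_label, m), reverse=True)
    for body, head, neighbour in local[:top_k]:
        # Along the mapping if the query matches the body, against it otherwise.
        new_label = head if query_label == body else body
        if (neighbour, new_label) not in rewritings:
            translate_query(neighbour, new_label, ttl - 1,
                            mappings, score, top_k, rewritings)
    return rewritings


def label_match_score(query_label, mapping):
    """Stand-in for AF-IMF: 1.0 if the label appears in the mapping, else 0.0."""
    body, head, _ = mapping
    return 1.0 if query_label in (body, head) else 0.0


# Hypothetical network: peer -> list of (body_label, head_label, neighbour).
mappings = {
    "B": [("Hospital", "HealthCareInst", "A"), ("Doctor", "Physician", "D")],
    "A": [("HealthCareInst", "Clinic", "C")],
}

print(translate_query("B", "Hospital", 2, mappings, label_match_score, top_k=1))
# [('B', 'Hospital'), ('A', 'HealthCareInst'), ('C', 'Clinic')]
```

Replacing `label_match_score` with an AF-IMF based score changes only the ranking step, not the recursion.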
Performance Evaluation
• Baseline
  – No gossiping, original query propagated
• Baseline+
  – No gossiping, translated query propagated
• Baseline#
  – No gossiping, translated query propagated, local measure to sort mappings (by using AF only)
• Full-
  – Gossiping, translated query propagated, AF-IMF measure to sort mappings, no FOAF links (only local mappings)
• Full (P2PRec)
  – Gossiping, translated query propagated, AF-IMF measure to sort mappings, FOAF links exploited
[Figure: recall (0% to 100%) as a function of the TopK mapping threshold (TOPKM-1, 5, 10, 15, 20, 25) for Baseline, Baseline+, Baseline#, Full- and Full. Caption: Effectiveness of AF-IMF, LSV and gossiping.]
Conclusions
• P2P Query Reformulation
  – Gossiping is used to disseminate mapping-rule information
  – Exploits relevant mappings recursively
    • Mapping acquaintances
    • Social acquaintances
  – Initial performance results:
    • Very good recall results (over 90%)
    • Linear scale-up
    • Trade-off between recall and response times
  – Previous work uses
    • DHTs or a centralized mediation model.
About Montpellier
• Best quality of life in France
• Important laboratories (LIRMM) and research institutes (INRA, CIRAD, etc.)
• University of Montpellier is part of the « opération campus »
• Soon we will have a direct TGV line to Barcelona (1 hour)