P2P-Join: A Keyword Based Join Operation
in Relational Database Enabled Peer-to-Peer Systems
Zhigang Chen, Zhongding Huang
Shanghai Second Polytechnic University
No.2360 Jinhai Road, Shanghai, 201209
{chen zhigang, huangzd}@sohu.com
Bo Ling, Jiang Li
China Executive Leadership Academy, Pudong
No.99 Qiancheng Road, Shanghai, 201204
{bling, jli}@celap.org.cn
Abstract
Query-by-keywords is the most popular way to search for data
today. However, most existing work targets centralized relational
databases. This paper investigates how keyword search can be
deployed in relational database-enabled peer-to-peer systems.
Unlike in a centralized system, the key challenges in this new
computing paradigm come from the autonomy of peers, the lack of a
global schema, and the dynamics of peer connectivity. First, the
concept of P2P-Join is proposed: a join operation that combines
tuples, drawn from relations on different peers, that contain
certain keywords in the query. Second, a fully distributed
framework to realize P2P-Join processing is devised, which not only
inherits the syntax and semantics of the traditional join but also
follows the principles of peer-to-peer computing. Finally, two
mechanisms are proposed to improve the performance of the
operation: a join path selection scheme and a push-based load
balancing mechanism across peers.
1 Introduction
Peer-to-peer (P2P) technology, also called peer computing, is an
emerging paradigm that is now viewed as a potential technology
that could re-construct distributed architectures (e.g., the
Internet). In a P2P distributed system, a large number of nodes
(e.g., PCs connected to the Internet) can potentially be pooled
together to share their resources, information and services. These
nodes, which can both consume and provide data and/or services,
may join and leave the P2P network at any time, resulting in a
truly dynamic and ad-hoc environment. The nature of such a design
provides exciting opportunities for new applications.
While data sharing is the dominant application in P2P computing at
present, only file-system-like capabilities are provided, and the
semantics of data is largely ignored. In some unstructured systems
(e.g., Gnutella [4]), searching is restricted to strings contained
in a filename and directory path. Structured systems (e.g., Chord
[20], CAN [17], and Pastry [18]) only support file searching by
exact name match. In short, existing P2P systems support only
semantics-free, large-granularity data sharing, and lack data
management capabilities and support for semantic search. This is
because P2P systems do not take semantics, data transformation and
data relationships into account [5].
Recently, many researchers have worked to integrate P2P with
database technologies into a highly distributed data sharing
environment [5, 7, 10, 11, 12, 15, 16]. Such systems, in which
autonomous relational databases act as peers, are referred to as
relational database-enabled P2P systems (PeerDBS). In this paper,
a novel operation is introduced into this type of system to mine
knowledge from P2P networks. Specifically, the main contributions
of this paper are as follows:
• First, the concept of the P2P-Join operation is proposed, which
  joins tuples from relations on different peers that contain
  certain keywords in the query;
• Second, a fully distributed framework to realize the P2P-Join
  operation is devised, which not only inherits the syntax and
  semantics of the traditional join operation, but also follows
  the principles of P2P computing;
• Finally, two mechanisms are proposed to improve the performance
  of the operation: a join path selection scheme, and a push-based
  load balancing mechanism across peers.
The rest of the paper is organized as follows. Section 2
reviews some related work. Section 3 states the problem
and gives definitions. Section 4 presents the framework of
P2P-Join. Section 5 discusses heuristics that improve
performance, and Section 6 concludes the paper.
Proceedings of the 17th International Conference on Database and Expert Systems Applications (DEXA'06)0-7695-2641-1/06 $20.00 2006
Authorized licensed use limited to: Maharashtra Institute of Technology. Downloaded on August 16,2010 at 11:49:18 UTC from IEEE Xplore. Restrictions apply.
2 Related Work
P2P technologies have been deployed in many applications, such as
instant messaging (IM) [13], collaborative workgroup tools [6],
CPU cycle sharing [19] and data sharing [3, 4, 15]. While most of
these applications concern data sharing, current mechanisms are
largely restricted to file-level sharing without the capabilities
of a relational database.
In [5], the issues of data management in a P2P environment are
discussed from a database perspective. Its focus is largely on
what database technologies can do for P2P applications. Though a
preliminary architecture for peer data management (Piazza) is
described, little is discussed about how peers cooperate. In
contrast to [5], PeerOLAP [10] approaches the problem from the
opposite direction: it looks at what P2P technologies can do for
database applications. Essentially, PeerOLAP is still a
client/server system. However, the cooperation among clients
(peers) is explored: all data within clients is shared.
Since each peer maintains its data independently, some kind of
mutual understanding among peers is required to achieve
semantically meaningful search. In PeerDB [15], an IR-based
approach is used to mediate peer schemas, with a global thesaurus
being assumed. In [16], a data model (LRM) is designed for P2P
systems, with domain relations and coordination formulas
describing the relationships between two peer databases. Data
Mapping [12] is a simplified implementation of this model.
Contributions have also been made by data integration researchers.
However, unlike traditional data integration systems, where a
global schema is assumed over a few data sources, a P2P system
cannot simply assume a global schema, due to its highly dynamic
nature and its large number of data sources. Nevertheless, it may
be possible to compose mediators, with some mediators defined in
terms of other mediators and data sources, thus achieving system
extensibility. This is the main thrust of [7] and [11]. While [7]
focuses on how queries are reformulated with views, [11] focuses
on selective view expansion, considering that full view expansion
may be prohibitively expensive. Like [7] and [11], our work is
mainly interested in the query processing aspects of peer database
systems. However, we make few assumptions about peer availability,
which means that our work can be applied to a more general P2P
environment.
Exploiting keyword search in database querying has also drawn some
attention recently [1, 2, 9]. However, these works focus only on
centralized systems, where the semantics across different
relations (e.g., key/foreign-key relationships) can be exploited
to improve the search accuracy. Such semantics are much harder to
exploit in P2P systems, which is the main challenge of our work.
3 Problem Statement
3.1 An Overview of Relational Database-enabled Peer-to-Peer Systems
For generality, our work is based on common P2P protocols. Such a
system has the following properties:
• The system is decentralized and self-organized, while its peers
  are autonomous and dynamic;
• Information is distributed among peers rather than concentrated
  at dedicated servers;
• Peers are equivalent in functionality and responsibility, and
  interact with each other symmetrically.
To further exploit the merits of P2P protocols, we assume a
three-layer architecture for a relational database-enabled P2P
system. Specifically, from bottom to top, the layers are the
structured layer, the unstructured layer and the application
layer. The structured layer employs the protocol of structured P2P
systems (e.g., Chord [20] and CAN [17]), which uses the DHT [8]
mechanism to manage the meta-data of peers. The unstructured layer
is implemented upon the protocol of unstructured P2P systems
(e.g., BestPeer [14]), so that a peer in the system can be
autonomous and able to reconfigure its neighbors dynamically. The
application layer is a relational database system.
3.2 Query and Answer
A query is modelled as a set of keywords together with an
identifier, i.e., q = <{k1, ..., kt}, qid>. Here, {k1, ..., kt} is
a set of keywords that semantically describes what the user is
looking for, while qid is a system-wide unique identifier for the
query q, generated by its initiator. This query style is chosen
mainly because there is no global schema in P2P database systems
and, moreover, because it is user-friendly.
Each peer maintains its data in a relational database. When a
query is processed, the peer searches its own database and returns
the tuples containing all or part of the keywords in the query.
Keyword searching in local relational databases can be done by
DISCOVER [9] or other methods. Our focus is not on local
processing, but on how to integrate tuples from different peers.
The results of a local query can take two forms: either a tuple in
a relation contains all keywords in the query, or a tuple contains
only a subset of the keywords. For the latter, a final answer that
contains all keywords in the query may be obtained by joining
tuples from multiple relations. In addition, we expect some
answers to be obtainable from individual peers (even those that
involve joins), while others require cooperation among peers. We
refer to the latter category as P2P-Join, which is our main
interest and is defined next.
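The query model and the two forms of local result can be sketched as follows. This is a minimal illustration, not the paper's implementation: the names `Query` and `local_search` are hypothetical, a relation is assumed to be a list of dictionaries, and a tuple is assumed to "contain" a keyword when some attribute value equals it exactly. A real peer would delegate this step to a method such as DISCOVER [9].

```python
import uuid
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Query:
    """A keyword query: a set of keywords plus a unique identifier."""
    keywords: frozenset
    # Assumption: a random UUID stands in for the system-wide unique qid.
    qid: str = field(default_factory=lambda: uuid.uuid4().hex)


def local_search(query, relation):
    """Split a relation's tuples into full and partial keyword matches.

    Returns (full, partial): `full` holds tuples containing every keyword,
    `partial` holds (tuple, matched_keywords) pairs for proper subsets.
    """
    full, partial = [], []
    for row in relation:
        hit = query.keywords & set(row.values())
        if hit == query.keywords:
            full.append(row)             # can be returned directly
        elif hit:
            partial.append((row, hit))   # candidate for a P2P-Join
    return full, partial
```

Tuples in `full` can be sent straight back to the requesting peer, while those in `partial` are the ones that need cooperation among peers.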
3.3 Peer-to-Peer Join
Definition 1 P2P-Join is a join operation that combines
two (or more) relations from two (or more) peers based on
the semantics of keywords and syntax of join operation of
relational database systems.
For example, consider a query Q = {k1, k2}. Suppose there are two
peers P1 and P2, such that peer P1 maintains a relation
R1(A1, ..., B, ...) and peer P2 maintains a relation
R2(A2, ..., B, ...), and both relations share a common attribute
B. Suppose further that some values of attribute A1 are k1 and
some values of attribute A2 are k2. If there exist a tuple
<k1, ..., b, ..., x> in R1 and a tuple <k2, ..., b, ..., y> in R2,
then a P2P join operation can be performed, and the result is
<k1, b, k2, ..., x, y>.
From the above example, we present a theorem for P2P-Join, which
can be easily proved.
Theorem 1 Two tuples on different peers are joined by a P2P join
operation if and only if they have equal values on a common
attribute of their two relations and, furthermore, the two tuples
contain different keywords in the query Q.
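The example and Theorem 1 can be illustrated with a small sketch. It assumes that relations are lists of dictionaries and that the common attribute is literally named `B`; the function name `p2p_join` and the `R1.`/`R2.` prefixes on output attributes are illustrative choices, not part of the paper.

```python
def p2p_join(r1, r2, k1, k2, common="B"):
    """Join tuples of r1 containing k1 with tuples of r2 containing k2.

    Two tuples are combined only if they agree on the common attribute
    and carry different query keywords, as stated in Theorem 1.
    """
    left = [t for t in r1 if k1 in t.values()]
    right = [t for t in r2 if k2 in t.values()]

    # Hash the right-hand tuples on the common attribute.
    index = {}
    for t in right:
        index.setdefault(t[common], []).append(t)

    results = []
    for t in left:
        for u in index.get(t[common], []):
            merged = {f"R1.{a}": v for a, v in t.items() if a != common}
            merged.update({f"R2.{a}": v for a, v in u.items() if a != common})
            merged[common] = t[common]
            results.append(merged)
    return results
```

With R1 holding a tuple whose A1 is k1 and whose B is b, and R2 holding a tuple whose A2 is k2 and whose B is b, the function returns a single combined tuple carrying k1, b and k2, mirroring the result in the example.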
4 Framework of P2P-Join
4.1 An Overview
In general, query processing consists of six steps: query
distribution, local processing, information exchange among peers,
join graph generation, P2P join, and result propagation.
When a query is submitted, it is distributed to all neighbors. The
neighbors further forward the query to their own neighbors, and so
on, until the query's lifetime (Time-to-Live, i.e., TTL) expires.
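TTL-bounded query distribution can be sketched as a breadth-first flood. The sketch below is a single-process simulation under stated assumptions: the overlay is given as an adjacency dictionary, and a peer ignores a query it has already seen (in a real system this would be detected through the query identifier). The function name `flood` is hypothetical.

```python
from collections import deque


def flood(neighbors, initiator, ttl):
    """Return the set of peers reached when a query is forwarded for `ttl` hops."""
    reached = {initiator}
    frontier = deque([(initiator, ttl)])
    while frontier:
        peer, remaining = frontier.popleft()
        if remaining == 0:
            continue  # lifetime expired: do not forward any further
        for n in neighbors.get(peer, ()):
            if n not in reached:  # assumed duplicate suppression by query id
                reached.add(n)
                frontier.append((n, remaining - 1))
    return reached
```

On a chain of peers A, B, C, D, a query issued at A with a TTL of 2 reaches A, B and C but not D.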
Each peer that receives the query first looks up the full-text
index of its database to decide whether it contains some or all of
the keywords in the query. Based on keyword searching methods in
databases, tuples that contain all keywords are returned to the
requesting peer directly, while tuples containing only some of the
keywords are used for the peer-to-peer join. At present, there are
many approaches that support local keyword-based query processing,
such as DISCOVER [9], BANKS [2] and DBXplorer [1].
Based on the results of local processing, each peer exchanges
information with its neighbors, namely the keywords found in the
tuples that include only a subset of the keywords in the query.
Having obtained this information from its neighbors, a peer
generates a join graph, in which the vertices are peers and the
edges connect the pairs of peers that should be joined. How a join
graph is generated is described in the next section.
With the join graph, a peer can identify the peers with which it
can be joined, and try to perform a P2P join. Join results that
contain all keywords in the query can be returned directly to the
requesting peer, while results containing only some of the
keywords are propagated along the join paths until final results
including all keywords in the query are generated and returned to
the user.
4.2 The Generation of Join Graph
A join graph is a graph G(V, E), in which V is the vertex set and
E is the edge set. In such a graph, each vertex denotes a peer
together with one keyword in the query Q. Under this scheme, one
peer may be denoted by several different vertices if it contains
several keywords in the query. Two peers are connected by an edge
if and only if they contain two different keywords in the query
and they share certain common attributes with the same meaning.
The vertices of a single peer may be connected to each other if
the peer contains more than one keyword in the query and the
corresponding tuples can be joined locally.
After processing the query locally, each peer sends the keywords
in the query that it contains to all of its neighbors that have
also been reached by the query with the same query identifier.
When a peer receives the keywords from one of its neighbors, it
compares them with the query keywords that appear in its own local
database. If its local database contains one or more keywords that
differ from those of the neighbor, it establishes a connection
with that peer. With these operations, the join graph for a query
is generated.
From the above, we can see that only peers connected by an edge in
the join graph need to be joined.
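Join graph generation can be sketched in a centralized form for illustration, although in the framework each peer only builds its own part from the keywords its neighbors send. The function name `build_join_graph` and the `joinable` predicate (which stands for "the two peers share a common attribute with the same meaning") are assumptions of this sketch. Edges between vertices of the same peer, which correspond to local joins, are left out for brevity.

```python
def build_join_graph(peer_keywords, neighbors, joinable):
    """Build a join graph whose vertices are (peer, keyword) pairs.

    peer_keywords: peer -> set of query keywords found in its database
    neighbors:     peer -> iterable of neighboring peers
    joinable:      hypothetical predicate (p, q) -> bool, True when the
                   two peers share a common attribute
    """
    vertices = {(p, k) for p, kws in peer_keywords.items() for k in kws}
    edges = set()
    for p, kws in peer_keywords.items():
        for q in neighbors.get(p, ()):
            if not joinable(p, q):
                continue
            for kp in kws:
                for kq in peer_keywords.get(q, ()):
                    if kp != kq:  # only different keywords create an edge
                        edges.add(frozenset({(p, kp), (q, kq)}))
    return vertices, edges
```

Storing each edge as a `frozenset` of its two endpoints keeps the graph undirected, so the edge found from P1's side and the same edge found from P2's side collapse into one.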
4.3 Peer-to-Peer Join
First, we consider the problem of a P2P join between two relations
from two different peers. For example, relations
R1(A1, ..., B, ...) and R2(A2, ..., B, ...) are two relational
tables belonging to two different peers P1 and P2. According to
the definition of P2P join, the tuples of R1 whose value of
attribute A1 is not k1, and the tuples of R2 whose value of
attribute A2 is not k2, can be filtered out directly, while the
other tuples are retained. We would like the filtering operation
to be performed locally first, since this reduces the bandwidth
cost and distributes the processing load over different peers,
thus improving query processing performance to some extent.
We can further improve performance with optimization techniques
employed in RDBMSs, such as semi-join.
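Local filtering followed by a semi-join can be sketched as below. The paper only names semi-join as one possible optimization, so the message flow here is an assumed, textbook-style arrangement: P1 ships the distinct values of the common attribute, and P2 ships back only the tuples that match them. The function names and the attribute names `A1`, `A2` and `B` are taken from the running example or invented for illustration.

```python
def local_filter(relation, attr, keyword):
    """Keep only the tuples whose `attr` value equals the keyword."""
    return [t for t in relation if t[attr] == keyword]


def semi_join_p2p(r1, r2, k1, k2, a1="A1", a2="A2", common="B"):
    """Simulate a semi-join based P2P join between peers P1 (r1) and P2 (r2).

    Returns (joined_tuples, shipped_values, shipped_tuples) so that the
    amount of data crossing the network can be inspected.
    """
    # Step 1 (on P1): filter locally by keyword k1.
    s1 = local_filter(r1, a1, k1)
    # Step 2 (P1 -> P2): ship only the distinct common-attribute values.
    b_values = {t[common] for t in s1}
    # Step 3 (on P2): filter by k2, then keep tuples matching a shipped value.
    s2 = [t for t in local_filter(r2, a2, k2) if t[common] in b_values]
    # Step 4 (P2 -> P1): ship the reduced tuples and complete the join on P1.
    by_b = {}
    for u in s2:
        by_b.setdefault(u[common], []).append(u)
    joined = [{**u, **t} for t in s1 for u in by_b.get(t[common], [])]
    return joined, len(b_values), len(s2)
```

The two counters make the saving visible: only the tuples of R2 that can actually participate in the join are transferred.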
4.4 The Propagation of Results
A join path is a path in the join graph, in which each
keyword in the query appears exactly once in the vertices. A
final result can only be obtained after all join operations on the
edges of a join path have been executed. Furthermore, once all
join paths have been traversed, we assume that the complete answer
set for the given query has been obtained. Therefore, to obtain
the results of a query, each join path should be traversed, as
described by the following procedure.
BEGIN
  For each join path
    For each edge (p, q) connecting peers p and q
      P2P-Join p and q
      Reset neighbors and keywords
      Delete edge (p, q)
    Once the path has been traversed,
      send results to the requesting peer
END
Note that the neighborhood relationship changes as the join
process goes on. Furthermore, the process traverses all the join
paths, so the complete answer set for the given query is obtained
when it finishes. Compared to traditional processing, the above
traversal procedure in a P2P system is fully distributed.
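The notion of a join path can be made concrete by enumerating such paths on a small graph. This is a centralized sketch for illustration only, whereas the procedure in the paper runs in a distributed fashion. It assumes vertices are `(peer, keyword)` pairs and edges are two-element `frozenset`s of vertices; the function name `join_paths` is hypothetical.

```python
def join_paths(vertices, edges, keywords):
    """Enumerate paths in which every query keyword appears exactly once."""
    adj = {v: set() for v in vertices}
    for e in edges:
        a, b = tuple(e)
        adj[a].add(b)
        adj[b].add(a)

    found = set()

    def extend(path, seen):
        if seen == keywords:
            # Keep one orientation so a path and its reverse count once.
            found.add(min(tuple(path), tuple(reversed(path))))
            return
        for nxt in adj[path[-1]]:
            if nxt[1] not in seen:  # each keyword may be used only once
                extend(path + [nxt], seen | {nxt[1]})

    for v in vertices:
        extend([v], {v[1]})
    return sorted(found)
```

Each returned path lists the vertices whose pairwise P2P joins, executed edge by edge, yield tuples containing all keywords of the query.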
The above algorithm is similar to the problem of graph traversal,
which has been proved to have exponential cost. However, with a
fully distributed approach, the situation is different. Here, we
only perform a worst-case time analysis. Suppose that, in a P2P
system, $n$ peers have been reached by the query, and that the
maximum expected number of neighbors of a peer is $k$. Obviously,
$k \ll n$. Then the internal loop is executed at most
$O(k) \cdot m!$ times, where $m$ is the number of keywords in the
query. Therefore, the total execution time in the worst case is
$O(k n m!)$. Since $k$ and $m$ are much smaller than $n$, the time
complexity of the algorithm is much less than $n^2$.
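As a rough numerical illustration of the bound above, take the hypothetical values $n = 10^4$ peers, $k = 5$ neighbors and $m = 3$ keywords. These numbers are assumptions chosen for the example and do not come from the paper's experiments.

```latex
k \, n \, m! \;=\; 5 \times 10^{4} \times 3! \;=\; 3 \times 10^{5}
\;<\; n^{2} \;=\; 10^{8},
\qquad\text{and in general}\qquad
k \, n \, m! < n^{2} \iff k \, m! < n .
```

The comparison therefore holds as long as the product of the neighbor bound and $m!$ stays below the number of peers reached, which is plausible for short keyword queries.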
5 Improvements
In this section, two heuristics are presented to further improve
the efficiency of query processing in peer database systems. The
first heuristic deals with the order of join path selection. The
second heuristic balances the load among peers.
5.1 Join Path Selection
In the previous section, we assumed that a P2P join is processed
in a local manner, that is, the join operation between each pair
of peers does not affect other peers. However, some peers may have
more neighbors while others have fewer, so the relative importance
of edges among peers in a join graph or subgraph differs.
Therefore, different sequences of join operations along the edges
of a join path have different costs, which provides opportunities
for further optimization.
Figure 1. An example of Join Graph (edges labelled 1 to 5)
In the context of databases, it is widely accepted that sharing
common computation can always be beneficial. For example, if we
need to process both $(A \bowtie B)$ and $(A \bowtie B \bowtie C)$,
we can process $(A \bowtie B)$ first and store its result for
later use when processing $(A \bowtie B \bowtie C)$. Here, we use
this heuristic to guide the choice of join sequence.
Using the above join graph as an example, we now illustrate how
the heuristic can help reduce computation cost. We consider how
different orders of operation on the edges affect the reuse of
edge 3 in the join graph, which reflects how efficiently resources
are utilized in relational database systems.
1. If the join operation on edge 3 is executed first, the partial
result can be reused by the further join operations (1,3),
(1,3,4), (1,3,5), (2,3), (2,3,4) and (2,3,5), so the total
number of reuses of edge 3 is 6.
2. However, if the join operation on edge 1 (or 2) is executed
first, the partial result of edge 3 can only be reused by
(1,3,4) and (1,3,5) (or (2,3,4) and (2,3,5)). Thus, the total
number of reuses of edge 3 is only 2.
3. In summary, the number of reuses of edge 3 in case (1) is three
times (6/2 = 3) that in case (2), which implies that the join
order greatly affects the cost and efficiency of the whole P2P
join processing.
As shown above, the sequence of join processing along the join
path is a key factor for query processing performance in P2P
systems. However, this heuristic may be difficult to implement,
since it is not easy to synchronize peers. Additionally, it is
expensive to obtain the complete graph and analyze it globally. We
design a simple mechanism to approximate this heuristic: before
joining, each peer sends its number of neighbors to its neighbors.
A peer only agrees to join with the half of its neighbors that
have more neighbors than the other half. A join operation can be
executed only when both peers agree to join.
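The mutual-agreement mechanism can be sketched as follows. The paper does not say how to split an odd number of neighbors or how to break ties, so this sketch assumes the larger half is kept (rounding up) and ties are broken by peer name; both choices, and the function names, are illustrative.

```python
import math


def preferred_partners(peer, neighbors):
    """Return the half of `peer`'s neighbors that have the most neighbors."""
    ranked = sorted(neighbors[peer], key=lambda p: (-len(neighbors[p]), p))
    # Assumption: with an odd number of neighbors, keep the larger half.
    return set(ranked[:math.ceil(len(ranked) / 2)])


def agreed_joins(neighbors):
    """Return the pairs of peers in which both sides agree to join."""
    prefs = {p: preferred_partners(p, neighbors) for p in neighbors}
    return {frozenset((p, q))
            for p in neighbors
            for q in prefs[p]
            if p in prefs[q]}
```

Only degree counts exchanged between direct neighbors are needed, so no peer has to know the complete join graph.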
5.2 Push-based Balancing
After several iterations of join operations, the peers that have
many neighbors can become very busy, which has been observed in
our experiments. To achieve better load balancing, we propose
another heuristic: each peer sets a threshold, based on its
processing power, that denotes how many join operations it can
process simultaneously. If the number of join operations a peer is
processing exceeds the threshold, it can broadcast to its
neighbors to let other peers execute the join operation instead.
Two cases of this heuristic exist. If one of the peer's neighbors
takes over the join task, only the join plan that places the
result at the other endpoint of the connection is considered.
Otherwise, the peers to be joined send their data to a third peer,
which takes over the join operation. Note that many existing join
algorithms, such as pipeline join or RIPPLE join, can execute such
tasks.
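The threshold rule can be sketched with a small in-memory model. The class name `Peer`, its fields and the fallback of running the join locally when no neighbor has spare capacity are assumptions of this sketch; the paper does not specify what happens in that last case, and the broadcast is simulated by simply scanning the neighbor list.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Peer:
    name: str
    threshold: int          # how many joins the peer can run simultaneously
    active_joins: int = 0
    neighbors: List["Peer"] = field(default_factory=list, repr=False, compare=False)

    def has_capacity(self) -> bool:
        return self.active_joins < self.threshold

    def assign_join(self) -> str:
        """Run a join locally if possible, otherwise push it to a neighbor.

        Returns the name of the peer that executes the join.
        """
        if self.has_capacity():
            self.active_joins += 1
            return self.name
        for n in self.neighbors:  # stands in for the broadcast to neighbors
            if n.has_capacity():
                n.active_joins += 1
                return n.name
        # Assumed fallback: nobody can help, so the peer runs it anyway.
        self.active_joins += 1
        return self.name
```

For example, a peer with a threshold of 1 executes its first join itself and pushes the next one to a neighbor that still has spare capacity.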
5.3 Implementation
A prototype with the P2P-Join operation has been built upon
BestPeer [14], a generic P2P platform on which P2P applications
can be developed efficiently. BestPeer integrates mobile agent and
P2P techniques. While P2P provides resource sharing amongst nodes,
mobile agents extend its functions, including the P2P-Join
operation. In addition, peers in BestPeer can dynamically
reconfigure their neighbor candidates. Further, Chord [20] is
employed to map meta-data (e.g., key, foreign key) among peers.
An experimental study has been conducted on the prototype, and the
preliminary results are promising. Furthermore, with the two
heuristics implemented, the performance is greatly improved
compared with the original proposal.
6 Conclusion
This paper deploys a relational database operation on top of P2P
computing. First, the concept of P2P-Join is proposed, which
combines tuples from relations on different peers that contain
certain keywords in the query. Further, a fully distributed method
to realize P2P-Join processing is devised, which inherits the
syntax and semantics of the traditional join and follows the
principles of P2P computing as well. Finally, two enhancements are
proposed to improve the performance of the proposed P2P-Join
operation. Since relational database-enabled operation in P2P
computing is still in its infancy, other issues remain to be
addressed, e.g., network optimization and cache management, which
are the topics of our future research.
References
[1] S. Agrawal, S. Chaudhuri, and G. Das. DBXplorer: A system for keyword-based search over relational databases. In Proceedings of the 18th ICDE, CA, April 2002.
[2] G. Bhalotia, C. Nakhe, A. Hulgeri, S. Chakrabarti, and S. Sudarshan. Keyword searching and browsing in databases using BANKS. In Proceedings of the 18th ICDE, CA, April 2002.
[3] P. Druschel and A. Rowstron. PAST: Persistent and anonymous storage in a peer-to-peer networking environment. In Proceedings of the 8th IEEE Workshop on HotOS, 2001.
[4] Gnutella Homepage. http://gnutella.wego.com/.
[5] S. Gribble, A. Halevy, Z. Ives, M. Rodrig, and D. Suciu. What can databases do for peer-to-peer. In WebDB, 2001.
[6] Groove Home Page. http://www.groove.net.
[7] A. Y. Halevy, Z. G. Ives, and D. Suciu. Schema mediation in peer data management systems. In Proceedings of the 19th ICDE, 2003.
[8] M. Harren, J. Hellerstein, R. Huebsch, B. Loo, S. Shenker, and I. Stoica. Complex queries in DHT-based peer-to-peer networks. In IPTPS02, 2002.
[9] V. Hristidis and Y. Papakonstantinou. DISCOVER: Keyword search in relational databases. In VLDB2002, 2002.
[10] P. Kalnis, B. C. Ooi, W. S. Ng, D. Papadias, and K. L. Tan. An adaptive peer-to-peer network for distributed caching of OLAP results. In ACM SIGMOD, 2002.
[11] T. Katchaounov. Query processing in self-profiling composable peer-to-peer mediator databases. In Proc. EDBT Ph.D. Workshop 2002, 2002.
[12] A. Kementsietsidis, M. Arenas, and R. Miller. Data mapping in peer-to-peer systems. In Proceedings of the 19th ICDE, 2003 (Poster Paper).
[13] MSN Home Page. http://www.msn.com/.
[14] W. S. Ng, B. C. Ooi, and K. L. Tan. BestPeer: A self-configurable peer-to-peer system. In Proceedings of the 18th ICDE, San Jose, CA, April 2002 (Poster Paper).
[15] W. S. Ng, B. C. Ooi, K. L. Tan, and A. Zhou. PeerDB: A P2P-based system for distributed data sharing. In Proceedings of the 19th ICDE, 2003.
[16] A. B. Philip, G. Fausto, K. Anastasios, M. John, S. Luciano, and Z. Ilya. Data management for peer-to-peer computing: A vision. In WebDB Workshop, 2002.
[17] S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Shenker. A scalable content-addressable network. In Proceedings of SIGCOMM, 2001.
[18] A. Rowstron and P. Druschel. Pastry: Scalable, distributed object location and routing for large-scale peer-to-peer systems. In Proceedings of the International Conference on Distributed Systems Platforms (Middleware), Germany, Nov. 2001.
[19] SETI@home Home Page. http://setiathome.ssl.berkely.edu/.
[20] I. Stoica, R. Morris, D. Karger, F. Kaashoek, and H. Balakrishnan. Chord: A scalable peer-to-peer lookup service for internet applications. In Proceedings of SIGCOMM, 2001.