P2P-Join: A Keyword Based Join Operation

in Relational Database Enabled Peer-to-Peer Systems

Zhigang Chen, Zhongding Huang

Shanghai Second Polytechnic University

No.2360 Jinhai Road, Shanghai, 201209

{chen zhigang, huangzd}@sohu.com

Bo Ling, Jiang Li

China Executive Leadership Academy, Pudong

No.99 Qiancheng Road, Shanghai, 201204

{bling, jli}@celap.org.cn

Abstract

Query-by-keywords is the most popular way to search for data today. However, most existing work addresses keyword search over centralized relational databases. This paper investigates how keyword search can be deployed in relational database-enabled peer-to-peer systems. Unlike in centralized systems, the key challenges in this new computing paradigm come from the autonomy of peers, the lack of a global schema, and the dynamics of peer connectivity. First, the concept of P2P-Join is proposed: a join operation that combines tuples from relations on different peers that contain certain keywords in the query. Second, a fully distributed framework to realize P2P-Join processing is devised, which not only inherits the syntax and semantics of the traditional join but also embraces the ideology of peer-to-peer computing. Finally, two mechanisms are proposed to improve the performance of the operation: a join path selection order scheme and a push-based load balancing mechanism across peers.

1 Introduction

Peer-to-peer (P2P) technology, also called peer computing, is an emerging paradigm that is now viewed as a potential technology for restructuring distributed architectures (e.g., the Internet). In a P2P distributed system, a large number of nodes (e.g., PCs connected to the Internet) can potentially be pooled together to share their resources, information and services. These nodes, which can both consume and provide data and/or services, may join and leave the P2P network at any time, resulting in a truly dynamic and ad hoc environment. The nature of such a design provides exciting opportunities for new applications.

While data sharing is the dominant application in P2P computing at present, only file-system-like capabilities are provided, and the semantics of data is largely ignored. In some unstructured systems (e.g., Gnutella [4]), searching is restricted to strings contained in a filename and directory path. Structured systems (e.g., Chord [20], CAN [17], and Pastry [18]) only support exact-name-match file searching. In a word, existing P2P systems support only semantics-free, large-granularity data sharing, and lack data management capabilities and support for semantic search. This is because P2P systems do not take semantics, data transformation and data relationships into consideration [5].

Recently, many researchers have sought to integrate P2P with database technologies into a highly distributed data sharing environment [5, 7, 10, 11, 12, 15, 16]. Such systems, with autonomous relational databases as peers, are referred to as relational database-enabled P2P systems (PeerDBS). In this paper, a novel operation is introduced into this type of system to mine knowledge from P2P networks. Specifically, the main contributions of this paper are as follows:

First, the concept of the P2P-Join operation is proposed, which joins tuples from relations on different peers that contain certain keywords in the query;

Second, a fully distributed framework to realize the P2P-Join operation is devised, which not only inherits the syntax and semantics of the traditional join operation, but also embraces the ideology of P2P computing;

Finally, two mechanisms are proposed to improve the performance of the operation: a join path selection scheme, and a push-based load balancing mechanism across peers.

The rest of the paper is organized as follows. Section 2 reviews related work. Section 3 states the problem and gives definitions. Section 4 presents the framework of P2P-Join. Section 5 discusses heuristics to improve performance, and Section 6 concludes the paper.

Proceedings of the 17th International Conference on Database and Expert Systems Applications (DEXA'06) 0-7695-2641-1/06 $20.00 2006

Authorized licensed use limited to: Maharashtra Institute of Technology. Downloaded on August 16,2010 at 11:49:18 UTC from IEEE Xplore. Restrictions apply.

2 Related Work

P2P technologies have been deployed in many applications, such as instant messaging (IM) [13], collaborative workgroup tools [6], CPU cycle sharing [19] and data sharing [3, 4, 15]. While most of these applications concern data sharing, current mechanisms are largely restricted to file-level sharing without the capabilities of a relational database.

In [5], the issues of data management in P2P environments are discussed from a database perspective. Its focus is largely on what database technologies can do for P2P applications. Though a preliminary architecture for peer data management (Piazza) is described, little is discussed about how peers cooperate. In contrast to [5], PeerOLAP [10] addresses the problem from the opposite direction: it looks at what P2P technologies can do for database applications. Essentially, PeerOLAP is still a client/server system. However, the cooperation among clients (peers) is explored: all data within clients is shared.

Since each peer maintains its data independently, some kind of understanding among peers is required to achieve semantically meaningful search. In PeerDB [15], an IR-based approach is used to mediate peer schemas, assuming a global thesaurus. In [16], a data model (LRM) is designed for P2P systems, with domain relations and coordination formulas to describe relationships between two peer databases. Data Mapping [12] is a simplified implementation of this model.

Some contributions have also been made by data integration researchers. However, unlike traditional data integration systems, where a global schema is assumed over a few data sources, a P2P system cannot simply assume a global schema, due to its high dynamicity and large number of data sources. Nevertheless, it may be possible to compose mediators, with some mediators defined in terms of other mediators and data sources, thus achieving system extensibility. This is the main thrust of [7] and [11]. While [7] focuses on how queries are reformulated with views, [11] focuses on selective view expansion, considering that full view expansion may be prohibitively expensive. Like [7] and [11], our work is mainly interested in the query processing aspects of peer database systems; however, we make few assumptions about peer availability, which means that our work can be applied in a more general P2P environment.

Exploiting keyword search in database querying has also drawn some attention recently [1, 2, 9]. However, these works focus only on centralized systems, where the semantics across different relations (e.g., key/foreign-key relationships) can be exploited to improve search accuracy. Such semantics are much harder to exploit in P2P systems, which is the main challenge of our work.

3 Problem Statement

3.1 An Overview of Relational Database-enabled Peer-to-Peer Systems

For generality, our work is based on common P2P protocols. Such a system has the following characteristics:

The system is decentralized and self-organized, while its peers are autonomous and dynamic;

Information is distributed among peers rather than concentrated at dedicated servers;

Peers are equivalent in functionality and responsibility, and interact with each other symmetrically.

To further exploit the merits of P2P protocols, we assume a three-layer architecture for a relational database-enabled P2P system. Specifically, from bottom to top, the layers are the structured layer, the unstructured layer and the application layer. The structured layer employs the protocol of structured P2P systems (e.g., Chord [20] and CAN [17]), which uses the DHT [8] mechanism to manage the meta-data of peers. The unstructured layer is implemented upon the protocol of unstructured P2P systems (e.g., BestPeer [14]), so that a peer in the system can be autonomous and able to reconfigure its neighbors dynamically. The application layer is a relational database system.

3.2 Query and Answer

A query is modelled as a set of keywords, i.e., q = {k1, ..., kt, qid}. Here, {k1, ..., kt} is a set of keywords that semantically describes what the user desires, while qid is a system-wide unique identifier for the query q, generated by its initiator. This query style is chosen mainly because there is no global schema in P2P database systems; moreover, it is user-friendly.

Each peer maintains its data in a relational database. When a query is processed, the peer searches its own database and returns the tuples containing all or some of the keywords in the query. Keyword searching in local relational databases can be done by DISCOVER [9] or other methods. Our focus is not on local processing, but on how to integrate tuples from different peers.

The results of a local query can take two forms: either a tuple in a relation contains all keywords in the query, or a tuple contains only a subset of the keywords in the query. For the latter, a final answer that contains all keywords in the query may be obtained by joining tuples from multiple relations. In addition, we expect some answers to be obtainable from individual peers (even those that involve joins), while others require cooperation among peers. We refer to the latter category as P2P-Join, which is our main interest and is defined next.
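The query model and the two forms of local results can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `Query`, `make_query` and `classify_local_results` are hypothetical, a UUID stands in for the system-wide unique qid, and keyword matching is simplified to exact equality with an attribute value rather than a full-text search such as DISCOVER.

```python
from dataclasses import dataclass
import uuid


@dataclass(frozen=True)
class Query:
    keywords: frozenset  # the set {k1, ..., kt}
    qid: str             # unique identifier generated by the initiator


def make_query(keywords):
    # Assumption: a random UUID is unique enough to serve as the qid.
    return Query(frozenset(keywords), uuid.uuid4().hex)


def classify_local_results(query, tuples):
    """Split local tuples into full matches and partial matches.

    A full match contains every query keyword; a partial match contains
    only a non-empty proper subset and is a candidate for P2P-Join.
    """
    full, partial = [], []
    for row in tuples:
        hit = query.keywords & set(row)
        if hit == query.keywords:
            full.append(row)
        elif hit:
            partial.append((row, hit))
    return full, partial
```

In this sketch a peer would return `full` directly to the requesting peer and keep `partial`, together with the matched keywords, for the cooperation among peers described next.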



3.3 Peer-to-Peer Join

Definition 1 P2P-Join is a join operation that combines two (or more) relations from two (or more) peers based on the semantics of the keywords and the syntax of the join operation of relational database systems.

For example, consider a query Q = {k1, k2}. Suppose there are two peers P1 and P2, such that peer P1 maintains a relation R1(A1, ..., B, ...) and peer P2 maintains a relation R2(A2, ..., B, ...), and both relations share a common attribute B. Furthermore, suppose some values of attribute A1 are k1 and some values of attribute A2 are k2. If there exist a tuple <k1, ..., b, ..., x> in R1 and a tuple <k2, ..., b, ..., y> in R2, then a P2P-Join operation can be performed, and the result is <k1, b, k2, ..., x, y>.

From the above example, we present a theorem for P2P-Join, which can be easily proved.

Theorem 1 Two tuples on different peers are joined by a P2P-Join operation if and only if they have equal values on the common attribute of the two relations belonging to the two peers and, furthermore, the two tuples contain different keywords of the query Q.
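The condition in Theorem 1 can be made concrete with a small Python sketch. This is an illustration under stated assumptions, not the paper's implementation: tuples are modelled as dictionaries, the function name `p2p_join` and its parameters are hypothetical, and a tuple "contains" a keyword when one of its attribute values equals it.

```python
def p2p_join(r1, r2, common, kw1, kw2):
    """Join rows of r1 (on peer P1) with rows of r2 (on peer P2).

    A pair is joined only if both rows agree on the attribute `common`,
    the r1 row contains kw1, and the r2 row contains kw2.
    """
    # Index the r2 rows that contain kw2 by their common-attribute value.
    index = {}
    for row in r2:
        if kw2 in row.values():
            index.setdefault(row[common], []).append(row)

    results = []
    for row in r1:
        if kw1 not in row.values():
            continue  # the r1 row does not carry its keyword
        for match in index.get(row[common], []):
            merged = dict(match)
            merged.update(row)  # the shared attribute has the same value
            results.append(merged)
    return results
```

With R1 holding a tuple whose A1 is k1 and B is b, and R2 holding a tuple whose A2 is k2 and B is b, the function returns a single merged tuple containing k1, b and k2, mirroring the example above.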

4 Framework of P2P-Join

4.1 An Overview

In general, query processing consists of six steps: query distribution, local processing, information exchange among peers, join graph generation, P2P join, and result propagation.

When a query is submitted, it is distributed to all neighbors of the initiating peer. The neighbors further forward the query to their own neighbors, and so on, until the query's lifetime (Time-to-Live, TTL) expires.
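The TTL-bounded distribution can be sketched as a breadth-first flood. The sketch below is only an illustration: the function name `flood` and the adjacency-dictionary representation are assumptions, and duplicate suppression (which in a real system would rely on the qid) is modelled by a set of already reached peers.

```python
from collections import deque


def flood(neighbors, origin, ttl):
    """Return the set of peers reached by a query flooded from `origin`.

    neighbors: dict mapping each peer to an iterable of its neighbors.
    ttl: number of hops the query may travel before it expires.
    """
    reached = {origin}
    frontier = deque([(origin, ttl)])
    while frontier:
        peer, remaining = frontier.popleft()
        if remaining == 0:
            continue  # lifetime expired, do not forward further
        for nb in neighbors.get(peer, ()):
            if nb not in reached:  # each peer handles a given qid once
                reached.add(nb)
                frontier.append((nb, remaining - 1))
    return reached
```

Breadth-first order guarantees that every peer is first reached over a shortest path, so it receives the query with the largest remaining TTL it could have.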

Each peer that receives the query first looks up the full-text index of its database to decide whether it contains some or all of the keywords in the query. Based on keyword searching methods in databases, tuples that contain all keywords are returned to the requesting peer directly, while tuples containing only some of the keywords are used for the peer-to-peer join. At present, there are many approaches to support local keyword-based query processing, such as DISCOVER [9], BANKS [2] and DBXplorer [1].

Based on the results of local processing, each peer exchanges information with its neighbors, namely the keywords found in tuples that include only a subset of the keywords in the query. After receiving this information from its neighbors, a peer generates a join graph, in which the vertices are peers and the edges connect pairs of peers that should be joined. How a join graph is generated is described in the next section.

With the join graph, a peer can identify the peers with which it can be joined, and try to perform a P2P join. Join results that contain all keywords in the query can be returned directly to the requesting peer, while results containing only some of the keywords are propagated along the join paths until final results including all keywords in the query are generated and returned to the user.

4.2 The Generation of Join Graph

A join graph is a graph G(V, E), in which V is the vertex set and E is the edge set. In such a graph, each vertex denotes a peer together with one keyword of the query Q. Accordingly, one peer may be denoted by several different vertices if it contains several keywords of the query. Two peers are connected by an edge if and only if they contain two different keywords of the query and they share certain common attributes with the same meaning. A peer's own vertices may be connected if the peer contains more than one keyword of the query and the corresponding tuples can be joined locally.

After processing the query locally, each peer sends the query keywords that it contains to all of its neighbors that have also been reached by the query with the same query identifier. When a peer receives the keywords from one of its neighbors, it compares them with the query keywords that appear in its own local database. If one or more keywords in the local database differ from those of the neighbor, the peer establishes a connection with that neighbor. With these operations, the join graph for a query is generated.

From the above, we can see that only peers connected by

an edge in the join graph need to be joined.
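The construction of the join graph can be sketched as follows. The sketch computes the graph centrally for readability, whereas the paper builds it through exchanges between neighbors; the function name `build_join_graph`, the dictionary layout, and the simplification that a peer's own vertices are always locally joinable are assumptions made for illustration.

```python
from itertools import combinations


def build_join_graph(peers, neighbors):
    """Build the vertices and edges of a join graph.

    peers: {peer: {"keywords": set of query keywords found locally,
                   "attributes": set of attribute names}}
    neighbors: {peer: set of neighboring peers}
    A vertex is a (peer, keyword) pair.
    """
    vertices = {(p, k) for p, info in peers.items() for k in info["keywords"]}
    edges = set()
    for (p, kp), (q, kq) in combinations(sorted(vertices), 2):
        if kp == kq:
            continue  # an edge must connect two different keywords
        if p == q:
            # Assumption: tuples on the same peer can be joined locally.
            edges.add(((p, kp), (q, kq)))
            continue
        adjacent = q in neighbors.get(p, ()) or p in neighbors.get(q, ())
        shared = peers[p]["attributes"] & peers[q]["attributes"]
        if adjacent and shared:
            edges.add(((p, kp), (q, kq)))
    return vertices, edges
```

Only neighboring peers that hold different keywords and share at least one attribute end up connected, which matches the observation that only peers joined by an edge need to be joined.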

4.3 Peer-to-Peer Join

First, we consider the problem of a P2P join between two relations from two different peers. For example, let R1(A1, ..., B, ...) and R2(A2, ..., B, ...) be two relational tables belonging to two different peers P1 and P2. According to the definition of P2P join, the tuples of R1 whose A1 value is not k1 and the tuples of R2 whose A2 value is not k2 can be filtered out directly, while the other tuples are retained. We would like the filtering operation to be performed locally first, since this reduces the bandwidth cost and distributes the processing load over different peers, thus improving query processing performance to some extent.

We can further improve performance with optimization techniques employed in RDBMSs, such as semi-join.
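How local filtering and a semi-join might combine is sketched below. The paper only names semi-join as an option, so the step order, the function name `semi_join_p2p` and the returned counters are assumptions; both relations live in one process here, and the comments mark where data would cross the network.

```python
def semi_join_p2p(r1, r2, common, kw1, kw2):
    """Semi-join style P2P join; rows are dicts.

    Returns (joined rows, number of join keys shipped from P1 to P2,
    number of reduced tuples shipped from P2 to P1).
    """
    # Step 1 (at P1): keep only the tuples that contain keyword kw1.
    local1 = [row for row in r1 if kw1 in row.values()]
    # Step 2 (P1 -> P2): ship only the distinct join-attribute values.
    keys = {row[common] for row in local1}
    # Step 3 (at P2): filter by keyword kw2 and by the received keys.
    reduced2 = [
        row for row in r2
        if kw2 in row.values() and row[common] in keys
    ]
    # Step 4 (P2 -> P1): ship the reduced tuples and finish the join at P1.
    joined = [
        {**s, **r}
        for r in local1
        for s in reduced2
        if r[common] == s[common]
    ]
    return joined, len(keys), len(reduced2)
```

The two counters give a rough sense of the bandwidth saved: only join keys and matching tuples are transferred instead of whole relations.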

4.4 The Propagation of Results

A join path is a path in the join graph in which each keyword of the query appears exactly once among the vertices. A final result can only be obtained after all join operations on the edges of a join path have been executed. Furthermore, once all join paths have been traversed, we assume that the complete answer set for the given query has been obtained. Therefore, to obtain the results of a query, each join path should be traversed, as described by the following procedure.

BEGIN
  For each join path
    For each edge connecting peers p and q
      P2P-Join p and q
      Reset neighbors and keywords
      Delete edge (p, q)
    Once the path is traversed,
      send results to the requesting peer
END

Note that the neighborhood relationship changes as the join process goes on. Furthermore, the process traverses all join paths, and the complete answer set for the given query is obtained when it finishes. Compared to traditional processing, the above traversal procedure in a P2P system is fully distributed.
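The notion of a join path used by the procedure above can be illustrated by enumerating such paths in a small join graph. The sketch is centralized and only enumerates paths; it does not execute the joins or update neighbors as the distributed procedure does. The function name `join_paths` and the (peer, keyword) vertex encoding are assumptions.

```python
def join_paths(edges, keywords):
    """Enumerate join paths: paths whose vertices cover every query
    keyword exactly once.

    edges: iterable of undirected pairs of (peer, keyword) vertices.
    Returns a set of paths, each a tuple of vertices in one canonical
    direction so that a path and its reverse are counted once.
    """
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    target = frozenset(keywords)
    found = set()

    def extend(path, seen_kw):
        if seen_kw == target:
            found.add(min(tuple(path), tuple(reversed(path))))
            return
        for nxt in adj.get(path[-1], ()):
            # A keyword may appear only once, which also prevents
            # revisiting a vertex.
            if nxt[1] in target and nxt[1] not in seen_kw:
                extend(path + [nxt], seen_kw | {nxt[1]})

    for start in adj:
        if start[1] in target:
            extend([start], frozenset({start[1]}))
    return found
```

The depth-first extension tries every ordering of keywords along connected vertices, which is where the factorial term in the worst-case analysis comes from.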

The above algorithm is similar to the problem of graph traversal, which has been proved to have exponential cost. However, with a fully distributed approach, the situation is different. Here, we only perform a worst-case time analysis. Suppose that in a P2P system n peers have been reached by the query, and let the maximum expected number of neighbors of a peer be k. Obviously, k ≪ n. Then the inner loop is executed at most O(k · m!) times, where m is the number of keywords in the query. Therefore, the total execution time in the worst case is O(k · n · m!). Since k and m are much smaller than n, the time complexity of the algorithm is much less than n^2.

5 Improvements

In this section, two heuristics are presented to further improve the efficiency of query processing in peer database systems. The first heuristic deals with the order of join path selection. The second heuristic balances the load among peers.

5.1 Join Path Selection

In the previous section, we assumed that a P2P join is processed in a local manner, that is, the join operation between each pair of peers does not affect other peers. However, some peers may have more neighbors while others have fewer, so that the relative importance of edges among peers in a join graph or subgraph differs. Therefore, different sequences of join operations along the edges of a join path have different costs, which provides opportunities for further optimization.

Figure 1. An example of a join graph (edges labeled 1 to 5)

In the database context, it is widely accepted that sharing common computation can be beneficial. For example, if we need to process both (A ⋈ B) and (A ⋈ B ⋈ C), we can process (A ⋈ B) first and store its result for later use when processing (A ⋈ B ⋈ C). Here, we use this heuristic to guide the choice of join sequence. Using the join graph in Figure 1 as an example, we now illustrate how the heuristic can help reduce computation cost. We consider how different orders of operations on the edges affect the reuse of edge 3 in the join graph, which reflects how efficiently the resources of the relational database systems are used.

1. If the join operation on edge 3 is executed first, its partial result can be exploited by the subsequent join operations (1,3), (1,3,4), (1,3,5), (2,3), (2,3,4) and (2,3,5), so the total number of reuses of edge 3 is 6.

2. However, if the join operation on edge 1 (or 2) is executed first, the partial result of edge 3 can only be exploited by (1,3,4) and (1,3,5) (or (2,3,4) and (2,3,5)). Thus, the total number of reuses of edge 3 is only 2.

3. In summary, the number of reuses of edge 3 in case (1) is three times (6/2 = 3) that in case (2), which implies that the join order greatly affects the cost and efficiency of the whole P2P join processing.
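The idea of storing a partial join result for later reuse can be sketched with a small cache. This is a generic illustration of shared computation, not a model of Figure 1, whose topology is not reproduced here; the names `natural_join` and `JoinCache` and the in-memory relations are assumptions.

```python
def natural_join(left, right):
    """Natural join of two lists of dict rows on their shared attributes."""
    out = []
    for l in left:
        for r in right:
            shared = l.keys() & r.keys()
            if all(l[a] == r[a] for a in shared):
                out.append({**l, **r})
    return out


class JoinCache:
    """Evaluate left-deep join sequences, reusing cached prefixes."""

    def __init__(self, relations):
        self.relations = relations  # {name: list of dict rows}
        self.cache = {}             # {tuple of names: joined rows}
        self.reuses = 0             # how often a stored result was reused

    def join(self, names):
        names = tuple(names)
        if len(names) == 1:
            return self.relations[names[0]]
        if names in self.cache:
            self.reuses += 1
            return self.cache[names]
        # Join the prefix first so that its result can be shared later.
        prefix = self.join(names[:-1])
        result = natural_join(prefix, self.relations[names[-1]])
        self.cache[names] = result
        return result
```

Evaluating the sequence (A, B) and then (A, B, C) computes A joined with B once and reuses it for the longer sequence, which is the effect the join order heuristic tries to maximize.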

As shown above, the sequence of join processing along the join path is a key factor in query processing performance in P2P systems. However, this heuristic may be difficult to implement, since it is not easy to synchronize peers. Additionally, it is expensive to obtain the complete graph and analyze it globally. We therefore design a simple mechanism to approximate this heuristic: before joining, each peer sends its number of neighbors to its neighbors. A peer agrees to join only with the half of its neighbors that have more neighbors than the other half. A join operation can be executed only when both peers agree to join.
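One possible reading of this agreement mechanism is sketched below. Several details are not fixed by the text and are assumptions here: the half is rounded up for odd neighbor counts, ties are broken by peer name, the neighbor relation is symmetric, and the function name `agreed_joins` is hypothetical.

```python
def agreed_joins(neighbors):
    """Return the pairs of peers that mutually agree to join.

    neighbors: {peer: set of neighboring peers}, assumed symmetric.
    Each peer accepts only the better-connected half of its neighbors;
    a join is executed only when both endpoints accept each other.
    """
    degree = {p: len(ns) for p, ns in neighbors.items()}
    accepts = {}
    for p, ns in neighbors.items():
        # Rank neighbors by their neighbor count, highest first.
        ranked = sorted(ns, key=lambda n: (-degree[n], n))
        keep = (len(ranked) + 1) // 2  # assumption: round the half up
        accepts[p] = set(ranked[:keep])
    return {
        frozenset((p, q))
        for p in neighbors
        for q in accepts[p]
        if p in accepts[q]
    }
```

Because each peer needs only the neighbor counts of its direct neighbors, the decision stays local and no peer has to obtain the complete join graph.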



5.2 Push-based Balancing

After several iterations of join operations, peers with many neighbors can become very busy, as observed in our experiments. To achieve better load balancing, we propose another heuristic: each peer sets a threshold, based on its processing power, that denotes how many join operations it can process simultaneously. If the number of joins a peer is processing exceeds the threshold, it can broadcast a request to its neighbors to let other peers execute the join operation instead.

Two cases of this heuristic exist. If one of its neighbors takes over the join task, only the join plan that places the result at the other endpoint of the connection is considered. Otherwise, the peers that should be joined send their data to a third peer, which takes over the join operation. Note that many existing join algorithms, such as pipelined join or ripple join, can execute such tasks.
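The threshold rule and the two cases can be sketched as a simple assignment function. This is one interpretation rather than the authors' protocol: the `Peer` class, the `assign_join` function, the preference for the other endpoint over a third peer, and the fallback to the overloaded owner when nobody has capacity are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Peer:
    name: str
    threshold: int         # joins the peer can process simultaneously
    active_joins: int = 0  # joins it is currently processing

    def has_capacity(self):
        return self.active_joins < self.threshold


def assign_join(owner, other_endpoint, neighbors):
    """Choose the peer that executes a join between owner and other_endpoint.

    If the owner is at its threshold, the task is pushed first to the
    other endpoint of the edge (case 1) and otherwise to a third peer
    among the owner's neighbors (case 2).
    """
    if owner.has_capacity():
        chosen = owner
    else:
        third_peers = [n for n in neighbors if n is not other_endpoint]
        candidates = [other_endpoint] + third_peers
        # Assumption: if nobody has spare capacity, the owner keeps the task.
        chosen = next((c for c in candidates if c.has_capacity()), owner)
    chosen.active_joins += 1
    return chosen
```

A real deployment would also decrement `active_joins` when a join finishes and would ship the participating tuples to the chosen peer; both are left out to keep the sketch short.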

5.3 Implementation

A prototype with the P2P-Join operation has been built upon BestPeer [14], a generic P2P platform on which P2P applications can be developed efficiently. BestPeer integrates mobile agent and P2P techniques. While P2P provides resource sharing among nodes, mobile agents extend its functions, including the P2P-Join operation. In addition, peers in BestPeer can dynamically reconfigure their neighbor candidates. Further, Chord [20] is employed to map meta-data (e.g., key, foreign key) among peers.

An experimental study has been conducted on the prototype, and the preliminary results are promising. Furthermore, with the two heuristics implemented, the performance is greatly improved compared with the original proposal.

6 Conclusion

This paper deploys relational database operations on top of P2P computing. First, the concept of P2P-Join is proposed, which combines tuples from relations on different peers that contain certain keywords in the query. Further, a fully distributed method to realize P2P-Join processing is devised, which inherits the syntax and semantics of the traditional join and embraces the ideology of P2P as well. Finally, two enhancements are proposed to improve the performance of the proposed P2P-Join operation. Since relational database-enabled operation in P2P computing is still in its infancy, other issues remain to be addressed, e.g., network optimization and cache management, which are the topics of our future research.

References

[1] S. Agrawal, S. Chaudhuri, and G. Das. DBXplorer: A system for keyword-based search over relational databases. In Proceedings of the 18th ICDE, CA, April 2002.

[2] G. Bhalotia, C. Nakhe, A. Hulgeri, S. Chakrabarti, and S. Sudarshan. Keyword searching and browsing in databases using BANKS. In Proceedings of the 18th ICDE, CA, April 2002.

[3] P. Druschel and A. Rowstron. PAST: Persistent and anonymous storage in a peer-to-peer networking environment. In Proceedings of the 8th IEEE Workshop on HotOS, 2001.

[4] Gnutella Homepage. http://gnutella.wego.com/.

[5] S. Gribble, A. Halevy, Z. Ives, M. Rodrig, and D. Suciu. What can databases do for peer-to-peer. In WebDB, 2001.

[6] Groove Home Page. http://www.groove.net.

[7] A. Y. Halevy, Z. G. Ives, and D. Suciu. Schema mediation in peer data management systems. In Proceedings of the 19th ICDE, 2003.

[8] M. Harren, J. Hellerstein, R. Huebsch, B. Loo, S. Shenker, and I. Stoica. Complex queries in DHT-based peer-to-peer networks. In IPTPS02, 2002.

[9] V. Hristidis and Y. Papakonstantinou. DISCOVER: Keyword search in relational databases. In VLDB2002, 2002.

[10] P. Kalnis, B. C. Ooi, W. S. Ng, D. Papadias, and K. L. Tan. An adaptive peer-to-peer network for distributed caching of OLAP results. In ACM SIGMOD, 2002.

[11] T. Katchaounov. Query processing in self-profiling composable peer-to-peer mediator databases. In Proc. EDBT Ph.D. Workshop 2002, 2002.

[12] A. Kementsietsidis, M. Arenas, and R. Miller. Data mapping in peer-to-peer systems. In Proceedings of the 19th ICDE, 2003 (Poster Paper).

[13] MSN Home Page. http://www.msn.com/.

[14] W. S. Ng, B. C. Ooi, and K. L. Tan. BestPeer: A self-configurable peer-to-peer system. In Proceedings of the 18th ICDE, San Jose, CA, April 2002 (Poster Paper).

[15] W. S. Ng, B. C. Ooi, K. L. Tan, and A. Zhou. PeerDB: A P2P-based system for distributed data sharing. In Proceedings of the 19th ICDE, 2003.

[16] A. B. Philip, G. Fausto, K. Anastasios, M. John, S. Luciano, and Z. Ilya. Data management for peer-to-peer computing: A vision. In WebDB Workshop, 2002.

[17] S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Shenker. A scalable content-addressable network. In Proceedings of SIGCOMM, 2001.

[18] A. Rowstron and P. Druschel. Pastry: Scalable, distributed object location and routing for large-scale peer-to-peer systems. In Proceedings of the International Conference on Distributed Systems Platforms (Middleware), Germany, Nov. 2001.

[19] Seti@home Home Page. http://setiathome.ssl.berkely.edu/.

[20] I. Stoica, R. Morris, D. Karger, F. Kaashoek, and H. Balakrishnan. Chord: A scalable peer-to-peer lookup service for internet applications. In Proceedings of SIGCOMM, 2001.

