    P2P-Join: A Keyword Based Join Operation

    in Relational Database Enabled Peer-to-Peer Systems

    Zhigang Chen, Zhongding Huang

    Shanghai Second Polytechnic University

    No.2360 Jinhai Road, Shanghai, 201209

    {chen zhigang, huangzd}@sohu.com

    Bo Ling, Jiang Li

    China Executive Leadership Academy, Pudong

    No.99 Qiancheng Road, Shanghai, 201204

    {bling, jli}@celap.org.cn

    Abstract

    Query-by-keywords is the most popular way to search for data today. However, most existing work targets centralized relational databases. This paper investigates how keyword search can be deployed in relational database-enabled peer-to-peer systems. Unlike in centralized systems, the key challenges in this new computing paradigm come from the autonomy of peers, the lack of a global schema, and the dynamics of peer connectivity. First, the concept of P2P-Join is proposed: a join operation that combines tuples, among relations from different peers, that contain certain keywords in the query. Second, a fully distributed framework to realize P2P-Join processing is devised, which not only inherits the syntax and semantics of the traditional join but also follows the principles of peer-to-peer computing. Finally, two mechanisms are proposed to improve the performance of the operation: a scheme for ordering join path selection and a push-based load balancing mechanism across peers.

    1 Introduction

    Peer-to-peer (P2P) technology, also called peer computing, is an emerging paradigm that is now viewed as a potential technology for restructuring distributed architectures (e.g., the Internet). In a P2P distributed system, a large number of nodes (e.g., PCs connected to the Internet) can potentially be pooled together to share their resources, information and services. These nodes, which can both consume and provide data and/or services, may join and leave the P2P network at any time, resulting in a truly dynamic and ad-hoc environment. The nature of such a design provides exciting opportunities for new applications.

    While data sharing is the dominant application in P2P computing at present, only file-system-like capabilities are provided, and the semantics of data is largely ignored. In some unstructured systems (e.g., Gnutella [4]), searching is restricted to strings contained in a filename and directory path. Structured systems (e.g., Chord [20], CAN [17], and Pastry [18]) only support file searching by exact name match. In short, existing P2P systems support only semantics-free, large-granularity data sharing, and lack data management capabilities and support for semantic search. This is because P2P systems do not take semantics, data transformation and data relationships into consideration [5].

    Recently, many researchers have worked on integrating P2P with database technologies into a highly distributed data sharing environment [5, 7, 10, 11, 12, 15, 16]. Such systems, in which the peers are autonomous relational databases, are referred to as relational database-enabled P2P systems (PeerDBS). In this paper, a novel operation is introduced into this type of system to mine knowledge from P2P networks. Specifically, the main contributions of this paper are as follows:

    • First, the concept of the P2P-Join operation is proposed, which joins tuples, among relations from different peers, that contain certain keywords in the query;

    • Second, a fully distributed framework to realize the P2P-Join operation is devised, which not only inherits the syntax and semantics of the traditional join operation, but also follows the principles of P2P computing;

    • Finally, two mechanisms are proposed to improve the performance of the operation: a join path selection scheme, and a push-based load balancing mechanism across peers.

    The rest of the paper is organized as follows. Section 2

    reviews some related work. Section 3 states the problem

    and gives definitions. Section 4 presents the framework of

    P2P-Join. Section 5 discusses the heuristics to improve the performance, and Section 6 concludes the paper.

    Proceedings of the 17th International Conference on Database and Expert Systems Applications (DEXA'06), 0-7695-2641-1/06 $20.00 2006

    Authorized licensed use limited to: Maharashtra Institute of Technology. Downloaded on August 16,2010 at 11:49:18 UTC from IEEE Xplore. Restrictions apply.

    2 Related Work

    P2P technologies have been deployed in many applications, such as instant messaging (IM) [13], collaborative workgroup tools [6], CPU cycle sharing [19] and data sharing [3, 4, 15]. While most of these applications concern data sharing, current mechanisms are largely restricted to file-level sharing without relational database capabilities.

    In [5], the issues of data management in P2P environments are discussed from a database perspective. Its focus is largely on what database technologies can do for P2P applications. Though a preliminary architecture for peer data management (Piazza) is described, little is said about how peers cooperate. In contrast to [5], PeerOLAP [10] addresses the problem from the other direction: it looks at what P2P technologies can do for database applications. Essentially, PeerOLAP is still a client/server system. However, the cooperation among clients (peers) is explored: all data held by the clients is shared.

    Since each peer maintains its data independently, some kind of shared understanding among peers is required to achieve semantically meaningful search. In PeerDB [15], an IR-based approach is used to mediate peer schemas, with a global thesaurus being assumed. In [16], a data model (LRM) is designed for P2P systems, with domain relations and coordination formulas describing the relationships between two peer databases. Data Mapping [12] is a simplified implementation of this model.

    Data integration researchers have also contributed. However, unlike traditional data integration systems, where a global schema is assumed over a few data sources, a P2P system cannot simply assume a global schema, because of its highly dynamic nature and its large number of data sources. Nevertheless, it may be possible to compose mediators, with some mediators defined in terms of other mediators and data sources, thus achieving system extensibility. This is the main thrust of [7] and [11]. While [7] focuses on how queries are reformulated with views, [11] focuses on selective view expansion, considering that full view expansion may be prohibitively expensive. Like [7] and [11], our work is mainly interested in the query processing aspects of peer database systems. However, we make few assumptions about peer availability, so our work can be applied in a more general P2P environment.

    Exploiting keyword search in database querying has also drawn some attention recently [1, 2, 9]. However, these works focus only on centralized systems, where the semantics across different relations (e.g., key/foreign-key relationships) can be exploited to improve search accuracy. Such semantics are much harder to exploit in P2P systems, which is the main challenge of our work.

    3 Problem Statement

    3.1 An Overview of Relational Database-enabled Peer-to-Peer Systems

    For generality, our work is based on common P2P protocols. Such a system has the following properties:

    • The system is decentralized and self-organized, while its peers are autonomous and dynamic;

    • Information is distributed among peers rather than concentrated at dedicated servers;

    • Peers are equivalent in functionality and responsibility and interact with each other symmetrically.

    To further exploit the merits of P2P protocols, we assume a three-layer architecture for a relational database-enabled P2P system. Specifically, from bottom to top, the layers are the structured layer, the unstructured layer and the application layer. The structured layer employs the protocol of structured P2P systems (e.g., Chord [20] and CAN [17]), which uses the DHT [8] mechanism to manage the meta-data of peers. The unstructured layer is implemented upon the protocol of unstructured P2P systems (e.g., BestPeer [14]), so that a peer in the system can be autonomous and able to reconfigure its neighbors dynamically. The application layer is a relational database system.

    3.2 Query and Answer

    A query is modelled as a set of keywords together with an identifier, i.e., $q = \langle \{k_1, \ldots, k_t\}, qid \rangle$. Here, $\{k_1, \ldots, k_t\}$ is a set of keywords that semantically describes what the user wants, while $qid$ is a system-wide unique identifier for the query $q$, generated by its initiator. This query style is chosen mainly because there is no global schema in P2P database systems; moreover, it is user-friendly.
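The query model above can be sketched in a few lines of Python. This is only an illustration: the class name `Query`, the helper `make_query`, the use of a random UUID as the query identifier, and the lowercase normalisation are all assumptions, since the paper does not say how the initiator generates `qid` or how keywords are normalised.

```python
import uuid
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Query:
    """A keyword query q = <{k1, ..., kt}, qid>."""
    keywords: frozenset
    # Assumption: a random UUID stands in for the system-wide unique qid.
    qid: str = field(default_factory=lambda: uuid.uuid4().hex)


def make_query(*keywords):
    # Assumption: keywords are lowercased so matching is case-insensitive.
    return Query(frozenset(k.lower() for k in keywords))


q = make_query("Codd", "Relational")
```

Because the identifier is generated once by the initiator and travels with the keyword set, a peer that receives the same query over two different routes can recognise the duplicate by its `qid`.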

    Each peer maintains its data in a relational database. When a query is processed, the peer searches its own database and returns tuples containing all or some of the keywords in the query. Keyword searching in local relational databases can be done by DISCOVER [9] or other methods. Our focus is not on local processing, but on how to integrate tuples from different peers.

    The results of a local query can take two forms: either a tuple in a relation contains all keywords in the query, or a tuple contains only a subset of the keywords in the query. For the latter, a final answer that contains all keywords in the query may be obtained by joining tuples from multiple relations. In addition, we expect some answers to be obtainable from individual peers (even those that involve joins), while others require cooperation among peers. We refer to the latter category as P2P-Join, which is what we are most interested in and which is defined next.
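The two forms of local result can be illustrated with a small Python sketch. It assumes that tuples are plain dictionaries and that a tuple "contains" a keyword when some attribute value equals it; a real peer would rely on a keyword search method such as DISCOVER [9] instead. The function names are hypothetical.

```python
def matched_keywords(row, keywords):
    """Return the query keywords that occur among the tuple's attribute values."""
    values = {str(v).lower() for v in row.values()}
    return {k for k in keywords if k.lower() in values}


def classify(rows, keywords):
    """Split local matches into complete answers and candidates for P2P-Join."""
    full, partial = [], []
    for row in rows:
        hit = matched_keywords(row, keywords)
        if hit == set(keywords):
            full.append(row)              # contains every keyword: a local answer
        elif hit:
            partial.append((row, hit))    # contains a subset: needs a join
    return full, partial


rows = [
    {"a": "k1", "b": "k2"},
    {"a": "k1", "b": "x"},
    {"a": "z", "b": "y"},
]
full, partial = classify(rows, {"k1", "k2"})
```

Tuples that match no keyword at all are dropped, which is what keeps the later information exchange between peers small.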

    3.3 Peer-to-Peer Join

    Definition 1 P2P-Join is a join operation that combines

    two (or more) relations from two (or more) peers based on

    the semantics of keywords and syntax of join operation of

    relational database systems.

    For example, consider a query Q = {k1, k2}. Suppose there are two peers P1 and P2, such that peer P1 maintains a relation R1(A1, ..., B, ...) and peer P2 maintains a relation R2(A2, ..., B, ...), and both relations share a common attribute B. Furthermore, suppose some values of attribute A1 are k1 and some values of attribute A2 are k2. If there exist a tuple <k1, ..., b, ..., x> in R1 and a tuple <k2, ..., b, ..., y> in R2, then a P2P-Join operation can be performed, and the result is <k1, b, k2, ..., x, y>.

    From the above example, we present a theorem for P2P-

    Join, which can be easily proved.

    Theorem 1 Two tuples in different peers are joined by a P2P-Join operation if and only if they have an equal value on a common attribute of their two relations, and furthermore, the two tuples contain different keywords in the query Q.
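The example under Definition 1 can be replayed as a minimal Python sketch. It assumes both relations are available as lists of dictionaries in one process, which ignores the network entirely; the function name `p2p_join` and the `R1.`/`R2.` prefixes on the output columns are illustrative choices, not part of the paper.

```python
def p2p_join(r1, r2, k1, k2, attr):
    """Equi-join tuples of r1 containing k1 with tuples of r2 containing k2 on attr."""
    def has(row, keyword):
        return keyword in row.values()

    # Index the k2-tuples of R2 by their value of the common attribute.
    index = {}
    for t2 in r2:
        if has(t2, k2):
            index.setdefault(t2[attr], []).append(t2)

    out = []
    for t1 in r1:
        if not has(t1, k1):
            continue
        for t2 in index.get(t1[attr], []):
            merged = {f"R1.{c}": v for c, v in t1.items() if c != attr}
            merged.update({f"R2.{c}": v for c, v in t2.items() if c != attr})
            merged[attr] = t1[attr]   # the shared attribute appears once
            out.append(merged)
    return out


R1 = [{"A1": "k1", "B": "b", "X": "x"}, {"A1": "k1", "B": "c", "X": "w"}]
R2 = [{"A2": "k2", "B": "b", "Y": "y"}, {"A2": "other", "B": "c", "Y": "z"}]
result = p2p_join(R1, R2, "k1", "k2", "B")
```

Only the pair that agrees on B and carries both keywords survives, which matches the two conditions of Theorem 1: the second tuple of R1 finds a partner with B = c, but that partner does not contain k2.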

    4 Framework of P2P-Join

    4.1 An Overview

    In general, query processing consists of six steps: query distribution, local processing, information exchange among peers, join graph generation, P2P join, and result propagation.

    When a query is submitted, it is distributed to all neighbors of the initiating peer. The neighbors further forward the query to their own neighbors, and so on, until the query's lifetime (Time-to-Live, i.e., TTL) expires.
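Query distribution with a TTL can be simulated as a breadth-first flood. The sketch below assumes the overlay is known as an adjacency dictionary and that a peer ignores a query identifier it has already seen; both are simplifications, and the names `flood` and `overlay` are hypothetical.

```python
from collections import deque


def flood(neighbors, origin, ttl):
    """Return {peer: hops} for every peer the query reaches within ttl hops."""
    reached = {origin: 0}
    queue = deque([origin])
    while queue:
        peer = queue.popleft()
        if reached[peer] == ttl:
            continue  # TTL expired at this peer: do not forward further
        for nb in neighbors.get(peer, ()):
            if nb not in reached:  # a peer drops duplicates of the same query
                reached[nb] = reached[peer] + 1
                queue.append(nb)
    return reached


overlay = {
    "P1": ["P2", "P3"],
    "P2": ["P1", "P4"],
    "P3": ["P1"],
    "P4": ["P2", "P5"],
    "P5": ["P4"],
}
reached = flood(overlay, "P1", 2)
```

With a TTL of 2, peer P5 is three hops away from P1 and never sees the query, so it cannot contribute tuples to the join.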

    Each peer that receives the query first looks up the full-text index of its database to decide whether it contains some or all of the keywords in the query. Based on keyword searching methods in databases, tuples that contain all keywords are returned to the requesting peer directly, while tuples containing only some of the keywords are used for the peer-to-peer join. At present, there are many approaches that support local keyword-based query processing, such as DISCOVER [9], BANKS [2] and DBXplorer [1].

    Based on the results of local processing, each peer exchanges information with its neighbors, namely the keywords found in its tuples that include only a subset of the keywords in the query. After obtaining this information from its neighbors, a peer generates a join graph, in which the vertices are peers and the edges connect the pairs of peers that should be joined. How a join graph is generated is described in the next section.

    With the join graph, a peer can identify the peers with which it can be joined, and try to perform a P2P join. Join results that contain all keywords in the query can be returned directly to the requesting peer, while results containing only some of the keywords are propagated along the join paths until the final results, which include all keywords in the query, are generated and returned to the user.

    4.2 The Generation of Join Graph

    A join graph is a graph G(V, E), in which V is the vertex set and E is the edge set. In such a graph, each vertex denotes a peer together with one keyword in the query Q. Accordingly, one peer may be denoted by several different vertices if it contains several keywords in the query. Two peers are connected by an edge if and only if they contain two different keywords in the query and they share certain common attributes with the same meaning. The vertices of a single peer may be connected to each other if the peer contains more than one keyword in the query and the corresponding tuples can be joined locally.

    After processing the query locally, each peer sends the query keywords that it contains to all its neighbors that have also been reached by the query with the same query identifier. When a peer receives the keywords from one of its neighbors, it compares them with the query keywords that appear in its own local database. If one or more keywords in the local database differ from those of the neighbor, the peer establishes a connection with that neighbor. With these operations, the join graph for a query is generated.

    From the above, we can see that only peers connected by

    an edge in the join graph need to be joined.
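The construction of the join graph can be sketched as follows. The sketch is centralized for readability, whereas in the paper each peer builds its part of the graph from the keywords exchanged with its neighbors. Attribute names stand in for "common attributes with the same meaning", edges between vertices of the same peer are left out, and the function name and data layout are assumptions.

```python
from itertools import combinations


def build_join_graph(peers, neighbors):
    """peers: {peer: {"keywords": set, "attrs": set}}
    neighbors: set of frozenset({p, q}) overlay links reached by the query.
    Returns (vertices, edges), where a vertex is a (peer, keyword) pair."""
    vertices = {(p, k) for p, info in peers.items() for k in info["keywords"]}
    edges = set()
    for (p, kp), (q, kq) in combinations(sorted(vertices), 2):
        if p == q or kp == kq:
            continue  # local joins and same-keyword pairs are skipped in this sketch
        if frozenset((p, q)) in neighbors and peers[p]["attrs"] & peers[q]["attrs"]:
            edges.add(((p, kp), (q, kq)))
    return vertices, edges


peers = {
    "P1": {"keywords": {"k1"}, "attrs": {"A1", "B"}},
    "P2": {"keywords": {"k2"}, "attrs": {"A2", "B"}},
    "P3": {"keywords": {"k2"}, "attrs": {"C"}},
}
links = {frozenset(("P1", "P2")), frozenset(("P1", "P3"))}
vertices, edges = build_join_graph(peers, links)
```

P1 and P3 hold different keywords but share no attribute, so no edge connects them and no join is attempted between them.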

    4.3 Peer-to-Peer Join

    First, we consider the problem of a P2P join between two relations from two different peers. For example, let relation R1(A1, ..., B, ...) and relation R2(A2, ..., B, ...) be two relational tables belonging to two different peers P1 and P2. According to the definition of P2P join, the tuples of R1 whose value of attribute A1 is not k1, and the tuples of R2 whose value of attribute A2 is not k2, can be filtered out directly, while the other tuples are retained. We would like the filtering operation to be performed locally first, since this reduces the bandwidth cost and distributes the processing load over different peers, thus improving query processing performance to some extent.

    We can further improve performance by some optimiza-

    tion techniques employed in RDBMS, such as semi-join.
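Local filtering followed by a semi-join reduction might look like the sketch below. Both peers are simulated in one process, and the comments mark which values would cross the network; the helper names and the sample data are illustrative assumptions.

```python
def local_filter(rows, keyword):
    """Keep only the tuples that contain the keyword (done at the owning peer)."""
    return [r for r in rows if keyword in r.values()]


def semi_join_reduce(rows, attr, join_values):
    """Keep only the tuples whose join attribute can find a partner."""
    return [r for r in rows if r[attr] in join_values]


R1 = [{"A1": "k1", "B": 1}, {"A1": "k1", "B": 2}, {"A1": "zz", "B": 3}]
R2 = [{"A2": "k2", "B": 1}, {"A2": "k2", "B": 9}, {"A2": "qq", "B": 2}]

# At peer P1: filter locally, then ship only the distinct join values to P2.
r1 = local_filter(R1, "k1")
b_values = {r["B"] for r in r1}

# At peer P2: filter locally, then ship back only the tuples that can join.
r2 = semi_join_reduce(local_filter(R2, "k2"), "B", b_values)

# At peer P1: complete the join on the reduced inputs.
joined = [(t1, t2) for t1 in r1 for t2 in r2 if t1["B"] == t2["B"]]
```

In this toy run P2 ships a single tuple instead of its whole relation, which is where the bandwidth saving of the semi-join comes from.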

    4.4 The Propagation of Results

    A join path is a path in the join graph, in which each

    keyword in the query appears exactly once in the vertices. A

    final result can only be obtained after all join operations on the edges in a join path have been executed. Furthermore, once all join paths have been traversed, we assume that the complete answer set to a given query has been obtained. Therefore, to obtain the results of a query, each join path should be traversed, as described by the following procedure.

    BEGIN
      For each join path
        For each edge (p, q) on the path, connecting peers p and q
          P2P-Join p and q
          Reset neighbors and keywords
          Delete edge (p, q)
        Once the path is traversed,
          send results to the requesting peer
    END

    Note that the neighborhood relationship changes as the join process goes on. Furthermore, the process traverses all the join paths, and the complete answer set to the given query is obtained when it finishes. Compared to traditional processing, the above traversal procedure in a P2P system is fully distributed.
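The core of the traversal procedure, folding the joins along each join path, can be sketched in Python. This is a single-process simulation rather than the distributed procedure: it does not model the resetting of neighbors or the deletion of edges, and it assumes that attributes with the same name have the same meaning, so a natural join stands in for the P2P-Join on each edge. All names are hypothetical.

```python
def natural_join(left, right):
    """Join two lists of tuples (dicts) on all attribute names they share."""
    out = []
    for a in left:
        for b in right:
            common = a.keys() & b.keys()
            if common and all(a[c] == b[c] for c in common):
                out.append({**a, **b})
    return out


def traverse(join_paths, tuples_at):
    """Fold the joins along every join path and collect the final results."""
    results = []
    for path in join_paths:
        partial = tuples_at[path[0]]
        for vertex in path[1:]:
            partial = natural_join(partial, tuples_at[vertex])
            if not partial:
                break  # nothing left to propagate along this path
        results.extend(partial)  # would be sent to the requesting peer
    return results


tuples_at = {
    ("P1", "k1"): [{"A1": "k1", "B": "b"}],
    ("P2", "k2"): [{"A2": "k2", "B": "b", "C": "c"}],
    ("P3", "k3"): [{"A3": "k3", "C": "c"}],
}
path = [("P1", "k1"), ("P2", "k2"), ("P3", "k3")]
answers = traverse([path], tuples_at)
```

Each keyword appears exactly once on the path, so the single tuple that survives both joins contains all three keywords of the query.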

    The above algorithm is similar to the problem of traversing all paths of a graph, whose cost has been shown to be exponential. However, with a fully distributed approach, the situation is different. Here, we only perform a worst-case time analysis. Suppose that, in a P2P system, $n$ peers have been accessed by the query, and the maximum expected number of neighbors of a peer is $k$. Obviously, $k \ll n$. Then the inner loop is executed at most $O(k) \cdot m!$ times, where $m$ is the number of keywords in the query. Therefore, the total execution time in the worst case is $O(k n \, m!)$. Since $k$ and $m$ are much smaller than $n$, the time complexity of the algorithm is much less than $n^2$.

    5 Improvements

    In this section, two heuristics are presented to further improve the efficiency of query processing in peer database systems. The first heuristic deals with the order in which join paths are selected. The second heuristic balances the load among peers.

    5.1 Join Path Selection

    In the previous section, we assumed that a P2P join is processed in a local manner, that is, the join operation between each pair of peers does not affect other peers. However, some peers may have more neighbors while others have fewer, so the relative importance of edges among peers in a join graph or subgraph differs. Therefore, different sequences of join operations along the edges of a join path have different costs, which provides opportunities for further optimization.

    Figure 1. An example of Join Graph

    In the context of databases, it is widely accepted that sharing common computation is beneficial. For example, if we need to process both $(A \bowtie B)$ and $(A \bowtie B \bowtie C)$, we can process $(A \bowtie B)$ first and store its result for later use when processing $(A \bowtie B \bowtie C)$. Here, we use this heuristic to guide the choice of join sequence.

    Using the join graph in Figure 1 as an example, we now illustrate how the heuristic can help reduce computation cost. We consider how different orders of operations on the edges affect the reuse of edge 3 in the join graph, which reflects how efficiently the resources of the relational database systems are utilized.

    1. If the join operation on edge 3 is executed first, the partial result can be reused by the further join operations (1,3), (1,3,4), (1,3,5), (2,3), (2,3,4) and (2,3,5), so the total number of reuses of edge 3 is 6.

    2. However, if the join operation on edge 1 (or 2) is executed first, the partial result of edge 3 can only be reused by (1,3,4) and (1,3,5) (or by (2,3,4) and (2,3,5)). Thus, the total number of reuses of edge 3 is only 2.

    3. In summary, the total number of reuses of edge 3 in case (1) is three times (6/2 = 3) that in case (2), which implies that the join order greatly affects the cost and efficiency of the whole P2P join processing.

    As shown above, the sequence of join processing along the join path is a key factor for query processing performance in P2P systems. However, this heuristic may be difficult to implement, since it is not easy to synchronize peers. Additionally, it is expensive to obtain the complete graph and analyze it globally. We therefore design a simple mechanism to approximate this heuristic: before joining, each peer sends its number of neighbors to its neighbors. A peer only agrees to join with the half of its neighbors that have more neighbors than the other half. A join operation is executed only when both peers agree to join.
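The approximation can be sketched as a two-step agreement over neighbor counts. The paper does not say how to split an odd number of neighbors or how to break ties, so the sketch rounds the half up and breaks ties by peer name; both choices, like the function names, are assumptions.

```python
import math


def agreed_partners(graph, peer):
    """Neighbors that `peer` agrees to join with: the better-connected half."""
    ranked = sorted(graph[peer], key=lambda n: (-len(graph[n]), n))
    return set(ranked[: math.ceil(len(ranked) / 2)])


def joins_this_round(graph):
    """A join between p and q is executed only if both peers agree."""
    agree = {p: agreed_partners(graph, p) for p in graph}
    return {frozenset((p, q)) for p in graph for q in agree[p] if p in agree[q]}


graph = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}
round_one = joins_this_round(graph)
```

In this example the well-connected peer A is chosen by all of its neighbors, while A itself declines its least-connected neighbor D, so the joins on the edges A-B and A-C are executed first.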

    5.2 Push-based Balancing

    After several iterations of the join operations, the peers that have many neighbors can become very busy, as we have observed in our experiments. To achieve better load balancing, we propose another heuristic: each peer sets a threshold, based on its processing power, that denotes how many join operations it can process simultaneously. If the number of join operations a peer is processing exceeds the threshold, the peer can broadcast to its neighbors to let other peers execute the join operation instead.

    This heuristic has two cases. If one of the overloaded peer's neighbors takes over the join task, only the join plan that places the result at the other endpoint of the connection is considered. Otherwise, the peers to be joined send their data to a third peer, which takes over the join operation. Many existing join algorithms, such as pipelined join or ripple join, can execute such tasks.
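One possible reading of the push-based heuristic is sketched below. It assumes that a peer can accept a join while its current load is below its threshold, that the other endpoint of the join is preferred when it has spare capacity (the first case), and that otherwise the least-loaded free neighbor acts as the third peer (the second case). The function name, the data layout and the fallback of keeping the task locally when nobody is free are all assumptions.

```python
def assign_join(task, peers, neighbors):
    """task = (owner, other): a join between two peers, initially owned by `owner`.
    peers: {name: {"load": int, "threshold": int}}. Returns the executing peer."""
    owner, other = task

    def free(name):
        return peers[name]["load"] < peers[name]["threshold"]

    if free(owner):
        executor = owner
    else:
        # The overloaded owner broadcasts to its neighbors.
        candidates = [n for n in neighbors[owner] if free(n)]
        if other in candidates:
            executor = other  # case 1: the result stays at the other endpoint
        elif candidates:
            # case 2: both endpoints ship their data to a third peer
            executor = min(candidates, key=lambda n: peers[n]["load"])
        else:
            executor = owner  # nobody can help: keep the task locally
    peers[executor]["load"] += 1
    return executor


peers = {
    "P1": {"load": 2, "threshold": 2},
    "P2": {"load": 0, "threshold": 1},
    "P3": {"load": 0, "threshold": 3},
}
neighbors = {"P1": ["P2", "P3"]}
first = assign_join(("P1", "P2"), peers, neighbors)
second = assign_join(("P1", "P2"), peers, neighbors)
```

The first task moves to the other endpoint P2; once P2 has reached its own threshold, the second task falls through to the third peer P3.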

    5.3 Implementation

    A prototype with the P2P-Join operation has been built upon BestPeer [14], a generic P2P platform on which P2P applications can be developed efficiently. BestPeer integrates mobile agent and P2P techniques. While P2P provides resource sharing amongst nodes, mobile agents extend its functions, including the P2P-Join operation. In addition, peers in BestPeer can dynamically reconfigure their neighbor candidates. Further, Chord [20] is employed to map meta-data (e.g., keys and foreign keys) among peers.

    An experimental study has been conducted on the prototype, and the preliminary results are promising. Furthermore, with the two heuristics implemented, performance is greatly improved compared with the original proposal.

    6 Conclusion

    This paper deploys a relational database operation on top of P2P computing. First, the concept of P2P-Join is proposed, which combines tuples, among relations from different peers, that contain certain keywords in the query. Further, a fully distributed method to realize P2P-Join processing is devised, which inherits the syntax and semantics of the traditional join and follows the principles of P2P computing as well. Finally, two enhancements are proposed to improve the performance of the P2P-Join operation. Since relational database-enabled operation in P2P computing is still in its infancy, some other issues need to be addressed, e.g., network optimization and cache management, which are the topics of our future research.

    References

    [1] S. Agrawal, S. Chaudhuri, and G. Das. DBXplorer: A system for keyword-based search over relational databases. In Proceedings of the 18th ICDE, CA, April 2002.

    [2] G. Bhalotia, C. Nakhe, A. Hulgeri, S. Chakrabarti, and S. Sudarshan. Keyword searching and browsing in databases using BANKS. In Proceedings of the 18th ICDE, CA, April 2002.

    [3] P. Druschel and A. Rowstron. PAST: Persistent and anonymous storage in a peer-to-peer networking environment. In Proceedings of the 8th IEEE Workshop on HotOS, 2001.

    [4] Gnutella Homepage. http://gnutella.wego.com/.

    [5] S. Gribble, A. Halevy, Z. Ives, M. Rodrig, and D. Suciu. What can databases do for peer-to-peer. In WebDB, 2001.

    [6] Groove Home Page. http://www.groove.net.

    [7] A. Y. Halevy, Z. G. Ives, and D. Suciu. Schema mediation in peer data management systems. In Proceedings of the 19th ICDE, 2003.

    [8] M. Harren, J. Hellerstein, R. Huebsch, B. Loo, S. Shenker, and I. Stoica. Complex queries in DHT-based peer-to-peer networks. In IPTPS02, 2002.

    [9] V. Hristidis and Y. Papakonstantinou. DISCOVER: Keyword search in relational databases. In VLDB2002, 2002.

    [10] P. Kalnis, B. C. Ooi, W. S. Ng, D. Papadias, and K. L. Tan. An adaptive peer-to-peer network for distributed caching of OLAP results. In ACM SIGMOD, 2002.

    [11] T. Katchaounov. Query processing in self-profiling composable peer-to-peer mediator databases. In Proc. EDBT Ph.D. Workshop 2002, 2002.

    [12] A. Kementsietsidis, M. Arenas, and R. Miller. Data mapping in peer-to-peer systems. In Proceedings of the 19th ICDE, 2003 (Poster Paper).

    [13] MSN Home Page. http://www.msn.com/.

    [14] W. S. Ng, B. C. Ooi, and K. L. Tan. BestPeer: A self-configurable peer-to-peer system. In Proceedings of the 18th ICDE, San Jose, CA, April 2002 (Poster Paper).

    [15] W. S. Ng, B. C. Ooi, K. L. Tan, and A. Zhou. PeerDB: A P2P-based system for distributed data sharing. In Proceedings of the 19th ICDE, 2003.

    [16] A. B. Philip, G. Fausto, K. Anastasios, M. John, S. Luciano, and Z. Ilya. Data management for peer-to-peer computing: A vision. In WebDB Workshop, 2002.

    [17] S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Shenker. A scalable content-addressable network. In Proceedings of SIGCOMM, 2001.

    [18] A. Rowstron and P. Druschel. Pastry: Scalable, distributed object location and routing for large-scale peer-to-peer systems. In Proceedings of the International Conference on Distributed Systems Platforms (Middleware), Germany, Nov. 2001.

    [19] Seti@home Home Page. http://setiathome.ssl.berkely.edu/.

    [20] I. Stoica, R. Morris, D. Karger, F. Kaashoek, and H. Balakrishnan. Chord: A scalable peer-to-peer lookup service for internet applications. In Proceedings of SIGCOMM, 2001.
