AmbientDB Relational Query Processing in a P2P Network
Peter Boncz and Caspar Treijtel
LEE BYUNGIL
PL Lab.
Hongik University
2004.11.14
2
Outline
1. Introduction
  1.1 Goal
  1.2 Assumptions
  1.3 Example: Collaborative Filtering in a P2P Database
  1.4 Overview
2. AmbientDB Architecture
  2.1 Data Model
  2.2 Query Execution in AmbientDB
  2.3 Dataflow Execution
  2.4 Executing the Collaborative Filtering Query
3. DHTs in AmbientDB
  3.1 Example: Approximated Collaborative Filtering
4. Conclusion
3
1. Introduction (1)
AmbientDB
  A new peer-to-peer (P2P) DBMS prototype
  Developed at CWI (Centrum voor Wiskunde en Informatica)
  Distributed over an ad-hoc P2P network
  Global query algebra
  Multi-wave stream processing plans
Ambient Intelligence (AmI)
  Digital environments in which multimedia services are sensitive to people's needs
4
Music Playlist Scenario
amP2P player
  Log: meta-information (homogeneous)
  Content: AmbientDB instance or external sources (heterogeneous)
AmbientDB
  Its collection: only meta-information
5
1.1 Goal
Full relational database functionality
Cooperate in an ad-hoc way with other AmbientDB devices
Propose
  A general architecture for AmbientDB
  Complex query processing in an ad-hoc P2P network
6
1.2 Assumptions (1)
Upscaling (flexibility)
  The number of cooperating devices is potentially large
  Home environment and ad-hoc P2P network
Downscaling
  Devices often have few resources (CPU, memory, network, battery)
Schema integration
  All devices operate under a common global schema
Data placement
  Data placement is determined by the user
Network failure
  Resilience of Chord
  While a query runs, the routing tree stays intact
7
Chord
8
1.2 Assumptions (2)
Distributed database
  Participating sites known a priori; not the case in AmbientDB
Federated database
  Statically configured; heterogeneous schema integration
Mobile database
  Centralized database server and clients (mobile nodes)
P2P file sharing system
  Non-centralized, ad-hoc topologies
  Simple keyword text search
9
Example Music Schema
The global schema “AMP2P” in AmbientDB
Distributed table
  On the global level: the union of all horizontal fragments of these tables
10
1.3 Example: Collaborative Filtering in a P2P Database (1)
amP2P player
  Access to a local content repository (digital music collection)
  AmbientDB instance
Share all music content in the “home zone”
Share only the meta-information in the huge P2P network
11
1.3 Example: Collaborative Filtering in a P2P Database (2)
Memory-based implicit voting scheme
Predicted vote for the active user a for item j
  v_{i,j} = the vote of user i on item j
  w(a,i) = weight function defined on the active user a and user i
  v_i = average vote for user i
  k = normalizing factor
weight(user_a, user_i)
  Number of times the example song has been fully played by user i
Refined form
  Negative information: skipped songs
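The formula image on this slide did not survive transcription. The listed symbols match the standard memory-based prediction, p(a,j) = v_a + k * sum_i w(a,i) * (v(i,j) - v_i), where v_a is the active user's average vote. The sketch below is a minimal illustration under that assumption; the function name, the dictionary layout and the choice k = 1 / sum |w(a,i)| are hypothetical and not taken from the slides.

```python
from statistics import mean

def predict_vote(active_votes, other_users, weights, item):
    """Predict the active user's vote on `item`.

    active_votes: {item: vote} for the active user a
    other_users:  {user: {item: vote}} for every other user i
    weights:      {user: w(a, i)}, e.g. full plays of the example song by user i
    """
    active_mean = mean(active_votes.values())
    # Only users who actually voted on the item contribute.
    voters = {u: v for u, v in other_users.items() if item in v}
    total_weight = sum(abs(weights[u]) for u in voters)
    if total_weight == 0:
        return active_mean  # nothing to learn from other users
    k = 1.0 / total_weight  # assumed normalizing factor
    deviation = sum(weights[u] * (v[item] - mean(v.values()))
                    for u, v in voters.items())
    return active_mean + k * deviation
```

With implicit votes, a vote of 1 could stand for "played fully" and 0 for "skipped", which is one way to read the refined form with negative information.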
12
Collaborative Filtering Query in SQL
13
1.4 Overview
General architecture
  Includes the data model
Query execution
  Three-level query execution process
DHT (Distributed Hash Table)
  Global table indices
  Optimize the query
Related work & future work
Conclusion
14
AmbientDB Architecture
15
2. AmbientDB Architecture
Distributed query processor
  Executes queries on all ad-hoc connected devices
P2P protocol
  Chord: a scalable lookup and routing scheme
  P2P IP overlay networks made out of unreliable connections
  Query node = root
  A small number of connections per node
  Simultaneous bi-directional communication and query processing
  DHTs: global table indices
Local DB component
  Local table
  Embedded database
  External data source: wrapper component (distributed database system)
Schema integration engine
  Meta-data translation
  Using view-based schema mappings
16
AmbientDB Routing Tree Using IP Overlay
17
2.1 Data Model (1)
Standard relational data model & algebra as query language
Queries are formulated against global tables
  Local node, a limited set of nodes, or all reachable nodes
Converging answer
  Query locally
  Re-issue iteratively over more nodes
18
2.1 Data Model (2)
Abstract Table
LT (Local Table)
  Each node has a private schema
  Global schema: global table T
  All participating nodes N_i carry a table instance T_i
  In the query node, T_i may be accessed as an LT
DT (Distributed Table)
  Q: the set of nodes that participate in some global query
  The union of local table instances
19
2.1 Data Model (3)
PT (Partitioned Table)
  Specialization of the DT
  All participating tuples in each T_i are disjoint between all nodes
  Advantage over DT
    Exact query answers can often be computed in an efficient distributed fashion
    By broadcasting a query and letting each node compute a local result without need for communication
  Attaching a bitmap index T_i.Q to each local table T_i
“Virtual” column #NODEID
  Identifies the node on which each tuple stored in a DT/PT is located
  Location-specific query restrictions
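The relationship between LT, DT and the #NODEID column can be sketched in a few lines. This is only an illustration: the function names and the representation of a table as a list of tuples are assumptions made here, not AmbientDB's actual interface.

```python
def distributed_table(local_tables):
    """DT: the union of all local table instances T_i.

    Each tuple is paired with the id of the node holding it, which plays
    the role of the virtual #NODEID column.
    """
    return [(node_id, row)
            for node_id, rows in sorted(local_tables.items())
            for row in rows]

def restrict_to_node(dt, node_id):
    """Location-specific restriction: keep only tuples stored on one node."""
    return [row for nid, row in dt if nid == node_id]

def is_partitioned(dt):
    """A DT qualifies as a PT only if no tuple occurs more than once."""
    rows = [row for _, row in dt]
    return len(rows) == len(set(rows))
```

A DT in which two nodes hold the same tuple fails `is_partitioned`, which is why an exact count or sum over a plain DT cannot simply be computed node by node and added up.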
20
LT, DT and PT
21
2.2 Query Execution in AmbientDB (1)
Three-level translation
Abstract level
  User query
  Selection, join, aggregation, sort
  Lists (List<Type>)
  List instances <a,b,c>
Concrete level
  Table parameters, return value
  Partition, union
Execution level
  Wave-plans
22
The Abstract Global Algebra
23
The Concrete Global Algebra
24
2.2 Query Execution in AmbientDB (2)
Starting at the leaves
  Abstract query plan -> concrete query plan
  Concrete operators have concrete result types
  The process continues to the root of the query graph
Local result table, hence LT
Local concrete variant of all abstract operators
  All tables -> LT
Concrete union (T1) -> LT
More efficient alternative query plans
25
2.2 Query Execution in AmbientDB (3)
select, aggr and order support distributed execution (dist)
  Execute in all nodes on their local partition (LT) of a PT or a DT
  Produce again a distributed result (PT or DT)
  Broadcast the query through the routing tree
  The result is again dispersed over all nodes as a PT or DT
aggr_merge = aggr_local(union_merge(DT)) : LT
  Reduces the fragments to be collected in the query node
  Saves considerable bandwidth
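The bandwidth saving of merged aggregation comes from combining partial results at every hop of the routing tree instead of shipping raw tuples to the query node. The sketch below illustrates this for a simple count per key; the tree encoding and the function name are hypothetical, and the recursion stands in for the upward tuple wave of a real network.

```python
from collections import Counter

def aggr_merge(node, children, local_rows):
    """Count rows per key over the subtree rooted at `node`.

    children:   {node: [child nodes]} describing the routing tree
    local_rows: {node: [keys]} holding each node's local table fragment
    """
    # Aggregate the local fragment first ...
    partial = Counter(local_rows.get(node, []))
    # ... then fold in the already-aggregated results of each child.
    for child in children.get(node, []):
        partial += aggr_merge(child, children, local_rows)
    return partial
```

Each node forwards at most one entry per distinct key to its parent, so the query node receives compact partial aggregates rather than every log record.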
26
2.2 Query Execution in AmbientDB (4)
Join variants
  Broadcast join (LT, T1) -> T1
  Foreign-key join (T1, DT) -> T1
    Uses referential integrity to minimize communication
  Split join (LT1, T1) -> T1
    Reduces bandwidth consumption: O(T*N) -> O(T*log(N))
partition
  A special operator that performs double elimination (removal of duplicate tuples)
  Creates a PT from a DT by creating a tuple participation bitmap at all nodes
  To be able to use the dist operators, we should convert a DT to a PT
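The effect of the partition operator can be illustrated with a centralized sketch. It assumes a fixed visiting order of the nodes and a single shared set of tuples already seen; how AmbientDB actually exchanges this information between nodes is not described on the slide, so the names and the mechanism below are assumptions.

```python
def partition(local_tables, node_order):
    """Build one participation bitmap per node, turning a DT into a PT.

    A tuple participates only on the first node (in `node_order`) that
    holds it; later copies get a False bit and are ignored by queries.
    """
    seen = set()
    bitmaps = {}
    for node in node_order:
        bits = []
        for row in local_tables[node]:
            bits.append(row not in seen)  # True only for the first copy
            seen.add(row)
        bitmaps[node] = bits
    return bitmaps
```

After this step every tuple is counted exactly once across the network, so distributed operators can produce exact answers from purely local work.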
27
Mappings
28
2.3 Dataflow Execution (1)
Query processing paradigm
  The routing tree, built on TCP connections, is used to pass bi-directional tuple streams
  Multiple such waves run simultaneously (upward and downward)
Third translation phase
  Concrete query plan -> wave-plans
  Concrete operator -> one or more waves (local dataflow algebra operators)
29
2.3 Dataflow Execution (2)
dist plans for select, aggr, order and foreign-key join
  Buffer-to-buffer local operator in each node, without further communication
broadcast join
  Propagates a tuple wave through the network
split
  Split(<true,true>,<c1,c1>)
  Ordered -> effectively forming a DT/PT
scan-select, quick-sort, merge-join, heap-based top-N, ordered aggregation
  All stream-based
  Require little memory
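A heap-based top-N shows why stream-based operators suit devices with little memory: the operator keeps only N tuples regardless of how long the input stream is. The sketch below uses Python's standard `heapq` module; the function name and the tuple layout are illustrative assumptions.

```python
import heapq

def top_n(stream, n, key=lambda t: t):
    """Return the n tuples with the largest key, holding at most n in memory."""
    if n <= 0:
        return []
    heap = []  # min-heap of (key, arrival number, tuple); smallest kept key on top
    for seq, tup in enumerate(stream):
        entry = (key(tup), seq, tup)
        if len(heap) < n:
            heapq.heappush(heap, entry)
        elif entry[0] > heap[0][0]:
            # The new tuple beats the weakest kept tuple, so replace it.
            heapq.heapreplace(heap, entry)
    return [tup for _, _, tup in sorted(heap, reverse=True)]
```

The input can be any iterator, for example tuples arriving over a connection, and it is consumed exactly once.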
30
The Dataflow Algebra
31
2.4 Executing the Collaborative Filtering Query (1)
32
2.4 Executing the Collaborative Filtering Query (2)
33
2.4 Executing the Collaborative Filtering Query (3)
Problems
  Query 1
    Produces a large list of all users that have ever listened to the example song
    Hogs resources from all nodes in the network
  Query 2
    Basically sends all log records to the query node for aggregation
Both can be executed more efficiently in an AmbientDB enriched with DHTs
34
3. DHTs in AmbientDB (1)
Useful lookup structures for large-scale P2P applications
Reduce the number of nodes involved in answering a query
Involving many nodes
  Decreases query performance
  Creates an overload at the average query frequency
Gnutella (does not use DHTs or global indices)
  Easy to locate popular music
  Difficult to locate less well-known songs
35
3. DHTs in AmbientDB (2)
Enable the query optimizer to automatically accelerate selection queries using such DHTs
DHT indices can be exploited by a query optimizer to accelerate lookup queries
Special form of a PT, as the partitions are disjoint
select_chord(DHT) : LT
Dataflow level
  Route a message to the Chord finger on which the selection key-value hashes
  Retrieve all corresponding tuples as an LT via a direct TCP/IP transfer
Non-complete index
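The idea of a DHT as a disjoint, hash-partitioned index can be sketched as follows. Keys and node names are hashed onto one identifier ring, and each key is stored on the first node whose identifier is at or after the key's identifier, which is the successor rule Chord uses. The class and its methods are hypothetical, and the lookup here is a direct computation, whereas real Chord reaches the responsible node through finger-table routing in a logarithmic number of hops.

```python
import hashlib
from bisect import bisect_left

def chord_id(value, bits=16):
    """Hash a node name or key onto a ring of 2**bits identifiers."""
    digest = hashlib.sha1(str(value).encode()).digest()
    return int.from_bytes(digest, "big") % (2 ** bits)

class DhtIndex:
    """Toy global index: each key lives on exactly one node."""

    def __init__(self, node_names, bits=16):
        self.bits = bits
        self.ring = sorted((chord_id(n, bits), n) for n in node_names)
        self.store = {n: {} for n in node_names}

    def responsible_node(self, key):
        ids = [node_id for node_id, _ in self.ring]
        # First node at or after the key's id, wrapping around the ring.
        pos = bisect_left(ids, chord_id(key, self.bits)) % len(self.ring)
        return self.ring[pos][1]

    def insert(self, key, tup):
        self.store[self.responsible_node(key)].setdefault(key, []).append(tup)

    def select(self, key):
        """Selection on the index key: contact one node, get tuples back as a list."""
        return list(self.store[self.responsible_node(key)].get(key, []))
```

A selection on the indexed key touches a single node instead of being broadcast to all of them, which is the saving the slides attribute to DHT indices.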
36
DT and DHT in AmbientDB
37
3.1 Example: Approximated Collaborative Filtering (1)
HISTO
  Static histogram of fully-listened-to songs per user
  Reduces the histogram computation cost of the query
38
Optimized collaborative filtering query in SQL
39
3.1 Example: Approximated Collaborative Filtering (2)
40
3.1 Example: Approximated Collaborative Filtering (3)
41
Network Bandwidth Compared
42
4. Conclusion
Full query processing architecture
  Executing queries in a declarative, optimizable language over an ad-hoc P2P network
DHT
  Efficient global indices