A Physical Query Algebra for DHT-based P2P Systems
Kai-Uwe Sattler (1), Philipp Rösch (1), Erik Buchmann (2), Klemens Böhm (2)
(1) Department of Computer Science and Automation, TU Ilmenau
(2) Department of Computer Science, University of Magdeburg
E. Buchmann A Physical Query Algebra for DHT-based P2P Systems
Distributed Hash Tables
Examples: CAN, CHORD, PASTRY, etc.
Advantages of P2P systems, e.g.,
  No single point of failure (SPOF), shared infrastructure costs, censorship resistance
Manage huge sets of (key, value) pairs
Cope with large numbers of parallel transactions
Efficient query processing:
  Greedy forward routing
  But only simple exact-match queries on unstructured data sets
Extended Queries in DHT
Some extensions:
  Trigrams - text retrieval
    beethoven: bee eet eth tho hov ove ven
  Bloom filters - hash-based AND
  Feature vectors - multimedia documents
But:
  Extensions are application-specific
  No universal query algebra
Idea: Relational data sets, SQL-like queries
  Applications: management of genome data, semantic web, distributed indexes
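The trigram extension listed on this slide can be illustrated with a few lines of Python. This is only a sketch of how a search term might be split into trigrams before each one is hashed into the DHT; the function name `trigrams` is an assumption made for the example and does not come from the talk.

```python
def trigrams(term: str) -> list[str]:
    """Return all overlapping three-character substrings of term, in order."""
    return [term[i:i + 3] for i in range(len(term) - 2)]


print(" ".join(trigrams("beethoven")))  # bee eet eth tho hov ove ven
```

Each trigram could then serve as a separate DHT key, so a substring query turns into several exact-match lookups.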
Relational Data in DHT?
Storing relational data in DHT
  Fragmentation scheme?
  Accessing secondary keys?
Support for SQL-like query processing
  Distribution scheme for complex queries?
  Join operations?
  Full-table scan without flooding?
Exploiting the P2P nature
  No central instance, no global knowledge
  Parallel processing
  Problems with availability and failures
Outline of Our Approach
Use Content-Addressable Networks (CAN)
Locality-aware hash function
  Preserving neighborhood of similar tuples
  Space-filling curve
API Extension
  Multicast
  Temporary re-hashing
Distributed query plan operators (POP)
  Selection, join, grouping/aggregation
  POP distribution scheme
Content-Addressable Networks
Proposed by S. Ratnasamy (2001)
Keys: d-dimensional points
Key space is a torus in d dimensions
Example: d=2
Zones and Neighbors in CAN
Each peer is responsible for one zone, i.e., stores all (key, value) pairs of the zone
Each peer knows the neighbors of its zone
Random assignment of peers to zones at startup
Overloading of zones, multiple realities, ...
Greedy Forward Routing in CAN
get(k):
  1. Forward request to that neighbor whose zone is closest to k
  2. Repeat until the peer responsible for k is reached
[Figure: routing example; labels: get(k), (k,v)]
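The two routing steps can be sketched in Python under strong simplifying assumptions that are not part of the talk: the 2-dimensional torus is cut into a regular `GRID` x `GRID` layout of equal zones, each zone has exactly four neighbors, and "closest" is measured in zone hops on the torus. A real CAN has irregular zones and routes on the key coordinates, so all names and constants below are hypothetical.

```python
GRID = 4  # assumption: the unit torus is split into GRID x GRID equal zones


def owner(key):
    """Zone (column, row) whose area contains the 2-d key point."""
    x, y = key
    return (int(x * GRID) % GRID, int(y * GRID) % GRID)


def neighbors(zone):
    """The four zones adjacent to `zone`, with wrap-around on the torus."""
    cx, cy = zone
    return [((cx + 1) % GRID, cy), ((cx - 1) % GRID, cy),
            (cx, (cy + 1) % GRID), (cx, (cy - 1) % GRID)]


def hops_between(a, b):
    """Torus distance between two zones, counted in zone hops."""
    total = 0
    for i, j in zip(a, b):
        d = abs(i - j)
        total += min(d, GRID - d)
    return total


def route(start, key):
    """Greedy forwarding: always hand the request to the closest neighbor."""
    target = owner(key)
    path = [start]
    while path[-1] != target:
        path.append(min(neighbors(path[-1]),
                        key=lambda z: hops_between(z, target)))
    return path


print(route((0, 0), (0.9, 0.6)))  # [(0, 0), (3, 0), (3, 1), (3, 2)]
```

Every hop reduces the remaining distance by one zone, so the loop terminates. The first hop in the example uses the wrap-around edge of the torus.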
Managing Relational Data: Simple Approach
Relation r ∈ R, tuple t ∈ r, t = {ak, a1, ..., an}
Key k' = h(ak)
Problems:
  1. Tuples are irregularly disseminated over the key space, i.e., only exact-match queries are supported
  2. No search for attributes other than primary key
[Figure: key space with scattered points (x); example queries: σ5<ak<10(r) ? and σab=20(r) ?]
Fragmentation Scheme
Reverse bit interleaving (z-curve)
Tuple t ∈ r, t = {ak, a1, ..., an}
Two hash functions: Key k' = hr(r) ° hk(ak)
[Figure: the key (RelationID, Key Value) is built from hr and hk and mapped to Dimension #1 and Dimension #2; bit rows shown: 0 0 0 1 0 1 0 0 and 0 0 0 1 0 0 1 0; example point (1,2); label: Key k' = h(ak)]
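A rough Python sketch of bit interleaving may help to see why a z-curve keeps neighboring keys close together. The bit order, the 4-bit width per dimension, and the identity functions standing in for hr and hk are all assumptions made for this example; the exact scheme used in the talk cannot be recovered from the transcript.

```python
BITS = 4  # assumption: 4 bits per dimension, i.e. a 16 x 16 key space


def interleave(x, y, bits=BITS):
    """Merge the bits of x and y into one curve position (x on odd bits)."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i + 1)
        z |= ((y >> i) & 1) << (2 * i)
    return z


def deinterleave(z, bits=BITS):
    """Split a curve position back into its (x, y) coordinates."""
    x = y = 0
    for i in range(bits):
        x |= ((z >> (2 * i + 1)) & 1) << i
        y |= ((z >> (2 * i)) & 1) << i
    return x, y


def curve_position(relation_id, key_value, key_bits=BITS):
    """k' = hr(r) ° hk(ak), with both hash functions assumed to be the identity."""
    return (relation_id << key_bits) | key_value


for ak in range(4):
    print(ak, deinterleave(curve_position(1, ak)))
# 0 (0, 4)
# 1 (0, 5)
# 2 (1, 4)
# 3 (1, 5)
```

In this sketch the relation ID selects the region of the key space, and consecutive primary keys land in adjacent cells of that region.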
Two Hash Functions
Key k' = hr(r) ° hk(ak)
hr(r): RelationID determines the placement of the space-filling curve
hk(ak): primary key determines the position on the curve, locality-awareness
[Figure labels: relations ra, rb, rc; ak = 0, 1, 2, 3, 4, ...]
Additional API Primitives
Standard operations: put(k, v), v=get(k)
Only two additional operations needed for our query algebra: put_temp(), multicast()
put_temp(k, v, t)
  Re-hashing of a given relation
  Temporary put-operation
  Allows indexed access to other attributes than the primary key
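To make the idea of a temporary put more concrete, here is a single-node sketch in Python. The talk does not define the third parameter, so treating `t` as a lifetime is an assumption, as are the class name `TempStore` and the explicit `now` argument (used instead of a real clock so the example is deterministic).

```python
class TempStore:
    """Toy key-value store with permanent and expiring entries."""

    def __init__(self):
        self.permanent = {}   # key -> list of values
        self.temporary = {}   # key -> list of (value, expires_at)

    def put(self, k, v):
        self.permanent.setdefault(k, []).append(v)

    def put_temp(self, k, v, t, now):
        # Assumption: t is a lifetime; the entry is visible while now < expiry.
        self.temporary.setdefault(k, []).append((v, now + t))

    def get(self, k, now):
        live = [v for v, expires_at in self.temporary.get(k, [])
                if expires_at > now]
        return self.permanent.get(k, []) + live


store = TempStore()
row = {"ak": 7, "a1": 20}
store.put(row["ak"], row)                             # stored under the primary key
store.put_temp(("a1", row["a1"]), row, t=10, now=0)   # re-hashed on attribute a1
print(store.get(("a1", 20), now=5))    # [{'ak': 7, 'a1': 20}]
print(store.get(("a1", 20), now=15))   # []
```

While the temporary copy is alive, a lookup on a1 is an ordinary exact-match get. After it expires, only the primary-key copy remains.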
Additional API Primitives (Cont.)
multicast(zmin, zmax, POP)
  Sends a message to a group of peers
  Peers are identified by an interval of the z-curve
Example: σ3<ak<6(r)
  multicast(3,6, POP)
[Figure labels: send(σak=3), send(σ4<ak<6)]
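The effect of multicast on the example selection can be simulated locally. In the sketch below a plain loop stands in for message forwarding between neighbors, and the peer layout (four peers, each owning four consecutive curve positions) is hypothetical; only the call shape multicast(zmin, zmax, POP) is taken from the slide.

```python
def multicast(peers, zmin, zmax, pop):
    """Apply pop at every peer whose z-curve interval overlaps [zmin, zmax]."""
    contacted, results = [], []
    for peer in peers:
        if peer["lo"] <= zmax and zmin <= peer["hi"]:
            contacted.append(peer["name"])
            results.extend(pop(peer["tuples"]))
    return contacted, results


# Hypothetical layout: peer Pi owns curve positions 4*i .. 4*i + 3.
peers = [
    {"name": f"P{i}", "lo": 4 * i, "hi": 4 * i + 3,
     "tuples": [(k, f"v{k}") for k in range(4 * i, 4 * i + 4)]}
    for i in range(4)
]


def select_3_to_6(tuples):
    """Selection POP for 3 < ak < 6, evaluated on one peer's local tuples."""
    return [t for t in tuples if 3 < t[0] < 6]


contacted, rows = multicast(peers, 3, 6, select_3_to_6)
print(contacted)  # ['P0', 'P1']
print(rows)       # [(4, 'v4'), (5, 'v5')]
```

Only the peers whose interval touches [3, 6] see the operator, which is the point of addressing peers by a curve interval instead of flooding the network.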
Query Plan Operators (POP)
Hash-based implementation for selection, join, grouping, aggregation
Distributed query processing
Operator Trees
[Figure: operator tree over relations R, S, T]
Selection
Selection POP
On the primary key:
  Example: σ3<ak<6(r)
  Determine the interval on the z-curve
  Send selection operator via multicast
On other attributes:
  Example: σ3<a5<6(r)
  Perform full-table scan, e.g., multicast(min(a5), max(a5), POP)
Join
Nested Loop Join POP, Symmetric Hash Join POP
On the primary key:
  Perform join immediately
On other attributes:
  Re-hash the relation using put_temp first
  Perform join as above
Example: Symmetric Hash Join
[Figure: symmetric hash join of R and S; labels: fragments R1, R2, S1, S2; put_temp(h(tR),tR,x) and put_temp(h(tS),tS,x); shjoin(R,S) (shown four times); partial results RS1, RS2; R ⋈ S]
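Once tuples of R and S have been re-hashed on the join attribute, matching tuples arrive at the same peer. What such a peer might do locally is the textbook symmetric hash join, sketched below: every arriving tuple is inserted into the hash table of its own relation and immediately probes the table of the other one. The stream format and function names are assumptions for illustration, not the prototype's interface.

```python
from collections import defaultdict


def symmetric_hash_join(stream, key_r, key_s):
    """Join tuples arriving as ("R", tuple) or ("S", tuple) in any order."""
    table_r = defaultdict(list)
    table_s = defaultdict(list)
    for side, tup in stream:
        if side == "R":
            k = key_r(tup)
            table_r[k].append(tup)            # remember for later S arrivals
            for match in table_s.get(k, []):  # probe what S has sent so far
                yield (tup, match)
        else:
            k = key_s(tup)
            table_s[k].append(tup)
            for match in table_r.get(k, []):
                yield (match, tup)


arrivals = [("R", (1, "a")), ("S", (1, "x")), ("S", (2, "y")),
            ("R", (2, "b")), ("R", (1, "c"))]
first = lambda t: t[0]  # join on the first field of both relations
for pair in symmetric_hash_join(arrivals, first, first):
    print(pair)
# ((1, 'a'), (1, 'x'))
# ((2, 'b'), (2, 'y'))
# ((1, 'c'), (1, 'x'))
```

Results are emitted as soon as both partners have arrived, so neither input has to be complete before the join starts producing output.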
Sorting/Aggregation
Central grouping POP:
  One peer iterates over the z-curve, performs central sorting/aggregation
Hash group POP:
  Re-distribute the relation using a hash function on the attribute to be sorted/aggregated
  “Aggregation Peers” are responsible for sorting/aggregation of incoming attribute values
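The hash group POP can be pictured as follows: each tuple is sent to the peer selected by a hash of its grouping attribute, and that peer keeps a running aggregate. The Python sketch below simulates this in one process. The use of CRC32 as the hash function, SUM as the aggregate, and the sample data are assumptions chosen for the example.

```python
import zlib
from collections import defaultdict


def aggregation_peer(group_value, n_peers):
    """Pick the peer responsible for a group (CRC32 is an arbitrary choice)."""
    return zlib.crc32(str(group_value).encode("utf-8")) % n_peers


def hash_group_sum(tuples, group_attr, agg_attr, n_peers):
    """Re-distribute tuples by group and return one partial result per peer."""
    peers = [defaultdict(int) for _ in range(n_peers)]
    for t in tuples:
        target = aggregation_peer(t[group_attr], n_peers)
        peers[target][t[group_attr]] += t[agg_attr]
    return [dict(p) for p in peers]


sales = [
    {"city": "Ilmenau", "amount": 3},
    {"city": "Magdeburg", "amount": 4},
    {"city": "Ilmenau", "amount": 5},
]
partials = hash_group_sum(sales, "city", "amount", n_peers=3)
totals = {group: total for p in partials for group, total in p.items()}
print(totals == {"Ilmenau": 8, "Magdeburg": 4})  # True
```

Because all tuples of one group reach the same aggregation peer, the per-peer results can be concatenated without a further merge step.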
Query Evaluation
Input
  Left-handed POP trees
Design Principles
  Stateless evaluation
  Blocking operations: delivery of intermediate data (early aggregation)
[Figure: operator tree over relations R, S, T]
Query Evaluation: Example
[Figure: query evaluation across peers P0 to P5; labels: r1, r2, r2a, r2b, rra, rrb, rr]
Conclusion
Current state:
  Prototype is fully implemented
  Execution of plans like
    (shjoin a1=a2 (scan a3>42 REL1) (scan REL2))
  First experiments in small CAN (100 Peers) are promising
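The plan notation on this slide is a parenthesized operator tree. As an illustration only, the following Python sketch reads such a string into nested lists so that the tree structure becomes visible. It says nothing about how the prototype actually parses or executes plans, and `parse_plan` is a hypothetical helper that expects balanced parentheses.

```python
def parse_plan(text):
    """Turn a parenthesized plan string into nested lists of tokens."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()

    def read(pos):
        # A bare token is a leaf; "(" opens a new operator node.
        if tokens[pos] != "(":
            return tokens[pos], pos + 1
        node, pos = [], pos + 1
        while tokens[pos] != ")":
            child, pos = read(pos)
            node.append(child)
        return node, pos + 1

    tree, end = read(0)
    if end != len(tokens):
        raise ValueError("unexpected tokens after the end of the plan")
    return tree


plan = parse_plan("(shjoin a1=a2 (scan a3>42 REL1) (scan REL2))")
print(plan)
# ['shjoin', 'a1=a2', ['scan', 'a3>42', 'REL1'], ['scan', 'REL2']]
```

The outer list is the join operator with its condition, and the two inner lists are the scans that feed it.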
Conclusion (cont.)
Future topics:
  Experiments with large data sets and many nodes (100,000 nodes, 10 million queries, test data from the TPC-H benchmark)
  Optimization of the different POP implementations
  Efficient range queries
  Dynamic query operations