Range and kNN Searching in P2P. Manesh Subhash, Ni Yuan, Sun Chong.
Range and kNN Searching in P2P
Manesh Subhash
Ni Yuan
Sun Chong
Outline
- Range query searching in P2P
  - one-dimensional range queries
  - multi-dimensional range queries
  - comparison of range-query approaches in P2P
- kNN searching in P2P
  - scalable nearest neighbor search
  - PIERSearch
- Conclusion
Motivation
- Most P2P systems support only simple lookup queries.
- DHT-based approaches such as Chord and CAN are not suitable for range queries.
- More complicated queries, such as range queries and kNN search, are needed.
P-Tree [APJ+04]
- The B+-tree is widely used for efficiently evaluating range queries in centralized databases.
- A distributed B+-tree is not directly applicable in a P2P environment.
- Two variants: the fully independent B+-tree, and the semi-independent B+-tree, i.e. the P-tree.
Fully independent B+-tree
[Figure: eight peers P1:4, P2:8, P3:12, P4:20, P5:24, P6:25, P7:26, P8:35 over the sorted values 4 8 12 20 24 25 26 35; each peer maintains its own complete B+-tree over all values, rooted at its own value.]
Semi-independent B+-tree (P-tree)
[Figure: the same eight peers P1:4 through P8:35; instead of a full tree, each peer stores only the leftmost root-to-leaf path of its own virtual B+-tree, e.g. P1 stores (4 24 26) and (4 8 12 20).]
Coverage & Separation
[Figure: the value ring 4 8 12 20 24 25 26 35, with node entries illustrating the coverage and separation properties, and counterexamples showing anti-coverage (a gap in the covered range) and overlap.]
Properties of the P-tree
- Each peer stores O(log_d N) tree nodes.
- Total storage per peer is O(d log_d N).
- Requires no global coordination among peers.
- The search cost for a range query that returns m results is O(m + log_d N).
Search Algorithm
Example: a range query 21 < value < 29 submitted at peer P1.
[Figure: the query is routed down the P-tree levels l0, l1 to the first peer whose value is in range (P5:24); the results 24, 25, 26 are then collected along the ring of successors.]
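The O(m + log_d N) cost has two phases: locating the first in-range peer, then scanning successors. A minimal sketch of the scan phase, modeling each peer as a single value on the sorted ring (illustrative only, not the paper's algorithm):

```python
def range_scan(ring, lo, hi):
    """Scan phase of a P-tree-style range query (lo < value < hi).
    `ring` is the sorted list of peer values.  Locating the first
    in-range peer costs O(log_d N) hops in the real P-tree; a linear
    walk stands in for it here.  Results are then collected by
    following successor pointers, one hop per result."""
    i = 0
    while i < len(ring) and ring[i] <= lo:   # locate first peer in range
        i += 1
    out = []
    while i < len(ring) and ring[i] < hi:    # follow successors
        out.append(ring[i])
        i += 1
    return out
```

For the slide's example query 21 < value < 29 over peers holding 4, 8, 12, 20, 24, 25, 26, 35, this returns 24, 25, 26.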
Multi-dimensional range queries
- Routing in a one-dimensional routing space: ZNet (Z-ordering + skip graph) [STZ04]; Hilbert space-filling curve + Chord [SP03]; SCRAP [GYG04]
- Routing in a multi-dimensional routing space: MURK [GYG04]
Desiderata
- Locality: data elements that are nearby in the data space should be stored on the same node or on close nodes.
- Load balance: the amount of data stored by each node should be roughly the same.
- Efficient routing: the number of messages exchanged between nodes to route a query should be small.
Hilbert SFC + Chord
A space-filling curve (SFC) maps a d-dimensional cube onto a line; the line passes exactly once through each point in the volume of the cube.
[Figure: first- and second-order Hilbert curves on a 2-D grid; cells are labeled 00-11 and 0000-1111 in curve order.]
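The curve-order index of a grid cell can be computed with the standard bit-manipulation conversion (a sketch for 2-D; `n` is the grid side length, a power of two):

```python
def xy2d(n, x, y):
    """Map grid cell (x, y) in an n x n grid (n a power of two) to its
    index along the Hilbert curve, using the standard quadrant
    rotate-and-flip recurrence."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # rotate/flip the quadrant so the next level is in standard pose
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d
```

Because consecutive Hilbert indices are always grid-adjacent cells, nearby points in the data space tend to land in nearby positions of the 1-D index space, which is exactly the locality property the mapping is chosen for.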
Hilbert SFC + Chord (cont.)
The 1-dimensional index space is mapped onto the Chord overlay network topology.
[Figure: a Chord ring with nodes 0, 4, 8, 11, 14; data elements with keys 5, 6, 7, 8 are stored at node 8.]
Query Processing
- Translate the keyword query to the relevant clusters of the SFC-based index space.
- Query the appropriate nodes in the overlay network for the data elements.
[Figure: the partial-keyword query (1*, 0*) selects a cluster of consecutive SFC cells of the 2-D grid, which are routed to the responsible nodes on the Chord ring.]
Query Optimization
[Figure: the query (010, *) over a grid with 3 bits per dimension decomposes into the SFC index clusters (000100), (000111, 001000), (001011), (011000, 011001), (011101, 011110).]
Query Optimization (cont.)
[Figure: the same query (010, *) refined recursively down the prefix tree of the index space; subtrees that cannot contain matching cells need not be refined further.]
Query Optimization (cont.)
Pruning nodes from the tree
SCRAP [GYG04]
- Use Z-ordering or a Hilbert space-filling curve to map multi-dimensional data down to a single dimension.
- Range-partition the one-dimensional data across the available nodes.
- Use a skip graph to route queries.
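The Z-ordering option is simple bit interleaving. A minimal 2-D sketch of the mapping step (the partitioning and skip-graph routing are separate concerns):

```python
def z_order(x, y, bits):
    """Interleave the bits of x and y to form the Z-order (Morton) key
    that SCRAP-style schemes use to linearize 2-D data: x occupies the
    even bit positions of the key, y the odd ones."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z
```

A multi-dimensional range query then becomes a set of 1-D key intervals, which the range partitioning assigns to nodes.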
MURK: Multi-dimensional Rectangulation with KD-trees
Basic concept:
- Partition the high-dimensional data space into "rectangles", each managed by one node.
- Partitioning follows a KD-tree: the space is split cyclically across the dimensions, and each leaf of the KD-tree corresponds to one rectangle.
Partitioning:
- When a node joins, it splits an existing region along one dimension into two parts of equal load, preserving load balance.
- Each node manages the data in one rectangle, preserving data locality.
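The equal-load split can be sketched as a KD-tree median split (an illustration of the idea, not MURK's actual protocol):

```python
def split_by_load(points, depth):
    """MURK-style split sketch: divide a region's points into two
    halves of equal load (equal point count) along one dimension,
    cycling through the dimensions with tree depth as in a KD-tree."""
    dim = depth % len(points[0])                 # cycle split dimension
    pts = sorted(points, key=lambda p: p[dim])   # order along that axis
    mid = len(pts) // 2                          # median = equal load
    return pts[:mid], pts[mid:]
```

Splitting at the data median (equal load) rather than at the geometric midpoint (equal space) is exactly the difference from CAN discussed on the next slide.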
Comparison with CAN
- The KD-tree-based partition is similar to CAN's: both hash data into a multi-dimensional space and try to keep the load balanced.
- The major difference: in CAN, a new node splits the existing node's data space equally, rather than splitting the load equally.
Routing in MURK
- Routing links each node to all of its neighboring nodes in the space.
- Greedy routing over these "grid" links: forward the query to the neighbor minimizing the Manhattan distance to the target.
Optimization for routing
- "Grid" links alone are not efficient for routing, so each node maintains skip pointers to speed it up.
- Two methods to choose the skip pointers:
  - Random: choose a node at random from the node set.
  - Space-filling skip graph: place skip pointers at exponentially increasing distances.
Discussion
- Routing neighbors are non-uniform, a consequence of per-node load balancing.
- A dynamically changing data distribution can unbalance the data held by each node.
Performance
[Performance figures omitted from the transcript.]
Conclusion
- For locality, MURK far outperforms SCRAP. For routing cost, SCRAP is efficient enough; skip pointers, such as space-filling-curve skips, are effective.
- SCRAP, using a space-filling curve with range partitioning, is efficient in low dimensions; MURK with a space-filling skip graph performs much better, especially in high dimensions.
pSearch
Motivation
- Numerous documents exist on the Internet. How can we efficiently find the most closely related documents without returning too many of little interest?
- Problem: semantically, documents are randomly distributed across nodes; exhaustive search brings overhead and offers no deterministic guarantees.
P2P & IR techniques
- Unstructured P2P search: a centralized index is a bottleneck; flooding-based techniques cause too much overhead; heuristic-based algorithms may miss important documents.
- Structured P2P search: DHT-based systems such as CAN and Chord are suitable for keyword matching.
- Traditional IR: advanced IR ranking algorithms could be adopted for P2P search. Two such techniques: the vector space model (VSM) and latent semantic indexing (LSI).
pSearch
- An IR system built on P2P networks: efficient and scalable like a DHT, accurate like advanced IR algorithms.
- Maps the semantic space onto nodes and conducts nearest neighbor search: VSM and LSI generate the semantic space; CAN organizes the nodes.
VSM & LSI
- VSM: documents and queries are expressed as term vectors. The weight of a term is its term frequency times its inverse document frequency (TF-IDF). Ranking is based on the similarity cos(X, Y) of the document and query term vectors X and Y.
- LSI: based on singular value decomposition, it transforms term vectors from the high-dimensional term space into a low-dimensional (L) semantic space. This statistically derived representation mitigates synonymy and noise in documents.
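The VSM side can be sketched in a few lines (a toy illustration; real systems use smoothed IDF and sparse vectors):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build TF-IDF term vectors for a tiny corpus: weight(term) =
    term frequency in the document * log(N / document frequency)."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))            # document frequency per term
    vocab = sorted(df)
    vecs = []
    for doc in docs:
        tf = Counter(doc)
        vecs.append([tf[t] * math.log(n / df[t]) for t in vocab])
    return vocab, vecs

def cosine(x, y):
    """Rank score: cosine of the angle between two term vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0
```

LSI would then project these vectors through a truncated SVD to length L before they are placed in the CAN semantic space.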
pSearch system
[Figure: pSearch architecture; document indices (DOC) and queries (QUERY) are mapped into the CAN-managed semantic space.]
Advantages of pSearch
- Exhaustive search within a bounded region, so it can in principle be fully accurate.
- Communication overhead is limited to transferring the query and references to the top documents, independent of corpus size.
- A good approximation of the global statistics is sufficient for pSearch.
Challenges
- Dimensionality mismatch between CAN and LSI.
- Uneven distribution of indices.
- Large search region.
Dimensionality mismatch
- There are not enough nodes (N) in the CAN to partition all L dimensions of the LSI semantic space.
- N nodes can partition only about log(N) low dimensions (the effective dimensionality), leaving the others unpartitioned.
Rolling index
Motivation:
- A small subset of the dimensions contributes most of the similarity, and the low dimensions are of high importance.
- Partition more dimensions of the semantic space by rotating the semantic vectors. For a semantic vector V = (v0, v1, ..., vL), each rotation shifts the vector by m dimensions; rotated space i uses the vector of the i-th rotation:
  Vi = (v_{i*m}, ..., vL, v0, v1, ..., v_{i*m-1}), with m = 2.3 ln(n).
- The rotated vectors are used to route the query and guide the search.
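The rotation itself is just a cyclic shift of the semantic vector, e.g.:

```python
def rotate(v, i, m):
    """i-th rolling-index rotation: cyclically shift the semantic
    vector by i*m positions, so a different block of m dimensions
    becomes the 'low' (partitioned) dimensions of rotated space i."""
    k = (i * m) % len(v)
    return v[k:] + v[:k]
```

Each rotated copy of a document's vector is indexed separately, which is where the p-fold storage overhead of the next slide comes from.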
Rolling index (cont.)
- Uses more storage (p times, one copy per rotated space) to keep the search local.
- Selective rotation is expected to process the high-importance dimensions efficiently.
Balancing the index distribution
- Content-aware node bootstrapping: a joining node randomly selects a document to publish, routes to the corresponding point in the semantic space, and takes over load there.
- Indices are thus distributed over more nodes; even with random selection, the load balances out for a large corpus.
Reducing the search space
- Curse of dimensionality: high-dimensional data is sparsely populated, and the distance between nearest neighbors becomes large.
- Exploiting data locality, the indices stored on nodes and recently processed queries are used to guide new searches.
Content-directed search
[Figure: content-directed search example; starting near the query point q, the search visits the numbered nodes in order, directed toward promising regions by stored indices and recent queries.]
Performance
[Performance figures omitted from the transcript.]
Conclusion
- pSearch is a P2P IR system that organizes content around semantics and achieves good accuracy with respect to system size, corpus size, and the number of returned documents.
- The rolling index resolves the dimensionality mismatch while limiting the space overhead and the number of visited nodes.
- Content-aware node bootstrapping balances node load to achieve index and query locality.
- Content-directed search reduces the number of nodes searched.
kNN searching in P2P Networks
Manesh Subhash
Ni Yuan
Sun Chong
Outline
- Introduction to searching in P2P
- Nearest neighbor queries
- Presentation of the ideas in the papers:
  1. "A Scalable Nearest Neighbor Search in P2P Systems"
  2. "Enhancing P2P File-Sharing with an Internet-Scale Query Processor"
Introduction to searching in P2P
- Exact-match queries: single-key retrieval; linear hashing; CAN, Chord, Pastry, Tapestry.
- Similarity-based queries: metric-space based.
- What do we search for? Rare items, popular items, or both.
Nearest neighbor queries
- The notion of a metric space: how similar are two objects from a given set of objects?
- Extensible to exact, range, and nearest neighbor queries.
- Computationally expensive.
- The distance function satisfies non-negativity, reflexivity, symmetry, and the triangle inequality.
Nearest neighbor queries (cont.)
A metric space is a pair (D, d), where D is the domain of objects and d is the distance function.
Similarity queries:
- Range: for F ⊆ D, a range query retrieves all objects x ∈ F with d(q, x) < ρ for a query object q ∈ F.
- Nearest neighbor: returns the object closest to q, or the k nearest objects for kNN: K ⊆ F with |K| = k, such that for all x ∈ K and y ∈ F \ K, d(q, x) ≤ d(q, y).
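The kNN definition above reads directly as code. A brute-force baseline (the cost a distributed index such as GHT* exists to avoid) can be sketched as:

```python
import heapq

def knn(objects, q, k, dist):
    """Brute-force kNN in a metric space: compute every object's
    distance to the query q and keep the k smallest, matching the
    definition |K| = k, d(q, x) <= d(q, y) for x in K, y outside K."""
    return heapq.nsmallest(k, objects, key=lambda x: dist(q, x))
```

Any function satisfying the four metric axioms can serve as `dist`; the example below uses absolute difference on numbers.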
Scalable NN search
- Uses the GHT* structure: a distributed metric index supporting range and k-NN queries.
- The GHT* architecture is composed of nodes (peers) that can insert, store, and retrieve objects using similarity queries.
- Assumptions: message passing, unique network identifiers, local buckets to store data, and exactly one bucket per object.
Example of the GHT* network
[Figure: two peers, Peer1 and Peer2; each peer holds a tree whose inner nodes route either to other peers via a Network Node ID (NNID) or to local buckets via a Bucket ID (BID).]
Scalable NN search (3)
Address Search Trees (AST):
- A binary search tree whose inner nodes hold routing information: two pivots and pointers to the left and right subtrees.
- Leaf nodes are pointers to data: local data is stored in buckets accessed by BID; non-local data is identified by NNID. (Every AST leaf is one of these two pointer types.)
Scalable NN search (4)
Searching the AST: the BPATH
- A BPATH represents a tree path as a string of n binary elements {0, 1}: p = (b1, b2, ..., bn).
- The traversing operator Ψ, given a query q and radius ρ, returns a BPATH.
- Ψ examines each inner node, using the two pivot values to decide which subtree(s) to follow.
- A radius of zero is used for exact matches and during inserts.
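The pivot test Ψ applies at each inner node can be sketched as a generalized-hyperplane traversal. The `Node` layout and names below are illustrative, not the GHT* wire format:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    # inner node: two pivots plus subtrees; leaf: a bucket of objects
    pivot_left: float = 0.0
    pivot_right: float = 0.0
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    bucket: Optional[list] = None

def traverse(node, q, radius, dist):
    """Collect every leaf whose region may intersect the query ball
    B(q, radius); with radius 0 exactly one root-to-leaf path (the
    BPATH) is followed, as used for exact matches and inserts."""
    if node.bucket is not None:
        return [node]
    out = []
    dl, dr = dist(q, node.pivot_left), dist(q, node.pivot_right)
    if dl - radius <= dr + radius:      # left region may contain hits
        out += traverse(node.left, q, radius, dist)
    if dr - radius < dl + radius:       # right region may contain hits
        out += traverse(node.right, q, radius, dist)
    return out
```

With a positive radius both conditions can hold, so the operator returns several leaves and the range query fans out to several buckets or peers.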
Scalable NN search (5)
k-NN searching in GHT*:
- Plain range searching is unsuitable without intrinsic knowledge of the data and the metric space used.
- Begin the search at the bucket with a high probability of containing k objects.
- If k objects are found, use the distance of the k-th object from q as the radius of a similarity search; sort the result and keep the first k.
- If fewer than k objects are found, we cannot determine an upper bound on the search radius for the k-th neighbor, so the range radius must be varied.
Scalable NN search (6)
Finding the k objects using range searches:
- Optimistic: minimize distance computations and bucket accesses. Use as the bounding distance that of the last candidate in the first accessed bucket, and iteratively expand the radius if fewer than k objects are found.
- Pessimistic: minimize the probability of needing another iteration. Use the distance between the pivot values at a level of the AST as the range radius, starting from the parent of the leaf, and execute the range query; if fewer than k are found, move up to the next level.
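The optimistic strategy's control loop can be sketched as follows; `range_search`, `r0`, and `growth` are illustrative stand-ins (the real algorithm derives its radii from bucket contents rather than a fixed growth factor):

```python
def knn_by_range(range_search, q, k, r0=1.0, growth=2.0):
    """Optimistic-style kNN sketch: issue range queries with a growing
    radius until at least k objects fall inside the ball, then keep
    the k closest.  `range_search(q, r)` must return a list of
    (distance, object) pairs within distance r of q."""
    r = r0
    while True:
        hits = range_search(q, r)
        if len(hits) >= k:
            return sorted(hits)[:k]   # k nearest by distance
        r *= growth                   # fewer than k found: expand
```

Each extra iteration is another distributed range query, which is why the pessimistic variant tries to pick a radius large enough to finish in one round.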
Scalable NN search (7)
Performance evaluation:
- With increasing k: the number of parallel distance computations remains stable, while the number of bucket accesses and the number of messages increase rapidly.
- With a growing dataset: the maximum hop count increases slowly; parallel distance computation costs stay nearly constant.
- Compared with range queries: slightly slower, because of the overhead of locating the first bucket.
Scalable NN search (8)
Performance of the scheme on the TXT dataset. [Figure omitted from the transcript.]
Scalable NN search (9)
Conclusion:
- A first effort in distributed index structures supporting k-NN search; GHT* is a scalable solution.
- Future work includes handling updates to the dataset, and other metric-space partitioning schemes.
Enhanced P2P - PIERSearch (1)
- An Internet-scale query processor. Queried data has a Zipfian distribution: popular items in the head, a long tail of rare items.
- PIERSearch is DHT-based. It is a hybrid system: Gnutella serves popular items, PIERSearch serves rare items. Integrated with the PIER system.
PIERSearch (2)
Gnutella query processing:
- Flooding-based; simple and effective for popular files.
- Optimized using ultrapeers (nodes that perform query processing on behalf of leaf nodes) and dynamic querying (larger TTLs).
- The team studied the characteristics of the Gnutella network.
PIERSearch (3)
Effectiveness of Gnutella:
- Query recall: the percentage of results available in the network that are returned.
- Query distinct recall: the percentage of distinct results returned, nullifying the effect of replicas.
- Experiments show that Gnutella is efficient for highly replicated content and queries with large result sets, but ineffective for rare content.
- Increasing the TTL does not reduce latency, but can improve recall.
PIERSearch (4)
Searching with PIERSearch:
- Keyword-based. The publisher maintains an inverted file indexed via the DHT, generating two kinds of tuples per item:
  Item(fileId, filename, fileSize, ipAddress, port)
  Inverted(keyword, fileId)
- Built on the underlying PIER system, a DHT-based Internet-scale relational query processor.
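The publishing step can be sketched as follows; `dht_put` and the tokenization are illustrative stand-ins for PIER's actual interfaces:

```python
def publish(dht_put, file_id, filename, filesize, ip, port):
    """Sketch of PIERSearch-style publishing: store one Item tuple
    keyed by fileId, plus one Inverted tuple per keyword of the
    filename, so keyword lookups resolve to fileIds via the DHT."""
    dht_put(file_id, ("Item", file_id, filename, filesize, ip, port))
    for kw in filename.lower().split():
        dht_put(kw, ("Inverted", kw, file_id))
```

A multi-keyword query then becomes a join over the Inverted tuples of its keywords, which PIER evaluates as a distributed relational query.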
PIERSearch (5)
Hybrid system: identifying rare items
- Query result size: result sets smaller than a fixed threshold are considered rare.
- Term frequency: items with at least one term below a threshold are considered rare.
- Term-pair frequency: less prone to skew when filenames contain popular words.
- Sampling: sample neighboring nodes and compute a lower-bound estimate of the number of replicas.
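The term-frequency heuristic above amounts to a one-line predicate (threshold and the frequency table are tuning inputs, not values from the paper):

```python
def is_rare(query_terms, term_freq, threshold):
    """Term-frequency heuristic for the hybrid split: treat a query as
    rare (route it to the DHT-based PIERSearch index rather than
    Gnutella) if any of its terms occurs fewer than `threshold`
    times in the observed corpus."""
    return any(term_freq.get(t, 0) < threshold for t in query_terms)
```

The other three identification methods plug into the same decision point: popular queries go to flooding, rare ones to the partial DHT index.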
PIERSearch (6)
Performance summary
PIERSearch (7)
Conclusion:
- Gnutella is highly effective for querying popular content, but ineffective for rare items.
- Building a partial index over the least-replicated content can improve query recall.
References
[APJ+04] A. Crainiceanu, P. Linga, J. Gehrke and J. Shanmugasundaram. Querying Peer-to-Peer Networks Using P-Trees. In WebDB, 2004.
[GYG04] P. Ganesan, B. Yang and H. Garcia-Molina. One Torus to Rule them all: Multi-dimensional Queries in P2P Systems. In WebDB, 2004.
[SP03] C. Schmidt and M. Parashar. Flexible Information Discovery in Decentralized Distributed Systems. In HPDC, 2003.
[STZ04] Y. Shu, K-L. Tan and A. Zhou. Adapting the Content Native Space for Load Balanced Indexing. In Database, Information Systems and Peer-to-Peer Computing, 2004.
[LHH+04] B. Loo, J. Hellerstein, R. Huebsch, S. Shenker and I. Stoica. Enhancing P2P File-Sharing with an Internet-Scale Query Processor. In VLDB, 2004.
[TXD03] C. Tang, Z. Xu and S. Dwarkadas. Peer-to-Peer Information Retrieval Using Self-Organizing Semantic Overlay Networks. In SIGCOMM, 2003.
[ZBG04] P. Zezula, M. Batko and C. Gennaro. A Scalable Nearest Neighbor Search in P2P Systems. In Database, Information Systems and Peer-to-Peer Computing, 2004.
Thank you!