SASH Spatial Approximation Sample Hierarchy Authors: Michael E. Houle, Jun Sakuma.
Date posted: 18-Dec-2015
SASH
Spatial Approximation Sample Hierarchy
Authors: Michael E. Houle, Jun Sakuma
SASH features
Index data in high-dimensional space
Fast construction of the index: O(N log N)
Fast lookups of k approximate nearest neighbors: O(k log N)
Drawbacks of other methods
Slow construction: they require a k-NN index to construct a k-NN index
Slow lookups: they reduce to grid searches or sequential search
But they may allow for true nearest neighbor queries
SASH construction
Two-phase process
Phase 1: divide the set into a hierarchy of subsets
Phase 2: link elements of the hierarchy together
SASH construction: phase 1
Start with a set of points in a metric space
Divide the set in half randomly
Repeatedly divide the “second half” of the set until there is one element remaining
This hierarchy of sets reminds me of a skip list
SASH subsets
The partitioning process yields roughly log N sets of size 2^k, 0 ≤ k ≤ log N
Label the sets S_0 (the set containing one element, namely the root) through S_h (the largest set, containing approximately N/2 elements)
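The random halving of phase 1 can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `partition_levels` and the `seed` parameter are invented here, and the slides do not say which half is set aside at each step, so the sketch simply puts the larger share into the deeper level.

```python
import random

def partition_levels(points, seed=0):
    """Split points into levels S_0..S_h by repeated random halving.

    Entry i of the result is S_i: S_0 holds the single root and the
    last entry holds roughly half of all points.
    """
    rng = random.Random(seed)          # hypothetical: fixed seed for repeatability
    remaining = list(points)
    rng.shuffle(remaining)
    levels = []
    while len(remaining) > 1:
        half = len(remaining) // 2
        # Set one half aside as the next-deepest level, keep splitting the rest.
        levels.append(remaining[half:])
        remaining = remaining[:half]
    levels.append(remaining)           # the one element left is the root, S_0
    levels.reverse()
    return levels

levels = partition_levels(range(15))
print([len(s) for s in levels])        # [1, 2, 4, 8]
```

With 15 points the level sizes come out as exact powers of two. For other values of N the sizes are only approximately 2^k, which matches the "roughly" in the slide.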
SASH appearance
A SASH is a hierarchy of sets of size 2^k, 0 ≤ k ≤ h, with directed edges from the set of size 2^(k-1) to the set of size 2^k
A SASH is generally not a tree, but it has some of the flavor of a binary tree with edges from sets of a certain size to sets that are double that size.
A SASH usually has many more edges.
SASH construction: phase 2
The SASH is constructed inductively by first setting SASH_0 = S_0.
For 1 ≤ i ≤ h, SASH_{i-1} is a partial SASH on the set S_0 ∪ S_1 ∪ … ∪ S_{i-1}
SASH_i is constructed by starting with SASH_{i-1} and producing new directed edges from elements in S_{i-1} to elements in S_i.
SASH construction: phase 2
Let SASH_0 be the root, S_0
For 1 ≤ i ≤ h, assume SASH_{i-1} exists, then:
For each c in S_i, use SASH_{i-1} to find P possible parents of c in S_{i-1}
Once all c in S_i link to possible parents, each p in S_{i-1} links to the C closest children that chose it as a possible parent
If some orphan objects in S_i have no parents linking to them, repeat the above, allowing them to try to link to more parents.
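The linking step for a single level might look like the Python sketch below. Several things here are assumptions rather than the authors' method: the name `link_level` is invented, candidate parents are found by brute force purely as a stand-in for the search through SASH_{i-1}, points are assumed to be distinct and hashable, and each orphan is attached to a single parent with spare capacity (a simplification of "repeat the above").

```python
def link_level(parents, children, dist, P=2, C=4):
    """Return {parent: [children]} linking level S_i to level S_{i-1}."""
    def nearest_parents(c, count):
        # Brute-force stand-in for the approximate parent search.
        return sorted(parents, key=lambda p: dist(p, c))[:count]

    # Step 1: every child nominates its P nearest candidate parents.
    requests = {p: [] for p in parents}
    for c in children:
        for p in nearest_parents(c, P):
            requests[p].append(c)

    # Step 2: every parent keeps only the C closest children that nominated it.
    links = {p: sorted(req, key=lambda c: dist(p, c))[:C]
             for p, req in requests.items()}

    # Step 3: orphans retry with a doubled candidate count until a
    # candidate parent with spare capacity turns up.
    linked = {c for kids in links.values() for c in kids}
    for c in children:
        if c in linked:
            continue
        count = 2 * P
        while True:
            open_parents = [p for p in nearest_parents(c, count)
                            if len(links[p]) < C]
            if open_parents:
                links[open_parents[0]].append(c)
                break
            if count >= len(parents):
                raise ValueError("no parent has spare capacity")
            count *= 2
    return links

# Example on the real line: three children crowd around the parent at 0.
print(link_level([0.0, 100.0], [1.0, 2.0, 3.0],
                 lambda a, b: abs(a - b), P=1, C=2))
```

In the example, the parent at 0 accepts only its two closest children, so the child at 3 is orphaned and ends up under the more distant parent at 100.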
SASH parameters: P and C
In practice, P is small, and C is at least twice P (their experiments use C=4P)
It is likely that objects will have at least one parent that links to them, and if C > 2P, all orphans can eventually find parents
Children link to “nearby” parents, and parents then link to “nearby” children
The symmetric use of “nearby” gives good results, even though the relation isn’t really symmetric.
A Completed SASH
Example on the real line with P=2 and C=4
Randomly divide the set in half until reaching one point
The sets Si
SASH Construction Example
Red nodes are in a completed SASH. Light blue nodes are in the process of being added to a SASH. Black nodes have not been processed.
Links from children to parents are green, and links from parents to children are red.
SASH0:Construction P=2, C=4
SASH0:Complete
SASH1:Construction P=2, C=4
SASH1:Link children to parents
SASH1:Link parents to children
SASH1:Complete
SASH2:Construction
SASH2:Link children to parents
SASH2:Link parents to children
SASH2:Complete
SASH3:Construction
SASH3:Link children to parents
SASH3:Link parents to children
Some of the green arrows were not reversed
Because parents only link to their C=4 closest children
The green arrows are not parts of the completed SASH
SASH3:Complete
SASH4:Construction P=2, C=4
SASH4:Link children to parents
SASH4:Link parents to children
The green links were not returned to the children
The three purple nodes are orphans
Link them by doubling P as needed.
Orphans link to P=4 parents
Parents link to up to C=4 children
Two orphans were linked, and one remains
Link the final orphan to P=8 parents
Link parents to the orphan
The final green arrows are removed
SASH4:Complete
What am I hiding from you about this algorithm?
For 1 ≤ i ≤ h, assume SASH_{i-1} exists, then:
For each c in S_i, use SASH_{i-1} to find P possible parents of c in S_{i-1}
Once all c in S_i link to possible parents, each p in S_{i-1} links to the C closest children that chose it as a possible parent
If some orphan objects in S_i have no parents linking to them, repeat the above, allowing them to try to link to more parents.
Finding the P possible parents of each c can be expensive
Cost of this operation
For each c in S_i, use SASH_{i-1} to find P possible parents of c in S_{i-1}
There are N/2 points in S_h, and N/4 points in S_{h-1}, for N^2/8 checks if every pair is compared
Or we could build an index, like a quadtree, and do a k-NN search directly
This is expensive, and is the catch-22 of most k-NN algorithms
SASH uses an O(N log N) method
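To get a feel for the gap, the short Python sketch below compares the N^2/8 pairwise checks for the last level with the order-of-magnitude figure N log N. The function names are invented for illustration, and the second figure ignores constant factors, so treat the numbers as rough.

```python
import math

def brute_force_checks(n):
    """Distance checks when every point of S_h meets every point of S_{h-1}."""
    return (n // 2) * (n // 4)

def n_log_n(n):
    """Order-of-magnitude figure N log2 N, ignoring constant factors."""
    return n * math.log2(n)

n = 2 ** 20                        # about a million points
print(brute_force_checks(n))       # 137438953472
print(int(n_log_n(n)))             # 20971520
```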
Avoiding k-NN search in SASH construction
Instead, perform a partial search query on the new point using the partially constructed SASH
Start with the root as the current set
While not at the bottom of the partial SASH, let the current set equal the P children of the current set that are closest to the new point
Approximate parent search without a k-NN graph
Start at the root
Search children
Keep the 2 children closest to the query point
Search children
Keep the 2 children closest to the query point
Search children
Keep the 2 closest children to the query point
These are the approximate parents of the query point
Important points:
No k-NN index needed
O(log N) search time for each element
Up to P objects retained at each level, and each of those has up to C children
Only those PC children are searched at each level to find the P closest objects to send down to the next level.
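The descent described above can be sketched as follows. The representation is an assumption made for illustration: `children_of` is a hypothetical dictionary from each node to its list of children, `depth` is the number of levels in the partial SASH below the root, and ties are broken by node value so the result is deterministic.

```python
def find_parents(query, root, children_of, dist, depth, P=2):
    """Descend `depth` levels from the root, keeping the P closest nodes per level."""
    current = [root]
    for _ in range(depth):
        # Gather the children of the retained nodes (at most P*C candidates).
        candidates = {c for node in current for c in children_of.get(node, [])}
        if not candidates:
            break
        # Keep only the P candidates closest to the query point.
        current = sorted(candidates, key=lambda c: (dist(c, query), c))[:P]
    return current

# Example on the real line; this partial SASH happens to be a tree.
children_of = {8: [4, 12], 4: [2, 6], 12: [10, 14],
               2: [1, 3], 6: [5, 7], 10: [9, 11], 14: [13, 15]}
print(find_parents(6.4, 8, children_of, lambda a, b: abs(a - b), depth=3))
# [7, 5]
```

The nodes returned at the last level serve as the approximate parents of the query point. No distance is ever computed to a node outside the children of the retained set.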
SASH Issues
When a large number of children are clustered near a few parents, some children will be orphaned and have parents that are farther away
A SASH is mostly static
Some new nodes can be added, but clusters need to be filtered up through the hierarchy during the construction process
Queries with a completed SASH
Similar to the process described above to get approximate parents
Two types of searches described
Uniform: keep the same number of children at each level
Geometric: start the search with a small number of nodes kept at each level, then increase it
Queries with a completed SASH
The big difference between constructing the SASH and using it for queries is that in the construction process, only the nodes found at the bottom level of the partial SASH are used.
In a query on a completed SASH, all of the intermediate points visited can be used in the final k-ANN search
Geometric search
Keeping too few points near the root may lead to bad results, so instead of starting near 1, the authors found that keeping 0.5*PC nodes (4 in the case of P=2, C=4) at the levels near the root sufficed to keep the search broad enough
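A per-level quota for the two search types could be computed as in the sketch below. The growth rule used for the geometric case, k^(i/h) rounded to an integer, is a hypothetical choice made only to illustrate a geometric increase with the 0.5*PC floor; the slides do not give the authors' exact schedule, and both function names are invented.

```python
def uniform_quotas(k, h):
    """Uniform search: keep the same number of nodes at every level 0..h."""
    return [k] * (h + 1)

def geometric_quotas(k, h, P=2, C=4):
    """Geometric search: grow the quota toward k, never dropping below 0.5*P*C.

    The k ** (i / h) growth rule is an assumption for illustration only.
    """
    floor = (P * C) // 2
    if h == 0:
        return [max(floor, k)]
    return [max(floor, round(k ** (i / h))) for i in range(h + 1)]

print(uniform_quotas(64, 6))     # [64, 64, 64, 64, 64, 64, 64]
print(geometric_quotas(64, 6))   # [4, 4, 4, 8, 16, 32, 64]
```

With P=2 and C=4 the floor is 4, so the levels near the root keep 4 nodes each and the quota reaches the full k only at the bottom level.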
Search process
Let k_i be the number of elements we will keep at level i of the SASH
Let U_0 = S_0, the root
For 1 ≤ i ≤ h:
Find all children of elements in U_{i-1}
Let U_i be the k_i children of U_{i-1} that are closest to the query point
Search process
After the sets U_0, …, U_h have been determined, let U = U_0 ∪ U_1 ∪ … ∪ U_h
Then the final result is the k closest points in U to the query point
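Put together, the query procedure might be sketched like this. As before, `children_of` is a hypothetical dictionary from node to children, `quotas` is a list holding k_1, …, k_h, and the function name is invented; this is a minimal sketch under those assumptions, not the authors' implementation.

```python
def sash_query(query, root, children_of, dist, quotas, k):
    """Approximate k-NN: collect U_0..U_h, then rank their union exactly."""
    current = [root]                 # U_0
    visited = {root}                 # running union U
    for k_i in quotas:               # quotas = [k_1, ..., k_h]
        candidates = {c for node in current for c in children_of.get(node, [])}
        if not candidates:
            break
        # U_i: the k_i children of U_{i-1} closest to the query point.
        current = sorted(candidates, key=lambda c: (dist(c, query), c))[:k_i]
        visited.update(current)
    # Final step: a true k-NN search restricted to the visited nodes.
    return sorted(visited, key=lambda c: (dist(c, query), c))[:k]

children_of = {8: [4, 12], 4: [2, 6], 12: [10, 14],
               2: [1, 3], 6: [5, 7], 10: [9, 11], 14: [13, 15]}
print(sash_query(6.4, 8, children_of, lambda a, b: abs(a - b),
                 quotas=[2, 2, 2], k=2))
# [6, 7]
```

Unlike the parent search used during construction, the answer here can come from any level: in the example, 6 was visited at level 2 and 7 at level 3.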
Search complexity
Each U_i has at most k elements, and each of those has at most C children, so we perform at most Ck distance calculations at each of log N levels, for O(k log N) time when C is treated as a constant
Once U has been determined, we perform a true k-NN search on a set of size at most k log N
Use of transitivity when searching
We follow links from parents to children under the assumption that children are close to parents
We keep only the objects closest to the query at each level
This gives good results in practice, but may fail in pathological cases
Pathological example of failure of transitivity
Pathological case on the real line
Assume the rest of the SASH is to the left or the right of the chains shown (following the dotted arrows)
The query will return two of the nodes visited at the top, even though there are points closer to the query, Q
Pathological example of failure of transitivity when k=2
[Figure: nodes R, S, T, A and B and the query point Q on the real line]
A search for Q first finds S and T
T’s children are closer to Q than those of S
The search continues below T
R and S are returned as the k=2 nearest neighbors of Q
However, A and B are the true k=2 nearest neighbors of Q
SASH Comparison to MTree
MTree (Ciaccia, Patella, Zezula) – Deals with overlapping objects, uses a balanced hierarchy with buckets and spheres as regions
SASH-4: P=4, C=4P
MEDLINE – 1,055,073 objects with 1,101,003 attributes. Represents keywords found in medical abstracts. Average 75 nonzero attributes per object
SSeq = sequential search on a randomly selected subset of the data
Complexity Comparison
Speed vs. accuracy
Internal SASH Comparisons
BactORF – Bacterial protein sequences; 385,039 objects with 40,000 attributes – Sparse: 125 nonzero attributes per object
VidFrame – Video; 9,000,000 objects with 32 attributes, densely nonzero
SASH P=3,4,5,8,16; C=4P
Boosted SASH
Different dataset sizes
Conclusion
SASH indexes high-dimensional spaces
Efficient construction and query times
Uses approximate similarity, and a generalization of equivalence relations (symmetry and a weak form of transitivity), to get good results
Large body of work in fuzzy logic on transitivity and approximate similarity