SASH Spatial Approximation Sample Hierarchy Authors: Michael E. Houle, Jun Sakuma.

SASH features

Index data in high-dimensional space

Fast construction of the index: O(N log N) time

Fast lookups of k approximate nearest neighbors: O(k log N) time

Drawbacks of other methods

Slow construction

Require a k-NN index to construct a k-NN index

Slow lookups

Reduce to grid searches or sequential search

But they may allow for true nearest neighbor queries

SASH construction

Two-phase process

Phase 1: divide the set into a hierarchy of subsets

Phase 2: link elements of the hierarchy together

SASH construction: phase 1

Start with a set of points in a metric space

Divide the set in half randomly

Repeatedly divide the “second half” of the set until there is one element remaining

This hierarchy of sets reminds me of a skip list

SASH subsets

The partitioning process yields roughly log N sets of size 2^k, 0 ≤ k ≤ log N

Label the sets S0 (for the set containing one element, namely the root) through Sh (for the largest set containing approximately N/2 elements)
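The random halving of phase 1 can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `split_into_levels` and the `seed` parameter are assumptions made for the example, and the larger half is peeled off at each step so that the level sizes come out close to powers of two.

```python
import random

def split_into_levels(points, seed=0):
    """Randomly halve the point set until one element (the root) remains.

    Returns [S_0, S_1, ..., S_h]: S_0 holds the single root and S_h holds
    roughly half of all points. (Hypothetical helper, for illustration.)
    """
    rng = random.Random(seed)
    remaining = list(points)
    rng.shuffle(remaining)              # a random order makes every split random
    levels = []
    while len(remaining) > 1:
        take = (len(remaining) + 1) // 2    # the "second half" becomes a level
        levels.append(remaining[-take:])
        remaining = remaining[:-take]       # keep dividing what is left
    levels.append(remaining)            # the one remaining element is the root
    levels.reverse()                    # smallest set first: S_0, S_1, ..., S_h
    return levels

levels = split_into_levels(range(15))
sizes = [len(level) for level in levels]    # [1, 2, 4, 8] for 15 points
```

When N is not one less than a power of two, the sizes are only approximately powers of two, which matches the "roughly" in the slide.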

SASH appearance

A SASH is a hierarchy of sets of size 2^k, 0 ≤ k ≤ h, with directed edges from the set of size 2^(k-1) to the set of size 2^k

A SASH is generally not a tree, but it has some of the flavor of a binary tree with edges from sets of a certain size to sets that are double that size.

A SASH usually has many more edges.

SASH construction: phase 2

The SASH is constructed inductively by first setting SASH_0 = S_0.

For 1 ≤ i ≤ h, SASH_{i-1} is a partial SASH on the set S_0 ∪ S_1 ∪ … ∪ S_{i-1}

SASH_i is constructed by starting with SASH_{i-1} and producing new directed edges from elements in S_{i-1} to elements in S_i.

SASH construction: phase 2

Let SASH_0 be the root, S_0

For 1 ≤ i ≤ h, assume SASH_{i-1} exists, then

For each c in S_i, use SASH_{i-1} to find P possible parents of c in S_{i-1}

Once all c in Si link to possible parents, each p in Si-1 links to the C closest children that chose it as a possible parent

If some orphan objects in Si have no parents linking to them, repeat the above, allowing them to try to link to more parents.
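The linking step for a single level can be sketched as below. Several things here are assumptions made for the example rather than details from the slides: the name `link_level`, the use of a brute-force scan over the parents as a stand-in for the search through the partial SASH, and the rule that a parent accepts orphans only while it has fewer than C children.

```python
def link_level(parents, children, P, C, dist):
    """Link one level: children nominate parents, parents accept up to C.

    Brute-force nearest-parent selection stands in for the search through
    the partial SASH (an assumption made to keep the sketch short).
    """
    children_of = {p: [] for p in parents}
    pending, limit = list(children), P
    while pending:
        # Each pending child nominates its `limit` closest parents.
        requests = {p: [] for p in parents}
        for c in pending:
            for p in sorted(parents, key=lambda p: dist(p, c))[:limit]:
                requests[p].append(c)
        # Each parent accepts its closest requesters, up to C children total.
        adopted = set()
        for p, asked in requests.items():
            room = C - len(children_of[p])
            accepted = sorted(asked, key=lambda c: dist(p, c))[:room]
            children_of[p].extend(accepted)
            adopted.update(accepted)
        # Children nobody accepted are orphans: retry with twice as many parents.
        pending = [c for c in pending if c not in adopted]
        if pending and limit >= len(parents):
            raise ValueError("not enough parent capacity for all children")
        limit *= 2
    return children_of

links = link_level([0.0, 10.0], [0.1, 0.2, 0.3, 9.0], P=1, C=2,
                   dist=lambda a, b: abs(a - b))
```

In this small run, three children crowd around the parent at 0.0, which can keep only two of them. The child at 0.3 is orphaned in the first round and is adopted by the farther parent at 10.0 once it is allowed to nominate two parents.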

SASH parameters: P and C

In practice, P is small, and C is at least twice P (their experiments use C=4P)

It is likely that each object will have at least one parent that links to it, and if C > 2P, all orphans can eventually find parents

Children link to “nearby” parents, and parents then link to “nearby” children

The symmetric use of “nearby” gives good results, even though the relation isn’t really symmetric.

A Completed SASH

Example on the real line with P=2 and C=4

Randomly divide the set in half until reaching one point

The sets Si

SASH Construction Example

Red nodes are in a completed SASH. Light blue nodes are in the process of being added to a SASH. Black nodes have not been processed.

Links from children to parents are green, and links from parents to children are red.

SASH0: Construction (P=2, C=4)

SASH0: Complete

SASH1: Link children to parents

SASH1: Link parents to children

SASH1: Complete

SASH2: Construction

SASH2: Complete

SASH3: Construction

Some of the green arrows were not reversed

Because parents only link to their C=4 closest children

The green arrows are not parts of the completed SASH

SASH3: Complete

The green links were not returned to the children

The three purple nodes are orphans

Link them by doubling P as needed.

Orphans link to P=4 parents

Parents link to up to C=4 children

Two orphans were linked, and one remains

Link the final orphan to P=8 parents

Link parents to the orphan

The final green arrows are removed

SASH4: Complete

What am I hiding from you about this algorithm?

This part can be expensive

Cost of this operation

For each c in S_i, use SASH_{i-1} to find P possible parents of c in S_{i-1}

There are N/2 points in S_h and N/4 points in S_{h-1}, so a brute-force comparison needs N^2/8 distance checks

Or we could build an index, such as a quadtree, and do a k-NN search directly

This is expensive, and is the catch-22 of most k-NN algorithms

SASH uses an O(N log N) method
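To make the gap concrete, the two costs can be compared numerically. This is a rough sketch: `brute_force_checks` follows the N/2 times N/4 count from the slide, while `sash_checks_estimate` is an assumed upper-bound estimate (about P*C distance checks per level over log2 N levels for each of the N points), not a figure from the authors.

```python
import math

def brute_force_checks(n):
    """Distance checks to compare every point of the largest set
    (n/2 points) with every point of the set above it (n/4 points)."""
    return (n // 2) * (n // 4)

def sash_checks_estimate(n, P, C):
    """Assumed rough bound: each of n points descends about log2(n)
    levels and examines at most P*C children per level."""
    return int(n * P * C * math.log2(n))

n = 1024
quadratic = brute_force_checks(n)            # 131072, i.e. n**2 / 8
hierarchical = sash_checks_estimate(n, 2, 4) # 81920
```

For small N the two numbers are comparable, but the brute-force count grows quadratically while the estimate grows as N log N.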

Avoiding k-NN search in SASH construction

Instead, perform a partial search query on the new point using the partially constructed SASH

Start with the root as the current set

While not at the bottom of the partial SASH, let the current set equal the P children of the current set that are closest to the new point

Approximate parent search without a k-NN graph

Start at the root

Search children

Keep the 2 children closest to the query point

Search children

Keep the 2 children closest to the query point

Search children

Keep the 2 closest children to the query point

These are the approximate parents of the query point

Important points:

No k-NN index needed

O(log N) search time for each element

Up to P objects retained at each level, and each of those has up to C children

Only those PC children are searched at each level to find the P closest objects to send down to the next level.
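The descent described above can be sketched in a few lines. The representation is an assumption made for the example: the partial SASH is a dictionary `children_of` mapping each node to its list of children, and the function name `find_parents` is hypothetical. The loop stops as soon as the kept nodes have no children, which in a partial SASH under construction is the bottom level.

```python
def find_parents(root, children_of, query, P, dist):
    """Descend from the root, keeping the P children closest to the query
    at each level; the last set kept is the approximate parent set."""
    current = [root]
    while True:
        # Gather the (at most P*C) children of the nodes kept so far.
        pool = {c for node in current for c in children_of.get(node, ())}
        if not pool:
            return current              # bottom of the partial SASH
        current = sorted(pool, key=lambda c: dist(c, query))[:P]

# A small hand-built hierarchy on the real line (illustrative data only).
children_of = {8: [4, 12], 4: [2, 6], 12: [10, 14],
               2: [1, 3], 6: [5, 7], 10: [9, 11], 14: [13, 15]}
parents = find_parents(8, children_of, 6.4, P=2, dist=lambda a, b: abs(a - b))
# parents == [7, 5]
```

No distance is ever computed against a node outside the current pool, which is why no separate k-NN index is needed.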

SASH Issues

When a large number of children are clustered near a few parents, some children will be orphaned and have parents that are farther away

A SASH is mostly static

Some new nodes can be added, but clusters need to be filtered up through the hierarchy during the construction process

Queries with a completed SASH

Similar to the process described above to get approximate parents

Two types of searches described

Uniform: keep the same number of children at each level

Geometric: start the search with a small number of nodes kept near the root, then increase the number at each level

Queries with a completed SASH

The big difference between constructing the SASH and using it for queries is that in the construction process, only the nodes reached at the bottom level of the partial SASH are used, as candidate parents.

In a query on a completed SASH, all of the intermediate points visited can be used in the final k-ANN search

Geometric search

Keeping too few points near the root may lead to bad results. Instead of starting the quota near 1, the authors found that keeping at least 0.5*PC nodes (4 in the case of P=2, C=4) at the smaller levels near the root sufficed to keep the search broad enough
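The slides do not state the exact schedule for the geometric search, so the sketch below assumes one plausible choice: the quota at level i grows as k^(i/h), with 0.5*P*C as a lower bound. The function name `level_quotas` and this growth rule are assumptions made for illustration and may differ from the authors' formula.

```python
import math

def level_quotas(k, h, P, C, geometric=True):
    """Number of nodes to keep at levels 1..h of the SASH.

    Uniform: k at every level. Geometric (assumed schedule): k**(i/h),
    never dropping below the floor 0.5*P*C.
    """
    floor = 0.5 * P * C
    quotas = []
    for i in range(1, h + 1):
        target = max(k ** (i / h), floor) if geometric else k
        # The small tolerance guards against floating-point overshoot.
        quotas.append(math.ceil(target - 1e-9))
    return quotas

geometric = level_quotas(64, 6, P=2, C=4)
uniform = level_quotas(64, 6, P=2, C=4, geometric=False)
```

With k=64 and h=6, the geometric quotas start at the floor of 4 near the root and reach 64 only at the bottom level, while the uniform quotas are 64 throughout.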

Search process

Let k_i be the number of elements we will keep at level i of the SASH

Let U_0 = S_0, the root

For 1 ≤ i ≤ h

Find all children of elements in U_{i-1}

Let U_i be the k_i children of U_{i-1} that are closest to the query point

Search process

After the sets U_0, …, U_h have been determined, let U = U_0 ∪ U_1 ∪ … ∪ U_h

Then the final result is the k closest points in U to the query point
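The search process above can be sketched as follows. As before, the dictionary `children_of` and the function name `sash_query` are assumptions made for the example; `quotas` is the list of per-level limits for levels 1 through h.

```python
def sash_query(root, children_of, query, k, quotas, dist):
    """Approximate k-NN query: keep quotas[i-1] nodes at level i, then
    return the k closest nodes among everything that was kept."""
    visited = [root]                    # U_0 is the root
    current = [root]
    for quota in quotas:
        # All children of the nodes kept at the previous level.
        pool = {c for node in current for c in children_of.get(node, ())}
        if not pool:
            break
        current = sorted(pool, key=lambda c: dist(c, query))[:quota]
        visited.extend(current)         # U grows by this level's kept nodes
    # Exact k-NN search restricted to the visited set U.
    return sorted(set(visited), key=lambda c: dist(c, query))[:k]

children_of = {8: [4, 12], 4: [2, 6], 12: [10, 14],
               2: [1, 3], 6: [5, 7], 10: [9, 11], 14: [13, 15]}
result = sash_query(8, children_of, 6.4, k=3, quotas=[2, 2, 2],
                    dist=lambda a, b: abs(a - b))
# result == [6, 7, 5]
```

Unlike the parent search used during construction, the nodes kept at intermediate levels (here 6, from the second level) can appear in the final answer.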

Search complexity

Each U_i has at most k elements, and each of those has at most C children, so we perform at most Ck distance calculations at each of the log N levels, for O(k log N) time when C is a constant

Once U has been determined, we perform a true k-NN search on a set of size k log N

Use of transitivity when searching

We follow links from parents to children under the assumption that children are close to parents

We keep only the objects closest to the query at each level

This gives good results in practice, but may fail in pathological cases

Pathological example of failure of transitivity

Pathological case on the real line

Assume the rest of the SASH is to the left or the right of the chains shown (following the dotted arrows)

The query will return two of the nodes visited at the top, even though there are points closer to the query, Q

Pathological example of failure of transitivity when k=2

[Figure sequence on nodes R, S, T, A, B and query point Q; one frame per caption below]

A search for Q first finds S and T

T’s children are closer to Q than those of S

The search continues below T

R and S are returned as the k=2 nearest neighbors of Q

However, A and B are the true k=2 nearest neighbors of Q
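The failure can be reproduced with a small hand-built hierarchy. Everything specific in this sketch is an assumption made for illustration: the coordinates, the extra node names (S1, S2, T1 to T4), and the placement of A and B beneath S's far-away children are not taken from the figure. Only the sequence of events follows the captions.

```python
# Hypothetical positions on the real line; the query Q sits at 0.0.
positions = {"R": 1.0, "S": -2.0, "T": 5.0,
             "S1": -20.0, "S2": 20.0, "T1": 6.0, "T2": 7.0,
             "A": -0.2, "B": 0.1, "T3": 8.0, "T4": 9.0}
children_of = {"R": ["S", "T"], "S": ["S1", "S2"], "T": ["T1", "T2"],
               "S1": ["A"], "S2": ["B"], "T1": ["T3"], "T2": ["T4"]}

def greedy_query(root, q, k, keep):
    """Level-by-level search keeping `keep` nodes per level, then the
    k closest among all visited nodes."""
    def d(name):
        return abs(positions[name] - q)
    visited, current = [root], [root]
    while True:
        pool = {c for n in current for c in children_of.get(n, ())}
        if not pool:
            break
        current = sorted(pool, key=d)[:keep]
        visited.extend(current)
    return sorted(visited, key=d)[:k]

def exact_query(q, k):
    """Brute-force k-NN over every node, for comparison."""
    return sorted(positions, key=lambda n: abs(positions[n] - q))[:k]

approximate = greedy_query("R", 0.0, k=2, keep=2)   # ["R", "S"]
exact = exact_query(0.0, k=2)                       # ["B", "A"]
```

S's children are far from Q, so the search drops them and follows T's chain, never reaching A and B even though they are the closest points to Q.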

SASH Comparison to M-tree

M-tree (Ciaccia, Patella, Zezula): deals with overlapping objects, uses a balanced hierarchy with buckets and spheres as regions

SASH-4: P=4, C=4P

MEDLINE: 1,055,073 objects with 1,101,003 attributes. Represents keywords found in medical abstracts. Average 75 nonzero attributes per object

SSeq = sequential search on a randomly selected subset of the data

Complexity Comparison

Speed vs. accuracy

Internal SASH Comparisons

BactORF – Bacterial protein sequences; 385,039 objects with 40,000 attributes – Sparse: 125 nonzero attributes per object

VidFrame – Video frames; 9,000,000 objects with 32 attributes, densely nonzero

SASH P=3,4,5,8,16; C=4P

Boosted SASH

Different dataset sizes

Conclusion

SASH indexes high-dimensional spaces

Efficient construction and query times

Uses approximate similarity, and a generalization of equivalence relations (symmetry and a weak form of transitivity) to get good results

Large body of work in fuzzy logic on transitivity and approximate similarity
