SASH Spatial Approximation Sample Hierarchy Authors: Michael E. Houle, Jun Sakuma.

Date posted: 18-Dec-2015


SASH features

  • Index data in high-dimensional space
  • Fast construction of the index: O(N log N)
  • Fast lookups of k approximate nearest neighbors: O(k log N)

Drawbacks of other methods

  • Slow construction
      - Require a k-NN index to construct a k-NN index
  • Slow lookups
      - Reduce to grid searches or sequential search
  • But they may allow for true nearest neighbor queries

SASH construction

  • Two-phase process
      - Phase 1: divide the set into a hierarchy of subsets
      - Phase 2: link elements of the hierarchy together

SASH construction: phase 1

  • Start with a set of points in a metric space
  • Divide the set in half randomly
  • Repeatedly divide the “second half” of the set until there is one element remaining
  • This hierarchy of sets reminds me of a skip list
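
The halving step of phase 1 can be sketched in a few lines of Python. This is only an illustration: the function name partition_levels and the seed parameter are made up for this sketch, and the slides do not say how odd-sized sets are split, so the larger half is kept as the level here.

```python
import random

def partition_levels(points, seed=0):
    """Split points into random levels S_0..S_h, where S_0 holds the single root."""
    rng = random.Random(seed)          # seeded only to make the sketch repeatable
    remaining = list(points)
    rng.shuffle(remaining)             # a random order makes every split a random split
    levels = []
    while len(remaining) > 1:
        half = len(remaining) // 2
        levels.append(remaining[half:])    # this half becomes the next-largest level
        remaining = remaining[:half]       # the other half is divided again
    levels.append(remaining)               # the last point left over is the root, S_0
    levels.reverse()                       # order the result as S_0, S_1, ..., S_h
    return levels

levels = partition_levels(range(15))
print([len(s) for s in levels])   # [1, 2, 4, 8]
```

With 15 points the level sizes double from the root down, which is the skip-list-like shape the slide mentions.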

SASH subsets

  • The partitioning process yields roughly log N sets of size 2^k, 0 ≤ k ≤ log N
  • Label the sets S_0 (the set containing one element, namely the root) through S_h (the largest set, containing approximately N/2 elements)

SASH appearance

  • A SASH is a hierarchy of sets of size 2^k, 0 ≤ k ≤ h, with directed edges from the set of size 2^(k-1) to the set of size 2^k
  • A SASH is generally not a tree, but it has some of the flavor of a binary tree, with edges from sets of a certain size to sets that are double that size
  • A SASH usually has many more edges than a tree

SASH construction: phase 2

  • The SASH is constructed inductively, starting from SASH_0 = S_0
  • For 1 ≤ i ≤ h, SASH_{i-1} is a partial SASH on the set S_0 ∪ S_1 ∪ … ∪ S_{i-1}
  • SASH_i is constructed by starting with SASH_{i-1} and adding new directed edges from elements in S_{i-1} to elements in S_i

SASH construction: phase 2

  • Let SASH_0 be the root, S_0
  • For 1 ≤ i ≤ h, assume SASH_{i-1} exists, then
      - For each c in S_i, use SASH_{i-1} to find P possible parents of c in S_{i-1}
      - Once all c in S_i link to possible parents, each p in S_{i-1} links to the C closest children that chose it as a possible parent
      - If some orphan objects in S_i have no parents linking to them, repeat the above, allowing them to try to link to more parents
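
A minimal sketch of linking one level, on the real line. Two things here are assumptions made to keep the example short: candidate parents are found by brute force rather than by searching SASH_{i-1}, and orphans retry by doubling P. The name link_level is hypothetical, and the points are assumed to be distinct numbers.

```python
def link_level(parents, children, P=2, C=4):
    """Return {parent: [linked children]} for one level of points on the real line."""
    def dist(a, b):
        return abs(a - b)

    accepted = {p: [] for p in parents}
    orphans, p_try = list(children), P
    while orphans:
        # Each orphan requests its p_try nearest parents (brute force in this sketch).
        requests = {p: [] for p in parents}
        for c in orphans:
            nearest = sorted(parents, key=lambda q: dist(q, c))[:p_try]
            for p in nearest:
                requests[p].append(c)
        # Each parent accepts its closest requesters, never exceeding C children.
        for p, reqs in requests.items():
            room = C - len(accepted[p])
            accepted[p].extend(sorted(reqs, key=lambda c: dist(p, c))[:room])
        linked = {c for kids in accepted.values() for c in kids}
        orphans = [c for c in orphans if c not in linked]
        if orphans and p_try >= len(parents):
            raise ValueError("every parent is full; C * len(parents) is too small")
        p_try *= 2   # remaining orphans try twice as many parents next round
    return accepted

links = link_level([0.0, 10.0], [1.0, 2.0, 3.0, 4.0, 5.0, 9.0], P=1, C=4)
print(links)   # {0.0: [1.0, 2.0, 3.0, 4.0], 10.0: [9.0, 5.0]}
```

In this run the point 5.0 first asks only the parent 0.0, is rejected because 0.0 already has four closer children, and is then adopted by 10.0 on the retry.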

SASH parameters: P and C

  • In practice, P is small, and C is at least twice P (their experiments use C = 4P)
  • It is likely that objects will have at least one parent that links to them, and if C > 2P, all orphans can eventually find parents
  • Children link to “nearby” parents, and parents then link to “nearby” children
  • The symmetric use of “nearby” gives good results, even though the relation isn’t really symmetric


A Completed SASH


Example on the real line with P=2 and C=4


Randomly divide the set in half until reaching one point


The sets S_i


SASH Construction Example

Red nodes are in a completed SASH. Light blue nodes are in the process of being added to a SASH. Black nodes have not been processed.

Links from children to parents are green, and links from parents to children are red.

SASH_0: Construction, P=2, C=4

SASH_0: Complete

SASH_1: Construction, P=2, C=4

SASH_1: Link children to parents

SASH_1: Link parents to children

SASH_1: Complete

SASH_2: Construction

SASH_2: Link children to parents

SASH_2: Link parents to children

SASH_2: Complete

SASH_3: Construction

SASH_3: Link children to parents

SASH_3: Link parents to children

Some of the green arrows were not reversed, because parents only link to their C=4 closest children

The green arrows are not part of the completed SASH

SASH_3: Complete

SASH_4: Construction, P=2, C=4

SASH_4: Link children to parents

SASH_4: Link parents to children

The green links were not returned to the children

The three purple nodes are orphans

Link them by doubling P as needed

Orphans link to P=4 parents

Parents link to up to C=4 children

Two orphans were linked, and one remains

Link the final orphan to P=8 parents

Link parents to the orphan

The final green arrows are removed

SASH_4: Complete

What am I hiding from you about this algorithm?

  • For 1 ≤ i ≤ h, assume SASH_{i-1} exists, then
      - For each c in S_i, use SASH_{i-1} to find P possible parents of c in S_{i-1}
      - Once all c in S_i link to possible parents, each p in S_{i-1} links to the C closest children that chose it as a possible parent
      - If some orphan objects in S_i have no parents linking to them, repeat the above, allowing them to try to link to more parents

This part can be expensive

  • For each c in S_i, use SASH_{i-1} to find P possible parents of c in S_{i-1}

Cost of this operation

  • For each c in S_i, use SASH_{i-1} to find P possible parents of c in S_{i-1}
  • There are N/2 points in S_h and N/4 points in S_{h-1}, for N^2/8 checks if every pair is compared
  • Or we could build an index, like a quadtree, and do a k-NN search directly
  • This is expensive, and is the catch-22 of most k-NN algorithms
  • SASH uses an N log N method
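
As a rough check of why brute force does not help, the pairwise cost can be summed over all levels. The sketch below assumes the level sizes halve exactly, so that |S_{h-j}| = N / 2^{j+1}; real level sizes are only approximately of this form.

```latex
\sum_{j=0}^{h-1} |S_{h-j}|\,|S_{h-j-1}|
  \approx \sum_{j=0}^{h-1} \frac{N}{2^{j+1}} \cdot \frac{N}{2^{j+2}}
  = \frac{N^2}{8} \sum_{j=0}^{h-1} \frac{1}{4^{j}}
  < \frac{N^2}{8} \cdot \frac{4}{3}
  = \frac{N^2}{6}
```

The bottom level alone contributes the N^2/8 term, and the levels above it add only about a third more, so comparing every child with every parent stays quadratic in N.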

Avoiding k-NN search in SASH construction

  • Instead, perform a partial search query on the new point using the partially constructed SASH
      - Start with the root as the current set
      - While not at the bottom of the partial SASH, let the current set equal the P children of the current set that are closest to the new point
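
The descent described above can be sketched as follows. The representation is an assumption for illustration: the partial SASH is a dict named children_of that maps each node to its linked children, the points are numbers on the real line, and find_parents is a made-up name.

```python
def find_parents(query, root, children_of, depth, P=2):
    """Descend `depth` levels from the root, keeping the P closest nodes per level."""
    current = [root]
    for _ in range(depth):
        # Gather the children of every node kept at the previous level.
        candidates = {c for node in current for c in children_of.get(node, ())}
        if not candidates:
            break   # reached the bottom of the partial SASH early
        # Keep the P candidates closest to the query (ties broken by value).
        current = sorted(candidates, key=lambda c: (abs(c - query), c))[:P]
    return current

# A tiny hand-built partial SASH on the real line.
children_of = {8: [4, 12], 4: [2, 6], 12: [10, 14]}
print(find_parents(5, 8, children_of, depth=2))   # [6, 2]
```

Only the children of the nodes kept so far are examined at each level, so no separate k-NN index is needed to place the new point.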

Approximate parent search without a k-NN graph

Start at the root

Search children

Keep the 2 children closest to the query point

Search children

Keep the 2 children closest to the query point

Search children

Keep the 2 closest children to the query point

These are the approximate parents of the query point

Important points:

  • No k-NN index needed
  • Log N search time for each element
  • Up to P objects retained at each level, and each of those has up to C children
  • Only those PC children are searched at each level to find the P closest objects to send down to the next level

SASH Issues

  • When a large number of children are clustered near a few parents, some children will be orphaned and have parents that are farther away
  • A SASH is mostly static
      - Some new nodes can be added, but clusters need to be filtered up through the hierarchy during the construction process

Queries with a completed SASH

  • Similar to the process described above to get approximate parents
  • Two types of searches described
      - Uniform: keep the same number of children at each level
      - Geometric: start the search with a small number of nodes kept at each level, then increase it at deeper levels

Queries with a completed SASH

  • The big difference between constructing the SASH and using it for queries is that, in the construction process, only the nodes reached at the bottom level of the partial SASH are used
  • In a query on a completed SASH, all of the intermediate points visited can be used in the final k-ANN search

Geometric search

  • Keeping too few points near the root may lead to bad results
  • Instead of starting near 1, the authors found that keeping 0.5*PC nodes (4 in the case of P=2, C=4) at the smaller levels sufficed to keep the search broad enough
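
One way to picture a geometric schedule with this floor is sketched below. The growth rule k^(i/h) is an assumption chosen for illustration, not necessarily the authors' formula; the slides state only that the number kept starts small, grows toward the bottom, and should not drop below 0.5*PC. The name geometric_quotas is hypothetical.

```python
import math

def geometric_quotas(k, h, P=2, C=4):
    """Return a hypothetical per-level quota list k_1..k_h for a geometric search."""
    floor = max(1, (P * C) // 2)   # never keep fewer than 0.5*PC nodes at a level
    quotas = []
    for i in range(1, h + 1):
        # Assumed growth rule: k^(i/h), rounded first to absorb floating-point noise.
        grown = math.ceil(round(k ** (i / h), 9))
        quotas.append(max(floor, grown))
    return quotas

print(geometric_quotas(16, 4))   # [4, 4, 8, 16]
```

For k=16 over four levels, the floor of 4 applies near the root and the quota reaches k only at the bottom level.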

Search process

  • Let k_i be the number of elements we will keep at level i of the SASH
  • Let U_0 = S_0, the root
  • For 1 ≤ i ≤ h
      - Find all children of elements in U_{i-1}
      - Let U_i be the k_i children of U_{i-1} that are closest to the query point

Search process

  • After the sets U_0, …, U_h have been determined, let U = U_0 ∪ U_1 ∪ … ∪ U_h
  • Then the final result is the k closest points in U to the query point
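
The whole query can be sketched as below, under the same illustrative assumptions as before: the SASH is a dict named children_of from node to linked children, points are numbers on the real line, and quotas is the list k_1, …, k_h. The name sash_query is made up for this sketch.

```python
def sash_query(query, root, children_of, quotas, k):
    """Approximate k-NN: build U_0..U_h level by level, then rank their union."""
    def key(c):
        return (abs(c - query), c)   # distance to the query, ties broken by value

    level = [root]        # U_0 is the root
    visited = [root]      # running union U of all the U_i
    for k_i in quotas:
        # Children of everything kept at the previous level.
        candidates = {c for node in level for c in children_of.get(node, ())}
        if not candidates:
            break
        level = sorted(candidates, key=key)[:k_i]   # U_i: the k_i closest children
        visited.extend(level)
    # Final step: an exact k-NN search restricted to the visited set U.
    return sorted(visited, key=key)[:k]

children_of = {8: [4, 12], 4: [2, 6], 12: [10, 14]}
print(sash_query(5, 8, children_of, quotas=[2, 2], k=2))   # [4, 6]
```

Note that 4 is an interior node rather than a bottom-level one. It can be returned because a query, unlike parent search during construction, ranks every point visited on the way down.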

Search complexity

  • Each U_i has at most k elements, and each of those has at most C children, so we perform at most Ck distance calculations at each of the log N levels, for O(k log N) time when C is treated as a constant
  • Once U has been determined, we perform a true k-NN search on a set of size at most k log N


Use of transitivity when searching

We follow links from parents to children under the assumption that children are close to parents

We keep only the objects closest to the query at each level

This gives good results in practice, but may fail in pathological cases

Pathological example of failure of transitivity

  • Pathological case on the real line
  • Assume the rest of the SASH is to the left or the right of the chains shown (following the dotted arrows)
  • The query will return two of the nodes visited at the top, even though there are points closer to the query, Q

Pathological example of failure of transitivity when k=2

[Figure: nodes R, S, T, A, and B, with the query point Q]

A search for Q first finds S and T

T’s children are closer to Q than those of S

The search continues below T

R and S are returned as the k=2 nearest neighbors of Q

However, A and B are the true k=2 nearest neighbors of Q

SASH Comparison to MTree

  • MTree (Ciaccia, Patella, Zezula): deals with overlapping objects; uses a balanced hierarchy with buckets and spheres as regions
  • SASH-4: P=4, C=4P
  • MEDLINE: 1,055,073 objects with 1,101,003 attributes, representing keywords found in medical abstracts; an average of 75 nonzero attributes per object
  • SSeq: sequential search on a randomly selected subset of the data


Complexity Comparison


Speed vs. accuracy

Internal SASH Comparisons

  • BactORF: bacterial protein sequences; 385,039 objects with 40,000 attributes; sparse, with 125 nonzero attributes per object
  • VidFrame: video; 9,000,000 objects with 32 attributes, densely nonzero


SASH P=3,4,5,8,16; C=4P


Boosted SASH


Different dataset sizes

Conclusion

  • SASH indexes high-dimensional spaces
  • Efficient construction and query times
  • Uses approximate similarity, and a generalization of equivalence relations (symmetry and a weak form of transitivity), to get good results
  • Large body of work in fuzzy logic on transitivity and approximate similarity