Multidimensional grid technique - Vilniaus universitetasalgis/dsax/DDS-7-Grid.pdf · the record...
Grid file
• Growing use of databases and integrated information systems
• File structures => efficient access to records:
• combine attribute values (multikeys)
• traditional file structures that provide multikey access to records are extensions of single-key access.
• they manifest various deficiencies in particular for multikey access to highly dynamic files
Traditional single-key access
• One dimensional case – hashing:
Hashing
Hashing Procedures
Let us denote the set of all possible key values (i.e., the universe of keys) used in a dictionary application by U. Suppose an application requires a dictionary in which elements are assigned keys from the set of small natural numbers. That is, U ⊆ Z+ and |U| is relatively small. If no two elements have the same key, then this dictionary can be implemented by storing its elements in the array T[0, … , |U| - 1]. This implementation is referred to as a direct-access table since each of the requisite DICTIONARY ADT operations - Search, Insert, and Delete - can always be performed in Θ(1) time by using a given key value to index directly into T, as shown:
The obvious shortcoming associated with direct-access tables is that the set U rarely has such "nice" properties. In practice, |U| can be quite large. This will lead to wasted memory if the number of elements actually stored in the table is small relative to |U|. Furthermore, it may be difficult to ensure that all keys are unique. Finally, a specific application may require that the key values be real numbers, or some symbols which cannot be used directly to index into the table. An effective alternative to direct-access tables are hash tables. A hash table is a sequentially mapped data structure that is similar to a direct-access table in that both attempt to make use of the random-access capability afforded by sequential mapping. However, instead of using a key value to directly index into the hash table, the index is computed from the key value using a hash function, which we will denote using h. This situation is depicted as follows:
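The contrast between the two tables can be sketched in a few lines of Python. This is only an illustration: the class names, the table size m = 8, the use of Python's built-in hash as h, and collision handling by chaining are all assumptions, since the text above does not say how collisions are resolved.

```python
class DirectAccessTable:
    """Direct-access table: the key itself is the index into T."""

    def __init__(self, universe_size):
        self.T = [None] * universe_size  # one slot per possible key

    def insert(self, key, value):
        self.T[key] = value

    def search(self, key):
        return self.T[key]

    def delete(self, key):
        self.T[key] = None


class HashTable:
    """Hash table: the index is computed as h(key).

    Chaining is an assumption made for this sketch.
    """

    def __init__(self, m=8):
        self.m = m
        self.T = [[] for _ in range(m)]  # one chain per slot

    def h(self, key):
        return hash(key) % self.m

    def insert(self, key, value):
        chain = self.T[self.h(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:  # overwrite an existing key
                chain[i] = (key, value)
                return
        chain.append((key, value))

    def search(self, key):
        for k, v in self.T[self.h(key)]:
            if k == key:
                return v
        return None

    def delete(self, key):
        idx = self.h(key)
        self.T[idx] = [(k, v) for k, v in self.T[idx] if k != key]
```

The direct-access table needs one slot for every key in U, while the hash table accepts keys such as strings or real numbers and only needs m slots.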
Grid file
• The grid file
• based on dynamic hashing for multi-attribute data
• two basic structures: k linear scales + k-dimensional directory
• grid directory: k-dimensional array
• a data page is allowed to store objects from several grid cells as long as the union of these cells forms a rectangle, the storage region
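A minimal Python sketch of the two structures for k = 2 follows. The scale boundaries, the bucket labels "A" to "D" and the function names are hypothetical, chosen only to show how a point is mapped through the linear scales to a directory entry.

```python
from bisect import bisect_right

# Linear scales: interior boundaries per dimension, in ascending order.
# A scale with n boundaries defines n + 1 intervals.
scales = [
    [30, 60],  # x axis is cut at 30 and 60 (assumed values)
    [50],      # y axis is cut at 50 (assumed value)
]

# Grid directory: directory[ix][iy] holds the id of the bucket (data page)
# that stores the records of grid cell (ix, iy). Cells (0, 0) and (0, 1)
# share bucket "A": their union is a rectangle, i.e. a storage region.
directory = [
    ["A", "A"],
    ["B", "C"],
    ["D", "D"],
]


def cell_index(scale, value):
    """Index of the interval of `scale` that contains `value`."""
    return bisect_right(scale, value)


def find_bucket(point):
    """Map a 2-d point to its bucket id via the scales and the directory."""
    ix = cell_index(scales[0], point[0])
    iy = cell_index(scales[1], point[1])
    return directory[ix][iy]
```

For example, `find_bucket((45, 70))` looks up interval 1 on x and interval 1 on y and returns "C".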
Grid file
• Grid partition of the search space:
• problem: spatial queries in k-d point sets
• main idea: try to generalize hashing to k-d
size of this bitmap is impossibly large. Since this bitmap contains a lot of zeros, it can be
compressed. Here we need a compression scheme that is compatible with the operations
executed on a file: FIND, INSERT and DELETE must be executed efficiently.
2.2 Grid Partition of the Search Space
2.2.1 Grid Blocks
The partitions are obtained by dividing the domain of each attribute into intervals.
Example for the two-dimensional case (generalization to k dimensions is obvious):
Figure 1: Grid Partition of the Search Space
As seen in Figure 1, on the record space S = X x Y we obtain a grid partition P = U x V
by imposing intervals U = (u0, u1, u2, u3), V = (v0, v1, v2, v3) on each axis and dividing
the record space into blocks, which we call grid blocks. With grid partitions, each
boundary cuts the entire search space into two. All dimensions are treated symmetrically.
A file structure allocates storage in units of fixed size, called disk blocks, pages or
buckets, depending on the level of description. A storage unit that contains records is
called a bucket. A bucket has a capacity c, which is the number of records it can contain.
2.2.2 Partition Modification
The grid partition is dynamic and can be modified. The grid partition P = U x V is
modified by altering only one of its components at a time. A one-dimensional partition is
modified by splitting one of its intervals, or by merging two adjacent intervals into one.
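The two modifications can be sketched as follows for the x component of a two-dimensional partition. This is a simplified illustration, not the procedure of the original paper: the function names are hypothetical, the directory is assumed to be a list of rows indexed by x interval, and the redistribution of records between buckets after a split is left out.

```python
from bisect import bisect_right, insort


def split_x(scale_x, directory, boundary):
    """Split the x interval that contains `boundary` into two.

    The directory row of the split interval is duplicated, so both new
    intervals initially point to the same buckets.
    """
    if boundary in scale_x:
        raise ValueError("boundary already exists")
    i = bisect_right(scale_x, boundary)  # index of the interval to split
    insort(scale_x, boundary)
    directory.insert(i + 1, list(directory[i]))


def merge_x(scale_x, directory, i):
    """Merge adjacent x intervals i and i + 1 by removing their boundary.

    In this sketch the merge is only allowed when both rows already point
    to the same buckets.
    """
    if directory[i] != directory[i + 1]:
        raise ValueError("rows differ; buckets must be merged first")
    del scale_x[i]
    del directory[i + 1]
```

Only one component of the partition changes per call, which matches the rule that the grid partition is modified one dimension at a time.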
[Figure 1 axis labels: X with boundaries u0, u1, u2, u3; Y with boundaries v0, v1, v2, v3]
Initial approach of locational data
(Handbook of Data Structures and Applications, p. 16-4)
FIGURE 16.1: Uniform-grid representation corresponding to a set of points with a search radius of 20.
[Figure: x and y axes running from (0,0) to (100,100); points: Chicago (35,42), Mobile (52,10), Toronto (62,77), Buffalo (82,65), Memphis (45,30), Omaha (27,35), Atlanta (85,15), Miami (90,5)]
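A uniform grid over the points of Figure 16.1 can be sketched in Python as below. The cell width of 20 is taken from the search radius in the caption; the function names and the dictionary-based grid are assumptions made for this illustration.

```python
from collections import defaultdict

# Points of Figure 16.1.
CITIES = {
    "Chicago": (35, 42), "Mobile": (52, 10), "Toronto": (62, 77),
    "Buffalo": (82, 65), "Memphis": (45, 30), "Omaha": (27, 35),
    "Atlanta": (85, 15), "Miami": (90, 5),
}
CELL = 20  # cell width, equal to the search radius


def build_grid(points, cell=CELL):
    """Assign every point to the grid cell that contains it."""
    grid = defaultdict(list)
    for name, (x, y) in points.items():
        grid[(x // cell, y // cell)].append(name)
    return grid


def within_radius(grid, points, q, r, cell=CELL):
    """Names of all points within distance r of query point q.

    Only cells that overlap the bounding square of the query circle are
    inspected.
    """
    qx, qy = q
    found = []
    for cx in range(int((qx - r) // cell), int((qx + r) // cell) + 1):
        for cy in range(int((qy - r) // cell), int((qy + r) // cell) + 1):
            for name in grid.get((cx, cy), []):
                x, y = points[name]
                if (x - qx) ** 2 + (y - qy) ** 2 <= r * r:
                    found.append(name)
    return sorted(found)


grid = build_grid(CITIES)
print(within_radius(grid, CITIES, (80, 20), 20))  # ['Atlanta', 'Miami']
```

Because the cell width equals the search radius, a query of radius 20 never has to look at more than three cells along each axis.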
scales are usually implemented as one-dimensional trees containing ranges of values. The array access structure is fine as long as the data is static. When the data is dynamic, it is likely that some of the grid cells become too full while other grid cells are empty. This means that we need to rebuild the grid (i.e., further partition the grid or reposition the grid partition lines or hyperplanes) so that the various grid cells are not too full. However, this creates many more empty grid cells as a result of repartitioning the grid (i.e., empty grid cells are split into more empty grid cells). The number of empty grid cells can be reduced by merging spatially-adjacent empty grid cells into larger empty grid cells, while splitting grid cells that are too full, thereby making the grid adaptive. The result is that we can no longer make use of an array access structure to retrieve the grid cell that contains query point p. Instead, we make use of a tree access structure in the form of a k-ary tree where k is usually 2^d. Thus what we have done is marry a k-ary tree with the fixed-grid method. This is the basis of the point quadtree [22] and the PR quadtree [56, 63] which are multidimensional generalizations of binary trees.
The difference between the point quadtree and the PR quadtree is the same as the difference between trees and tries [25], respectively. The binary search tree [45] is an example of the former since the boundaries of different regions in the search space are determined by the data being stored. Address computation methods such as radix searching [45] (also known as digital searching) are examples of the latter, since region boundaries are chosen from among locations that are fixed regardless of the content of the data set. The process is usually a recursive halving process in one dimension, recursive quartering in two dimensions, etc., and is known as regular decomposition.
In two dimensions, a point quadtree is just a two-dimensional binary search tree. The first point that is inserted serves as the root, while the second point is inserted into the relevant quadrant of the tree rooted at the first point. Clearly, the shape of the tree depends on the order in which the points were inserted. For example, Figure 16.2 is the point quadtree corresponding to the data of Figure 16.1 inserted in the order Chicago, Mobile, Toronto, Buffalo, Memphis, Omaha, Atlanta, and Miami.
© 2005 by Chapman & Hall/CRC
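The insertion rule for a point quadtree can be sketched as follows, using the points of Figure 16.1 in the insertion order given in the text. The class and function names, the quadrant labels and the rule that ties go to the east and north are assumptions made for this sketch.

```python
class Node:
    def __init__(self, name, x, y):
        self.name, self.x, self.y = name, x, y
        self.children = {}  # quadrant label ("NE", "NW", "SE", "SW") -> Node


def quadrant(node, x, y):
    """Quadrant of `node` that contains (x, y); ties go east/north (assumed)."""
    ns = "N" if y >= node.y else "S"
    ew = "E" if x >= node.x else "W"
    return ns + ew


def insert(root, name, x, y):
    """Insert a point and return the root of the tree."""
    if root is None:
        return Node(name, x, y)
    node = root
    while True:
        q = quadrant(node, x, y)
        if q not in node.children:
            node.children[q] = Node(name, x, y)
            return root
        node = node.children[q]


ORDER = [
    ("Chicago", 35, 42), ("Mobile", 52, 10), ("Toronto", 62, 77),
    ("Buffalo", 82, 65), ("Memphis", 45, 30), ("Omaha", 27, 35),
    ("Atlanta", 85, 15), ("Miami", 90, 5),
]

root = None
for name, x, y in ORDER:
    root = insert(root, name, x, y)
```

With this order Chicago becomes the root, with Toronto, Mobile and Omaha as its children; inserting the same points in a different order gives a differently shaped tree.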
Grid file
• Special kind of hashing
• Adaptable: w.r.t. insert/delete
• Efficient query handling
• Dynamic: access time is uniform (two-disk-access principle)
• Symmetric: no secondary key; every key is the primary key
• Multikey: access to records using a subset of keys
• useful for range queries that map into a set of cells corresponding to a group of values along the linear scales
• can be applied to any number of search keys:
• n search keys => n dimensions
• performs well in terms of reduced time for multiple-key access
Grid file
• Allocates storage in units of fixed size
• Disk blocks/pages/buckets
• How to map grid blocks to buckets?
• Use the grid directory
• Two-disk-access: retrieve a single record in at most 2 disk accesses
• Access directory (grid)
• Access bucket (database)
• Efficient range queries
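A range query over such a structure can be sketched in Python as below. The scales, directory, bucket contents and function name are hypothetical; a real grid file keeps the directory and the buckets on disk, while this sketch keeps everything in memory to show the lookup logic only.

```python
from bisect import bisect_right
from itertools import product

# Hypothetical 2-d grid file contents.
scales = [[30, 60], [50]]                          # boundaries on x and y
directory = [["A", "A"], ["B", "C"], ["D", "D"]]   # directory[ix][iy] -> bucket
buckets = {
    "A": [(10, 20), (20, 80)],
    "B": [(40, 10)],
    "C": [(45, 70)],
    "D": [(70, 30), (90, 90)],
}


def range_query(lo, hi):
    """Return (matching points, bucket ids read) for the box lo..hi."""
    # Interval indexes overlapped by the query box, per dimension.
    ranges = [
        range(bisect_right(s, l), bisect_right(s, h) + 1)
        for s, l, h in zip(scales, lo, hi)
    ]
    # Each bucket is read once, even if it covers several grid cells.
    bucket_ids = {directory[ix][iy] for ix, iy in product(*ranges)}
    hits = [
        p
        for b in sorted(bucket_ids)
        for p in buckets[b]
        if all(l <= c <= h for c, l, h in zip(p, lo, hi))
    ]
    return hits, sorted(bucket_ids)


print(range_query((35, 0), (75, 40)))  # ([(40, 10), (70, 30)], ['B', 'D'])
```

An exact-match query is the special case lo == hi: one directory lookup and one bucket read, which is the two-disk-access principle.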
Grid file
Next in each direction
• Nextxabove: cx = (cx+1) mod nx
• Nextxbelow: cx = (cx-1) mod nx
• Nextyabove: cy = (cy+1) mod ny
• Nextybelow: cy = (cy-1) mod ny
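These four operations reduce to two helper functions applied to the x or y index. A minimal Python sketch, with hypothetical function names, where c is the current interval index and n the number of intervals in that dimension:

```python
def next_above(c, n):
    """Index of the next interval above c, wrapping around at n."""
    return (c + 1) % n


def next_below(c, n):
    """Index of the next interval below c, wrapping around at 0."""
    return (c - 1) % n


nx = 4
print(next_above(3, nx))  # 0
print(next_below(0, nx))  # 3
```

In Python the % operator returns a non-negative result for a positive modulus, so (c - 1) % n wraps correctly; in languages where the remainder can be negative, (c - 1 + n) % n is the safer form.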
Advantages
• No special computations are required
• Only the right records are retrieved
• Can also be used for single search key queries
• Easy to extend to queries on n search keys
• Significant improvement in processing time for multiple-key queries
• Has a two-disk-access upper bound for accessing data
• Allows simpler concurrency control protocols
Grid files - disadvantages
• #1: problems in high-d: directory splits can be expensive
• #2: even in low-d, suffers on correlated attributes
Grid files - disadvantages
• #3: how about region data?
• if we ‘cut’ them, then we have O(volume) pieces (while z-ordering: O(surface))
• what to do?
• Translation to 2k-d points! (clever, BUT, still has subtle problems)
• E.g., 1-d ‘regions’
[Figure: 1-d regions A, B, C on the interval from 0 to 1 (ticks at 0, ¼, ½, ¾, 1), and the same regions plotted as 2-d points with axes x-start and x-end]
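The translation for the 1-d case can be sketched as follows: each interval becomes the 2-d point (x-start, x-end), and an intersection query becomes a condition on those two coordinates. The endpoints chosen for A, B and C are hypothetical, since the figure's values are not recoverable, and the function names are illustrative.

```python
# Hypothetical 1-d regions (intervals) on [0, 1].
REGIONS = {
    "A": (0.0, 0.25),
    "B": (0.5, 0.75),
    "C": (0.25, 1.0),
}


def to_point(interval):
    """Map a 1-d region to a 2-d point (x_start, x_end)."""
    start, end = interval
    return (start, end)


def intersecting(points, qs, qe):
    """Names of regions that intersect the query interval [qs, qe].

    In the transformed space this is the set of points with
    x_start <= qe and x_end >= qs.
    """
    return sorted(
        name for name, (s, e) in points.items() if s <= qe and e >= qs
    )


points = {name: to_point(iv) for name, iv in REGIONS.items()}
print(intersecting(points, 0.3, 0.4))  # ['C']
```

Every transformed point satisfies x-end >= x-start, so all points fall on or above the diagonal of the (x-start, x-end) plane.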
Disadvantages
• dimensionality curse; large query regions
• imposes space overhead
• performance overhead on insertion and deletion
• frequent reorganization of the file adds to the maintenance cost
Bang file
• A BANG file (balanced and nested grid file) is a point access method which divides space into a nonperiodic grid.
• Each spatial dimension is divided by a linear hash.
• Cells may intersect, and points may be distributed between them.
Bang file
• It organizes the value space surrounding the data values, instead of comparing the data values themselves.
• Its tree-structured directory partitions the data space into block regions with successive binary divisions on dimensions.
• The clustering algorithm identifies densely populated regions as cluster centers and expands those with neighboring blocks.
Twin grid file
Two-Level Grid File
Twin Grid File
A given set of points can be distributed among two grid files in such a way that storage space utilization is optimal. The optimal twin grid file can be built practically as fast as a standard grid file, i.e. the storage space optimality is obtained at almost no extra cost.
Twin grid file
The comparison covers the performance of the standard grid file, the optimal static twin grid file, and an efficient dynamic twin grid file, in which insertions and deletions trigger the redistribution of points among the two grid files.
Twin grid files utilize storage space at roughly 90%, as compared with the 69% of the standard grid file.
Typical range queries - the most important spatial search operations - can be answered in twin grid files at least as fast as in the standard grid file.
Buddy tree
The buddy tree is a dynamic hashing scheme with a tree-like directory. The universe is cut recursively into two parts of equal size with iso-oriented hyperplanes, and each interior node corresponds to a partition together with an interval. The interval corresponds to the minimum bounding box (MBB) covering the points below the given node. Also:
• Each directory node contains at least two entries;
• Whenever a node is split, the MBBs and subnodes are recomputed to fit the new situation;
• Except for the root of the directory, there is exactly one pointer referring to each directory page.
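The two ingredients named above, halving a region with an iso-oriented hyperplane and recomputing the MBB of each part, can be sketched in Python. This is not the buddy tree algorithm itself: the function names are hypothetical, and the choice of split axis, the page capacities and the directory maintenance are left out.

```python
def mbb(points):
    """Minimum bounding box of a non-empty list of points, as (low, high)."""
    dims = range(len(points[0]))
    low = tuple(min(p[d] for p in points) for d in dims)
    high = tuple(max(p[d] for p in points) for d in dims)
    return low, high


def halve(points, cell_low, cell_high, axis):
    """Cut the region cell_low..cell_high into two equal halves along `axis`.

    Returns a list of (points, mbb) pairs, one per non-empty half. The MBB
    is recomputed from the points, so empty space is not covered.
    """
    mid = (cell_low[axis] + cell_high[axis]) / 2
    lower = [p for p in points if p[axis] < mid]
    upper = [p for p in points if p[axis] >= mid]
    return [(part, mbb(part)) for part in (lower, upper) if part]


pts = [(1, 1), (2, 3), (7, 2), (8, 8)]
for part, box in halve(pts, (0, 0), (10, 10), axis=0):
    print(part, box)
```

After the cut at x = 5, the two boxes are (1, 1)..(2, 3) and (7, 2)..(8, 8); the strip between them holds no points and is not covered by either box.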
Buddy tree
The buddy-tree organizes data using a tree-based directory where each axis is treated equally. In contrast to the K-D-B-tree [Rob81] (one of the first multidimensional trees), the buddy-tree performs well in a highly dynamic environment, i.e. insertions, deletions and a change of the data distribution do not affect performance. This property is achieved by applying a modified version of the so-called buddy-system, which is well-known from the grid file [NHS84], to the buddy-tree. Additionally, the performance of the buddy-tree is almost independent of the sequence of insertions, which is an essential drawback of previous tree-structures, like the K-D-B-tree or hB-tree [LS89].
Another important feature of the buddy-tree is that it does not partition empty data space. Therefore queries, such as partial match queries, where the query region intersects with empty data space, can be performed much faster than by conventional structures partitioning the complete data space. This property is very similar to the variants of the R-tree, originally designed for spatial data. Contrary to the R-tree, the buddy-tree does not allow overlap in the directory nodes and can therefore guarantee that insertions, deletions and exact match queries are restricted to one path of the directory. Additionally, we incorporate an implementation technique in the buddy-tree which increases the fan out of the directory nodes (see section 4).
The following catalogue summarizes the design properties of the buddy-tree:
• empty data space is not partitioned
• insertion and deletion of a record is restricted to exactly one path
• no overflow pages
• directory grows linear in the number of records
• performance is basically independent of the sequence of insertions
• efficient behavior for insertions and deletions
• very high fan out of the directory nodes
With the following example we intend to visualize the basic properties of the buddy-tree:
Let the dimension be d = 2, the capacity of a directory page be c = 5 and the capacity of a data page be b = 4. Then the following snapshots depict the growth of the buddy-tree starting with the empty file. In the data pages the actual points are stored. Minimum bounding rectangles of at most 4 points are represented in the directory pages indicated by a light fill pattern. The white area corresponds to empty data space which is not managed by the buddy-tree (important design property). The first line in our example shows states of the buddy-tree with an overflowing data page depicted by a dark fill pattern. In the second line the corresponding subsequent state after the page split is depicted. The rightmost overflow of a data page implies an overflow of the one and only directory page resulting in a buddy-tree of height two.