Specialized indexing for NoSQL Databases like Accumulo and HBase
-
Upload
jim-klucar -
Category
Data & Analytics
-
view
214 -
download
1
Transcript of Specialized indexing for NoSQL Databases like Accumulo and HBase
Providing Practical Software Solutions
1
Specialized Indexing
Jim Klucar
Accumulo Meetup
10/16/2012
Accumulo Sorts Key/Value Pairs
Data is retrieved by iterating over ranges of keys
Integers: 1, 11, 2
Floats: 2.3e4, 2.3e-4
Multi-dimensional data (-76.76, 39.12)??
Overview
2
Zero-Pad Integers to maintain sort order
001
002
003
Reverse sort order by subtracting from
larger number before using integer
(1000 - 003) = 997
(1000 - 002) = 998
(1000 - 001) = 999
Integer Indexing
3
Accumulo uses reverse technique on timestamp to have most recent keys sort first
Binary Representation
[sign] [exponent] [mantissa]
If sign bit is set, Flip all bits
If sign bit isn’t set, flip just sign bit
Flipping bits is safely done with XOR
If sign bit is set, XOR with 0xFFFFFFFF
If sign not set, XOR with 0x80000000
IEEE-754 Floating Point Indexing
4
Note: Exponent is biased, not 2’s complement
Same procedure can be performed for base 10
numbers with caveats:
Adjust mantissa and exponent so decimal point is at the
same location (32.4e010 -> 3.24e009)
Apply same logic to exponent
Move both signs to first characters
Flipping a bit is subtracting digit from 9
-3.24e-9 becomes ++6.7500000e990
Be sure to print enough digits
Single (32-bit) 8 decimal digits, 3 exponent
Double(64-bit) 15 decimal digits, 4 exponent
Quad (128-bit) 35 decimal digits, 5 exponent
But Binary is so 1970’s…
5
Sorted Key/Value tables are 1 dimensional
Keys are either greater than or less than each
other
Need a method to map higher dimensional
data (eg 2-D,Lon/Lat) to a lower dimension
Maintain locality of higher dimensional
data in lower dimensional space
Maintain 2D query performance
The Problem
7
Concatenate Latitude and Longitude
Bad Idea! Must search entire range of one
dimension to find results
Solution 1
8
Solution 2: Z-Ordering
9
http://en.wikipedia.org/wiki/Z-order_curve
1 2
3 4
1 2
3 4
5 6
7 8
Easiest of Space-filling curves
Implemented by interleaving digits
from alternating dimensions
Lon -76.8 Lat 35.4
-+736584
Range Search with Z-order
10
Must filter out results that aren’t actually inside range ( Accumulo Iterator! )
Avoids full table scan / one dimension scan
Find z-ordering of two corners of search box. Scan from 1 to the other.
Z-Order on Lon / Lat
11
How Many Digits of Latitude?
12
39.123092 690 Miles
69 Miles
6.9 Miles
1200 Yards
120 Yards
36 Feet
3.6 Feet
4 inches
Developed for http://geohash.org
Concatenating Lon/Lat in a URL is patented!
A geohash defines an area from successive
bisections of areas of the earth.
Longitude Example:
Solution 3: Geohash
13
Longitude Geohash Example
14 Bisect the earth, choose which side of bisection point lies on,
use that bit as first bit for longitude geocode ( 1 in this case )
1 0
Longitude Geohash Example
15 Bisect section containing point, rinse repeat.
Code is now: 10
1 0
Longitude Geohash Example
16 Bisect section containing point, rinse repeat.
Code is now: 101
1 0
Perform same operation with latitude.
This defines a box around the point to
arbitrary precision.
Point of interest is in a box, not in the
center of the box
Bits are interleaved Longitude first, then
base-32 encoded
( 57.64911,10.40744) = u4ruydqqvj
Final geocode
17
See Wikipedia for worked example
http://en.wikipedia.org/wiki/Geohash
Google Earth KML Demo
http://api.prezzibenzina.it/geohash.kml
Google Earth Visualization
18
Range Search Can Be The Same
19
Performance difference varies by location and search area
Can calculate geohashes of surrounding
areas of a point deterministically
Search those areas independently
Range / Area Search Alternative
20
dqcqzkd
dqcrp dqcqn
dqcqy dqcwb
dqcqx
dqcx0
dqcw8 dqcqw
Point: dqcqzkd
Find parent geohash: dqc
Find neighbor hashes:
dqcqn,dqcrp, etc
Get search ranges
dqcqn00 - dqcqnzz
dqcrp00 – dqcrpzz
…etc
Crank up a batch scanner
Search Expands From Center
Outward
21
Geohashes are areas, z-order lon/lat are
points
Geohash is in public domain
Geohash can more easily proximity search
Key Differences
22
Watch border regions in range searches
Across equator
Across prime meridian
On top of poles
Across base-32 encoding edges
What to watch
23