High-dimensional indexing techniques
Kesheng John Wu, Ekow Otoo, Arie Shoshani
July, 2001
The big picture
[Diagram omitted; surviving labels: grid storage, MPI-IO file, Request Interpreter, dataset, Data mining, Distributed, Large]
The big picture
[Diagram omitted; surviving labels: Request interpreter, Logical request, Qualified objects, Request planning/execution, Execution services, grid, LBNL, PPDG, MPI-IO, …, Sub-task schedule]
Problem statement
• Main objective: map a logical request to qualified objects
— a logical request:
• 20001015 <= eventTime & 200 < energy < 300 …
— objects:
• set of object IDs;
• set of files containing the objects;
• offsets within the files, …
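The mapping from a logical request to qualified objects can be sketched without any index at all, as a brute-force filter. This is only a baseline illustration: the catalog, its values, and the function name `qualified_objects` are hypothetical, and only the attribute names and the predicate come from the slide.

```python
# Hypothetical in-memory catalog: object ID -> attribute values.
# The attribute names follow the slide's example; the values are made up.
objects = {
    0: {"eventTime": 20001014, "energy": 250.0},
    1: {"eventTime": 20001015, "energy": 250.0},
    2: {"eventTime": 20001020, "energy": 310.0},
    3: {"eventTime": 20001101, "energy": 299.5},
}


def qualified_objects(objects, predicates):
    """Return the sorted IDs of objects that satisfy every (attribute, test) pair."""
    return sorted(
        oid
        for oid, attrs in objects.items()
        if all(test(attrs[name]) for name, test in predicates)
    )


# 20001015 <= eventTime & 200 < energy < 300
request = [
    ("eventTime", lambda t: 20001015 <= t),
    ("energy", lambda e: 200 < e < 300),
]
hits = qualified_objects(objects, request)
```

Here `hits` contains the IDs of objects 1 and 3. A real system would go on to translate those IDs into files and offsets, and an index is needed precisely because this filter touches every object.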
Requirements & Status
• General requirements
— Users request data in terms of their scientific domain, not file names or offsets in files
— Each object may be described by hundreds of attributes
— Each request is in terms of range predicates on a handful of attributes (partial range query)
• Status
— Initially motivated by a HENP experiment: STAR
— Software originally developed under GC and is currently in use at BNL
Large high-dimensional datasets
• Number of attributes / columns: 200 – 500
• Number of objects / events: 10^8 – 10^9
• File containing one attribute: 400MB – 4GB
• Total size over all attributes: 80GB – 2TB
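The sizes quoted above are consistent with a simple back-of-the-envelope calculation for 10^8 to 10^9 objects, under the assumption (not stated on the slide) that each attribute value occupies 4 bytes. The helper `sizes` below is a hypothetical name used only for this check.

```python
BYTES_PER_VALUE = 4  # assumption: each attribute value is stored in 4 bytes


def sizes(num_objects, num_attributes, bytes_per_value=BYTES_PER_VALUE):
    """Return (bytes per single-attribute file, bytes over all attributes)."""
    per_attribute = num_objects * bytes_per_value
    return per_attribute, per_attribute * num_attributes


small = sizes(10**8, 200)   # low end of the quoted ranges
large = sizes(10**9, 500)   # high end of the quoted ranges
```

With decimal units, `small` corresponds to 400MB per attribute and 80GB in total, and `large` to 4GB per attribute and 2TB in total.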
[Table sketch omitted: one column per attribute (A1, A2, A3, A4, …) and one row per object, with object IDs 0, 1, 2, … running up to 10^8 – 10^9]
Goal: develop an index, so that:
• Read as little as possible from disk
• Minimize computation in memory
Curse of dimensionality
Well-known indexing methods
• B-tree based indices
— One or a small number of attributes
— Index size may be up to 3 times the data size
• R-tree based indices
— Small number of attributes, say, < 10
• UB-tree
— Use space filling curves to map high-dimensional data to one dimension
— One range query is mapped into many queries on the B-tree based index
• Even sequential scan
— Better than B-tree and R-tree if dimension > 10
— Simply reading all the data and comparing takes too long
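The UB-tree point, that one multi-dimensional range query turns into many one-dimensional queries, can be illustrated with a small sketch. It assumes the Z-order (Morton) curve as the space filling curve and works in two dimensions only; the function names are hypothetical.

```python
def z_order(x, y, bits=4):
    """Interleave the bits of x and y (x in the even positions) into one key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)
        key |= ((y >> i) & 1) << (2 * i + 1)
    return key


def key_intervals(x_lo, x_hi, y_lo, y_hi, bits=4):
    """Contiguous runs of Z-order keys covered by an inclusive 2-D box."""
    keys = sorted(
        z_order(x, y, bits)
        for x in range(x_lo, x_hi + 1)
        for y in range(y_lo, y_hi + 1)
    )
    runs = [[keys[0], keys[0]]]
    for k in keys[1:]:
        if k == runs[-1][1] + 1:
            runs[-1][1] = k          # extend the current run
        else:
            runs.append([k, k])      # start a new run
    return [tuple(r) for r in runs]


aligned = key_intervals(0, 1, 0, 1)     # box aligned with the curve
misaligned = key_intervals(1, 2, 1, 2)  # same-size box, shifted by one
```

The aligned 2x2 box maps to a single key interval, while the shifted box of the same size splits into four separate intervals, each of which would be a separate lookup in the one-dimensional index.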
Another class of indexes: Bitmap index
• Example queries on the attribute, say, A
• One-sided range query: A < 2
— b0 OR b1
• Two-sided range query: 2<A<5
— b3 OR b4
• Basic steps of building a bitmap index
— Binning
— Encoding
— Compressing
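Data values and their bitmaps (one position per object):
Data values: 0 1 5 3 1 2 0 4 1
b0 (=0): 1 0 0 0 0 0 1 0 0
b1 (=1): 0 1 0 0 1 0 0 0 1
b2 (=2): 0 0 0 0 0 1 0 0 0
b3 (=3): 0 0 0 1 0 0 0 0 0
b4 (=4): 0 0 0 0 0 0 0 1 0
b5 (=5): 0 0 1 0 0 0 0 0 0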
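The example on this slide can be reproduced with a few lines of code. This is a minimal sketch that stores each bitmap as a plain list of 0/1 values and skips binning and compression; the function names are hypothetical.

```python
def build_equality_bitmaps(values, cardinality):
    """bitmaps[v][i] is 1 exactly when values[i] == v."""
    return [[1 if x == v else 0 for x in values] for v in range(cardinality)]


def bitwise_or(*bitmaps):
    """Position-wise OR of equally long bitmaps."""
    return [int(any(bits)) for bits in zip(*bitmaps)]


values = [0, 1, 5, 3, 1, 2, 0, 4, 1]   # the data values from the slide
b = build_equality_bitmaps(values, 6)  # b[0] .. b[5]

one_sided = bitwise_or(b[0], b[1])     # A < 2
two_sided = bitwise_or(b[3], b[4])     # 2 < A < 5
```

Each position set to 1 in `one_sided` or `two_sided` identifies an object that satisfies the query, so the answer is obtained from the bitmaps alone, without reading the data values again.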
How many bins?
[Figure omitted: a 2-D grid of bins with a query rectangle spanning Range(x) and Range(y); the bins cut by the rectangle's boundary are marked as edge bins]
More bins, fewer objects in edge bins
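The trade-off behind the number of bins can be sketched in one dimension. Bins that lie entirely inside the query range are answered from the index alone, while objects in edge bins must be checked against the raw values. The sketch below is an illustration under assumptions: it uses sets of object IDs in place of bitmaps, the data values are made up, every value is assumed to fall inside the bin edges, and the function names are hypothetical.

```python
from bisect import bisect_right


def build_binned_index(values, edges):
    """One set of object IDs per bin; bin k holds edges[k] <= v < edges[k + 1]."""
    bins = [set() for _ in range(len(edges) - 1)]
    for oid, v in enumerate(values):
        bins[bisect_right(edges, v) - 1].add(oid)
    return bins


def range_query(values, edges, bins, lo, hi):
    """IDs with lo <= value < hi, plus the number of raw values checked."""
    hits, checked = set(), 0
    for k, members in enumerate(bins):
        b_lo, b_hi = edges[k], edges[k + 1]
        if lo <= b_lo and b_hi <= hi:      # bin fully inside the range
            hits |= members
        elif b_lo < hi and lo < b_hi:      # edge bin: partial overlap
            checked += len(members)
            hits |= {oid for oid in members if lo <= values[oid] < hi}
    return sorted(hits), checked


values = [5, 12, 27, 33, 48, 51, 64, 79, 85, 96]   # made-up data
coarse = [0, 50, 100]                              # 2 bins
fine = list(range(0, 101, 10))                     # 10 bins

coarse_hits, coarse_checked = range_query(
    values, coarse, build_binned_index(values, coarse), 25, 65)
fine_hits, fine_checked = range_query(
    values, fine, build_binned_index(values, fine), 25, 65)
```

Both indexes return the same objects, but the 2-bin index has to check all 10 raw values while the 10-bin index checks only 2. The price of the finer binning is a larger index, since there is one bitmap per bin.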
Advantages of bitmap indices
• Fast operations
— The most common operations are the bitwise logical operations
— They are well supported by hardware
• Easy to compress, potentially small index size
• Each individual bitmap is small and frequently used ones can be cached in memory
• Efficient for read-mostly data: data produced from scientific experiments can be appended in large groups
• Available in most major commercial DBMS
Why our own bitmap index
• Early tests showed that we can do an order of magnitude better than ORACLE (using equality encoding)
• Vertical partition: allows one to only read data of the attributes involved in a query
• New compression method
— Best known: Byte-aligned Bitmap Code (BBC)
— Developed 2 Word-Aligned Schemes: WAH, WBC
• Different encoding schemes under compression
— Equality encoding – used in ORACLE and others
— Range encoding – one-sided range queries
— Interval encoding – two-sided range queries
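The difference between equality and range encoding can be sketched as follows. The sketch assumes the usual definitions (equality bitmap v marks A == v, range bitmap v marks A <= v), stores each bitmap as a Python integer used as a bitset, and leaves out interval encoding and compression. All function names are hypothetical.

```python
def equality_encode(values, cardinality):
    """E[v] marks the objects with value == v (bit i stands for object i)."""
    return [sum(1 << i for i, x in enumerate(values) if x == v)
            for v in range(cardinality)]


def range_encode(values, cardinality):
    """R[v] marks the objects with value <= v."""
    return [sum(1 << i for i, x in enumerate(values) if x <= v)
            for v in range(cardinality)]


def at_most_equality(E, v):
    """A <= v with equality encoding: OR together v + 1 bitmaps."""
    result = 0
    for bitmap in E[: v + 1]:
        result |= bitmap
    return result


def at_most_range(R, v):
    """A <= v with range encoding: read a single bitmap."""
    return R[v]


def between_range(R, lo, hi):
    """lo < A <= hi with range encoding: combine two bitmaps."""
    return R[hi] & ~R[lo]


values = [0, 1, 5, 3, 1, 2, 0, 4, 1]
E = equality_encode(values, 6)
R = range_encode(values, 6)
```

For the one-sided query A <= 3, equality encoding combines four bitmaps while range encoding reads one, which is why range encoding suits one-sided range queries. A two-sided query needs two range-encoded bitmaps regardless of how wide the range is.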
Information about the test machines
• Hardware and system
— Sun Enterprise 450 (UltraSPARC II 400MHz)
— 4GB RAM
— VERITAS Volume Manager (striped disk)
• Real application data from STAR
— More than 2 million objects
— Picked 12 attributes with varying distributions
• Measures:
— Logical operation time without IO
— Logical operation time with IO
— Query processing time
New compression schemes
• Overall, use about 50% more space than BBC
• On average, 12 times faster than BBC
• Faster than the uncompressed in more cases:
— New schemes are faster than the uncompressed scheme when the compression ratios are less than 0.3
— BBC is faster than the uncompressed when the compression ratios are less than 0.03
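The slides do not give the actual WAH or WBC formats, so the following is only a simplified sketch of the word-aligned idea: the bitmap is cut into 31-bit groups, a group with mixed bits is stored verbatim in one 32-bit literal word, and consecutive groups of identical bits are collapsed into one fill word that carries a run count. The word layout (flag bit, fill bit, 30-bit count) is an assumption made for illustration.

```python
GROUP = 31                  # payload bits per 32-bit word (assumed layout)
FILL_FLAG = 1 << 31         # top bit set: this word is a fill word
FILL_ONES = 1 << 30         # second bit: the bit value being repeated
MAX_RUN = (1 << 30) - 1     # largest group count a fill word can hold


def compress(bits):
    """Encode a list of 0/1 bits as 32-bit literal and fill words."""
    padded = bits + [0] * (-len(bits) % GROUP)
    words = []
    for start in range(0, len(padded), GROUP):
        group = padded[start:start + GROUP]
        if all(b == group[0] for b in group):
            flag = FILL_FLAG | (FILL_ONES if group[0] else 0)
            last = words[-1] if words else 0
            # Extend the previous word if it is a fill with the same bit value.
            if (last & ~MAX_RUN) == flag and (last & MAX_RUN) < MAX_RUN:
                words[-1] = last + 1
            else:
                words.append(flag | 1)
        else:
            literal = 0
            for b in group:
                literal = (literal << 1) | b
            words.append(literal)
    return words


def decompress(words, nbits):
    """Expand the words back into a list of nbits bits."""
    bits = []
    for w in words:
        if w & FILL_FLAG:
            bits.extend([1 if w & FILL_ONES else 0] * (GROUP * (w & MAX_RUN)))
        else:
            bits.extend((w >> i) & 1 for i in range(GROUP - 1, -1, -1))
    return bits[:nbits]


# 10 groups of zeros, one mixed group, 2 groups of ones: 13 groups in total.
bits = [0] * 310 + [1] + [0] * 30 + [1] * 62
words = compress(bits)
```

In this example 13 groups shrink to 3 words. Because every word is either a whole literal group or a whole number of groups, logical operations can proceed a word at a time without bit-level shifting, which is one plausible reason a word-aligned scheme trades some space for speed compared with a byte-aligned one.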
Sizes of bitmap indices
Conclusion:
- Equality encoding is the most space efficient
- Compression gain is at least a factor of 2.5
Average query processing time
Conclusion:
- Interval and range encoding are the best
- For these cases, there is practically no penalty to compression
Summary
• Better compression scheme
— 50% more space, but 10-12 times faster!
• Among the different encoding schemes
— interval encoding is better than equality encoding and range encoding
• The number of bins determines bitmap index size and operation efficiency. For example:
— index at 10% of data size => 3x the speed of sequential scan
— index at 20% of data size => 6x the speed of sequential scan
• Equality encoding is currently used in the STAR experiment. The next version will include interval encoding.
Future work
• Support NULL values and categorical values
• On-line update: add new data and update index without interrupting request processing
• Recovery mechanism for robustness
• Potential new applications: climate, astrophysics, biology
• Study different non-uniform binning strategies
• Integrate with a conventional database system: to better handle metadata and to provide a more versatile front end