Bitmap Indices for Speeding Up End User Physics Analysis

Bitmap Indices for Speeding Up End User Physics Analysis Main Results of Ph.D. Thesis Kurt Stockinger Database Group, IT-Division, CERN Formerly affiliated with: Institute of Computer Science and Business Informatics, University of Vienna, Austria

Page 1: Bitmap Indices for Speeding Up End User Physics Analysis

Bitmap Indices for Speeding Up End User Physics Analysis

Main Results of Ph.D. Thesis

Kurt Stockinger

Database Group, IT-Division, CERN

Formerly affiliated with:

Institute of Computer Science and Business Informatics,

University of Vienna, Austria

Page 2: Bitmap Indices for Speeding Up End User Physics Analysis

February 6, 2002

Outline

- Brief Overview of Index Data Structures
- Conventional Bitmap Indices:
  - Simple Bitmap Indices
  - Bitmap Encoding Techniques
  - Bitmap Compression
- Bitmap Indices for Scientific Data
- A Novel Bitmap Algorithm
- Towards a Cost Model for a Query Optimiser
- Features of My Bitmap Index Implementation
- Performance Benchmarks on Synthetic Data:
  - Verbatim Bitmap Indices
  - Compressed Bitmap Indices
- Performance Benchmarks on Real Data:
  - High Energy Physics
  - Sloan Digital Sky Server
- Conclusions

Page 3: Bitmap Indices for Speeding Up End User Physics Analysis

Brief Overview of Index Data Structures

- One-dimensional index data structures:
  - Total order for one dimension
  - Hash-based: optimised for exact match queries, e.g. jetE = 106
  - Tree-based: optimised for range queries, e.g. jetE < 106
  - Most widely used: B+-tree (1972)
- Multidimensional index data structures:
  - No total order for all dimensions
  - Hash-based: Grid-File, Bang-File, …
  - Tree-based: R-Trees, Pyramid-Tree, …
  - Bitmap Indices: applied in Data Warehouses for typical read-only environments

Page 4: Bitmap Indices for Speeding Up End User Physics Analysis

Simple Bitmap Indices (Equality Encoding)

a) List of attributes b) Bitmap Index (equality encoding)

a) List of 12 attributes with 10 distinct attribute values, i.e. attribute cardinality = 10

b) For each distinct attribute value, one bit slice is created, i.e. the bitmap index consists of 10 bit slices (E0 to E9)

Bit slice E2 encodes attributes with value 2
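The equality encoding described on this slide can be sketched in a few lines of Python. This is only an illustration: the column `values` is made-up data (12 entries, 10 distinct values), not the data shown on the slide, and bit slices are kept as plain lists of 0/1 rather than packed bits.

```python
def build_equality_index(values, cardinality):
    """Return one bit slice per attribute value.

    slices[v][row] is 1 exactly when values[row] == v.
    """
    slices = [[0] * len(values) for _ in range(cardinality)]
    for row, value in enumerate(values):
        slices[value][row] = 1
    return slices


# Hypothetical column: 12 entries, 10 distinct values (0..9).
values = [3, 2, 1, 2, 8, 2, 9, 0, 7, 5, 6, 4]
index = build_equality_index(values, cardinality=10)

# Bit slice E2 marks the rows whose value is 2.
e2 = index[2]
print(e2)
```

Every row sets exactly one bit across the 10 slices, which is why the index grows linearly with the attribute cardinality.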

Page 5: Bitmap Indices for Speeding Up End User Physics Analysis

Various Bitmap Encoding Techniques

a) list of attributes b) equality encoding c) range encoding

Attribute cardinality = 10

Range encoding optimised for one-sided range queries, e.g. a0 <= 2
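A minimal sketch of range encoding, assuming the usual definition in which bit slice v marks every row whose value is at most v. The column `values` is hypothetical sample data, and slices are plain Python lists rather than packed bitmaps.

```python
def build_range_index(values, cardinality):
    """Return range encoded bit slices.

    slices[v][row] is 1 exactly when values[row] <= v.
    """
    return [[1 if x <= v else 0 for x in values] for v in range(cardinality)]


# Hypothetical column with attribute cardinality 10 (values 0..9).
values = [3, 2, 1, 2, 8, 2, 9, 0, 7, 5, 6, 4]
slices = build_range_index(values, cardinality=10)

# The one-sided range query "a <= 2" is answered by reading a single slice.
hits = slices[2]
print(hits)
```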

Page 6: Bitmap Indices for Speeding Up End User Physics Analysis

Equality (EE) vs Range Encoding (RE)

Number of bit slice scans per query type:

Query type               EE       RE
exact match query        1        2
one-sided range query    |A|/2    1

Index size: |A| bit slices, where |A| is the attribute cardinality, i.e. number of distinct attribute values

One-sided range queries can be more efficiently handled with range encoded bitmap indices!
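The scan counts in the comparison can be reproduced with a small sketch that answers the same queries against both encodings and reports how many bit slices were read. The helper names and the sample column are assumptions made for illustration. The equality encoded range query here simply ORs the slices E0 to Ev, so it reads v + 1 slices, which is roughly |A|/2 when averaged over all possible v.

```python
def build_equality_index(values, cardinality):
    return [[1 if x == v else 0 for x in values] for v in range(cardinality)]


def build_range_index(values, cardinality):
    return [[1 if x <= v else 0 for x in values] for v in range(cardinality)]


def ee_exact(ee, v):
    """Exact match on equality encoding: one slice."""
    return ee[v], 1


def ee_at_most(ee, v):
    """a <= v on equality encoding: OR of slices E0..Ev."""
    result = [0] * len(ee[0])
    for bit_slice in ee[: v + 1]:
        result = [a | b for a, b in zip(result, bit_slice)]
    return result, v + 1


def rg_exact(rg, v):
    """Exact match on range encoding: R_v AND NOT R_(v-1), two slices."""
    if v == 0:
        return rg[0], 1
    return [a & (1 - b) for a, b in zip(rg[v], rg[v - 1])], 2


def rg_at_most(rg, v):
    """a <= v on range encoding: one slice."""
    return rg[v], 1


values = [3, 2, 1, 2, 8, 2, 9, 0, 7, 5, 6, 4]  # hypothetical column
ee = build_equality_index(values, 10)
rg = build_range_index(values, 10)

print(ee_exact(ee, 2)[1], rg_exact(rg, 2)[1])      # slices read for a == 2
print(ee_at_most(ee, 2)[1], rg_at_most(rg, 2)[1])  # slices read for a <= 2
```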

Page 7: Bitmap Indices for Speeding Up End User Physics Analysis

Pros and Cons of Bitmap Indices

Pros:
- Easy to build and to maintain
- Easy to identify records that satisfy a complex multi-attribute predicate (multi-dim. ad-hoc queries)
- Very space efficient for attributes with low cardinality (number of distinct attribute values, e.g. “Yes”, “No”)

Cons:
- Space inefficient for attributes with high cardinality
- A possible solution: Bitmap Compression

Page 8: Bitmap Indices for Speeding Up End User Physics Analysis

Bitmap Compression

Advantages:
- Less disk space for storing indices
- Indices can be read from disk faster into memory
- More indices can be cached in memory

Possible problems:
- Difficult to combine bitmap compression with optimal index design reported in the literature
- If bitmaps must be decompressed before performing Boolean operations, the decompression overhead might outweigh the advantages of compression

Page 9: Bitmap Indices for Speeding Up End User Physics Analysis

Various Bitmap Compression Algorithms

Run Length Encoding (RLE): one-sided (asymmetric) vs. two-sided (symmetric)

Gzip (Lempel-Ziv, LZ): verbatim (uncompressed) bitmap is compressed via zlib

ExpGol: variable bit length encoding (RLE-bitmap is compressed)

Byte-Aligned Bitmap Compression (BBC): variable byte length encoding (Oracle patent); one-sided vs. two-sided (BBC1 vs. BBC2)
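To make the one-sided vs. two-sided distinction concrete, here is a small run-length encoding sketch. It assumes the common reading that a one-sided scheme only encodes runs of 0-bits (the gap before each set bit), while a two-sided scheme encodes runs of both 0-bits and 1-bits. The function names and the output layout are illustrative only and are not the on-disk formats of RLE, ExpGol or BBC.

```python
def rle_two_sided(bits):
    """Encode runs of both 0s and 1s as (bit, run_length) pairs."""
    runs = []
    for b in bits:
        if runs and runs[-1][0] == b:
            runs[-1] = (b, runs[-1][1] + 1)
        else:
            runs.append((b, 1))
    return runs


def rle_decode(runs):
    """Expand (bit, run_length) pairs back into a verbatim bitmap."""
    out = []
    for b, n in runs:
        out.extend([b] * n)
    return out


def rle_one_sided(bits):
    """Encode only the 0-runs: the gap before each set bit, plus trailing zeros."""
    gaps, zeros = [], 0
    for b in bits:
        if b:
            gaps.append(zeros)
            zeros = 0
        else:
            zeros += 1
    return gaps, zeros


bitmap = [0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0]
print(rle_two_sided(bitmap))
print(rle_one_sided(bitmap))
```

Long runs of 1-bits, which are typical for range encoded slices, cost one entry per bit in the one-sided form but a single entry in the two-sided form.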

Page 10: Bitmap Indices for Speeding Up End User Physics Analysis

Algorithms for Boolean Operations on Compressed Bitmaps

[Johnson VLDB99]

- Basic:
  - Input (I): two verbatim bitmaps
  - Output (O): one verbatim bitmap
- Inplace:
  - I: one verbatim bitmap + one RLE, ExpGol or BBC-bitmap
  - O: one verbatim bitmap
- Direct:
  - I: two compressed bitmaps (RLE or BBC)
  - O: one compressed bitmap (RLE or BBC)
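The "Direct" variant can be illustrated with a Boolean AND that walks two run-length encoded bitmaps and emits a run-length encoded result without ever expanding them. This is a sketch on a simple (bit, run_length) representation chosen for readability; it is not the algorithm from [Johnson VLDB99] and not the BBC byte format.

```python
def rle_and(a, b):
    """AND two bitmaps given as lists of (bit, run_length) pairs.

    Both inputs are assumed to describe bitmaps of the same total length.
    The result is again a list of (bit, run_length) pairs.
    """
    out = []
    runs_a, runs_b = iter(a), iter(b)
    bit_a, left_a = next(runs_a, (0, 0))
    bit_b, left_b = next(runs_b, (0, 0))
    while left_a and left_b:
        n = min(left_a, left_b)          # overlap of the two current runs
        bit = bit_a & bit_b
        if out and out[-1][0] == bit:    # merge with the previous output run
            out[-1] = (bit, out[-1][1] + n)
        else:
            out.append((bit, n))
        left_a -= n
        left_b -= n
        if left_a == 0:
            bit_a, left_a = next(runs_a, (0, 0))
        if left_b == 0:
            bit_b, left_b = next(runs_b, (0, 0))
    return out


x = [(0, 3), (1, 5), (0, 4)]   # 000111110000
y = [(1, 4), (0, 2), (1, 6)]   # 111100111111
print(rle_and(x, y))
```

The work is proportional to the number of runs rather than the number of bits, which is the reason direct operations can offset the decompression overhead mentioned earlier.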

Page 12: Bitmap Indices for Speeding Up End User Physics Analysis

Bitmap Indices for Scientific Data

- Bitmap indices of commercial products (Oracle, Sybase, Informix) are optimised for discrete attribute values, e.g. integers
- However, scientific data is mostly non-discrete, e.g. floating-point values
- Using commercial bitmap indices for non-discrete values would produce one bit slice per distinct attribute value!
- Possible solutions:
  - Build function-based indices on top of commercial indices:
    - See evaluation of DB-Group on Oracle’s bitmap indices
    - However, Oracle uses equality encoded bitmap indices (not optimised for range queries)!
  - Develop your own range-based bitmap indices (topic of my Ph.D. thesis)

Page 13: Bitmap Indices for Speeding Up End User Physics Analysis

Range Encoding for Non-Discrete Attribute Values

Encoding of attribute ranges [0;140) rather than attribute values (7 logical but 6 physical bins)

Query processing: see next slide
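A sketch of range encoding over bins for floating-point values. Two assumptions are made here that the slide does not state: the range [0;140) is split into 7 equi-width bins of width 20, and "6 physical bins" is read as "the slice for the last bin would be all ones, so it is not stored". The sample values are invented.

```python
def build_binned_range_index(values, boundaries):
    """Range encode values over bins given by ascending upper boundaries.

    Slice k has bit i set exactly when values[i] < boundaries[k].
    The slice for the last boundary would contain only 1-bits, so it
    is omitted (assumption: this is the logical vs. physical bin gap).
    """
    return [[1 if x < upper else 0 for x in values] for upper in boundaries[:-1]]


# Assumed binning: 7 logical bins of width 20 covering [0;140).
boundaries = [20, 40, 60, 80, 100, 120, 140]
values = [5.0, 61.5, 62.9, 63.0, 19.9, 139.0, 40.0]  # hypothetical data

slices = build_binned_range_index(values, boundaries)
print(len(slices))   # number of physical bit slices
print(slices[2])     # rows with value < 60
```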

Page 14: Bitmap Indices for Speeding Up End User Physics Analysis

A Novel Bitmap Algorithm - GenericRangeEncoding

Extract candidate objects from the “candidate slice” via XOR with the “previous” bit slice, for the query: x < 63

[Figure: “candidate slice” XOR “previous” bit slice, showing hits and candidate objects, and the result after the “candidate check”]

Only these candidates need to be checked rather than all candidates in the “candidate slice”
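The XOR step can be sketched as follows for the query x < 63. This is an illustration of the idea on the slide, not the thesis implementation: the bin boundaries (width 20 over [0;140)), the sample values and all function names are assumptions, and the query constant is assumed to lie within the binned range.

```python
def build_binned_range_index(values, boundaries):
    """Slice k has bit i set exactly when values[i] < boundaries[k]."""
    return [[1 if x < upper else 0 for x in values] for upper in boundaries[:-1]]


def query_less_than(values, slices, boundaries, q):
    """Answer "x < q"; return (result bitmap, number of objects checked).

    Assumes 0 < q <= boundaries[-1].
    """
    n = len(values)
    # Candidate slice: first bin boundary that is >= q.
    k = next(i for i, upper in enumerate(boundaries) if upper >= q)
    candidate_slice = slices[k] if k < len(slices) else [1] * n
    # Previous slice: all of its set bits are sure hits.
    previous = slices[k - 1] if k > 0 else [0] * n
    # XOR isolates the objects that fall into the bin containing q.
    to_check = [c ^ p for c, p in zip(candidate_slice, previous)]

    result = list(previous)
    checked = 0
    for i, bit in enumerate(to_check):
        if bit:
            checked += 1                 # candidate check against base data
            if values[i] < q:
                result[i] = 1
    return result, checked


boundaries = [20, 40, 60, 80, 100, 120, 140]           # assumed binning
values = [5.0, 61.5, 62.9, 63.0, 19.9, 139.0, 40.0]    # hypothetical data
slices = build_binned_range_index(values, boundaries)

result, checked = query_less_than(values, slices, boundaries, 63)
print(result, checked)
```

Without the XOR, all five objects set in the candidate slice (x < 80) would have to be fetched; with it, only the three objects in [60;80) are checked.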

Page 15: Bitmap Indices for Speeding Up End User Physics Analysis

Towards a Cost Model for a Query Optimiser

Basic Idea:
- Before a query is executed, the Query Optimiser calculates the I/O costs for both access paths, namely the sequential scan and the query based on the bitmap index
- Given these costs, the Query Optimiser selects the access path with the lowest expected costs (cost-based Query Optimiser).

Approach for Cost Model based on GenericRangeEncoding:
- Given the query range and the binning strategy, calculate the expected I/O costs for checking the candidate objects against the query constraint
- Use stochastic model
- Note: We do not attempt to discuss the whole approach. For details refer to http://kurts.home.cern.ch/kurts/research/diss.ps

Page 16: Bitmap Indices for Speeding Up End User Physics Analysis

Cost Model #1: #Candidates per Dimension

- For discrete attribute values the main bottleneck is the “index scan”
- For non-discrete attribute values the main bottleneck is the “candidate check”, i.e. all candidate objects must be checked against the query constraint
- Simplifying assumption: equally distributed and independent data values
- Max. number of expected candidates (Ec) per indexed attribute: Ec = O/b, where O … #total_objects, b … #bit_slices
- e.g. 1,000,000 objects with 100 bins => 10,000 candidate objects

Page 17: Bitmap Indices for Speeding Up End User Physics Analysis

Cost Model #2: Page I/O for Candidates per Dimension

- Access granularity of database is one page rather than one object
- Thus, if one object is accessed, the whole page is read
- Costs for page I/O [O’Neil, Quass 1997]:

  C = ptot * [1 - e^(-Ec/ptot)]

  where ptot … total #pages of all objects, Ec … expected #candidate objects
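The two cost model formulas, Ec = O/b and C = ptot * [1 - e^(-Ec/ptot)], can be evaluated with a short sketch. The object and bin counts come from the example in the talk; the page count of 500 is a made-up figure used only to show the shape of the curve.

```python
import math


def expected_candidates(total_objects, bit_slices):
    """Ec = O / b, assuming equally distributed, independent values."""
    return total_objects / bit_slices


def expected_page_io(total_pages, candidates):
    """C = ptot * (1 - e^(-Ec / ptot))."""
    return total_pages * (1.0 - math.exp(-candidates / total_pages))


e_c = expected_candidates(1_000_000, 100)   # example from the talk
p_tot = 500                                 # hypothetical number of pages

print(e_c)
print(expected_page_io(p_tot, 100))   # few candidates: only part of the pages
print(expected_page_io(p_tot, e_c))   # many candidates: close to all pages
```

When the number of candidates is much larger than the number of pages, C approaches ptot, i.e. the candidate check costs about as much as reading the whole attribute.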

Page 19: Bitmap Indices for Speeding Up End User Physics Analysis

My Bitmap Indices

- Bitmap Indices are built on top of Objectivity/DB
- Single Bit Slices are based on new version of HepODBMS Tags:
  - Persistent, scalable segmented VArrays called “sliced Tag” (column-wise clustering, see next slide)
  - Prefetch optimisation for concurrent reading
- “Base objects”, i.e. non-indexed data, are also stored as sliced Tag
- Query Preprocessor:
  - with Koen Holtman (Caltech/CMS): “any” mathematical (query) expression can be evaluated
  - E.g. Bitmaps “jet1E < 3.7 && sin(jet2Phi) > 0.3 && jet2E > 5.5”
- Bitmap Compression:
  - with Theodore Johnson (AT&T Labs-Research) – [VLDB99/00] + own enhancements of Boolean operations for two-sided BBC

Page 20: Bitmap Indices for Speeding Up End User Physics Analysis

Clustering of Generic vs. Sliced Tags in HepODBMS

[Figure: GenericTags (PAW: row-wise) store attr1, attr2, attr3 of each tag (tag0 to tag3) together; this is the “old” version. SlicedTags (PAW: column-wise) store all values of one attribute together (a1 a1 a1 a1, a2 a2 a2 a2, a3 a3 a3 a3); this is the “new” version, not released yet.]
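The difference between row-wise and column-wise clustering can be sketched with plain Python containers. The attribute names jet1E, jet2Phi and jet2E are taken from the sample query expression in the talk; the extra attribute `nTracks`, the sample values and the function names are invented, and nothing here reflects the actual HepODBMS classes.

```python
import math

# Row-wise ("generic tag") layout: all attributes of one object together.
rows = [
    {"jet1E": 2.0, "jet2Phi": 1.0, "jet2E": 6.1, "nTracks": 12},
    {"jet1E": 4.0, "jet2Phi": 1.0, "jet2E": 7.0, "nTracks": 7},
    {"jet1E": 1.0, "jet2Phi": 0.1, "jet2E": 9.0, "nTracks": 30},
    {"jet1E": 3.0, "jet2Phi": 2.0, "jet2E": 5.0, "nTracks": 4},
    {"jet1E": 3.6, "jet2Phi": 1.5, "jet2E": 5.6, "nTracks": 9},
]


def to_columns(records):
    """Column-wise ("sliced tag") layout: all values of one attribute together."""
    return {name: [r[name] for r in records] for name in records[0]}


def select(columns):
    """Evaluate jet1E < 3.7 && sin(jet2Phi) > 0.3 && jet2E > 5.5.

    Only the three referenced columns are touched; nTracks is never read.
    """
    needed = zip(columns["jet1E"], columns["jet2Phi"], columns["jet2E"])
    return [
        i
        for i, (e1, phi, e2) in enumerate(needed)
        if e1 < 3.7 and math.sin(phi) > 0.3 and e2 > 5.5
    ]


columns = to_columns(rows)
print(select(columns))
```

With column-wise clustering a query over 3 of N attributes reads roughly 3/N of the data, whereas the row-wise layout forces every attribute of every object through the page cache.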

Page 22: Bitmap Indices for Speeding Up End User Physics Analysis

Definitions and Assumptions for Verbatim Bitmap Indices

- First set of tests is based on 1,000,000 base objects with 25 attributes (dimensions)
- Attributes are clustered together (sliced Tag alias column-wise clustering)
- Attribute values are equally distributed and independent, and in the range of [0;100]
- Bitmap Index (BMI):
  - 100 equi-width bins per dimension => Size of BMI ~3 times the size of the base objects
- Query selectivity per attribute (dimension):
  - #selected_attribute_values/#total_attribute_values (per dimension)
  - e.g. a3 < 30 => 30 % selectivity
- Total query selectivity:
  - #selected_objects/#total_objects
  - e.g. a3 < 30 && a7 > 40 => 18 % selectivity (30 % × 60 %)
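Under the stated assumptions (equally distributed, independent values in [0;100]), the total selectivity of a conjunctive query is the product of the per-attribute selectivities. The sketch below uses its own example predicates and hypothetical function names; `math.prod` requires Python 3.8 or later.

```python
import math


def attribute_selectivity(lo, hi, domain_lo=0.0, domain_hi=100.0):
    """Fraction of a uniform [domain_lo; domain_hi] attribute inside [lo; hi]."""
    lo = max(lo, domain_lo)
    hi = min(hi, domain_hi)
    return max(hi - lo, 0.0) / (domain_hi - domain_lo)


def total_selectivity(per_attribute):
    """Product of per-attribute selectivities (independence assumed)."""
    return math.prod(per_attribute)


# Two-dimensional example: a1 < 50 && a2 < 20.
s1 = attribute_selectivity(0, 50)
s2 = attribute_selectivity(0, 20)
print(total_selectivity([s1, s2]))

# d-dimensional query with the same selectivity x in every dimension: x ** d.
print(total_selectivity([0.5] * 5))
```

The second call shows why high-dimensional queries become very selective: even 50 % per dimension leaves only about 3 % of the objects after five dimensions.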

Page 23: Bitmap Indices for Speeding Up End User Physics Analysis

5-Dimensional Query - Page I/O & Response Time

[Figure: page I/O and response time, bitmap index vs. sequential scan; total query sel. = x^5]

Max. speed-up of BMI relative to seq. scan: ~ factor 2

Note: All benchmarks in this talk are performed on cold disk cache!

Page 24: Bitmap Indices for Speeding Up End User Physics Analysis

10-Dimensional Query - Page I/O & Response Time

[Figure: page I/O and response time, bitmap index vs. sequential scan; total query sel. = x^10]

Max. speed-up of BMI relative to seq. scan: ~ factor 3

Page 25: Bitmap Indices for Speeding Up End User Physics Analysis

25-Dimensional Query - Page I/O & Response Time

[Figure: page I/O and response time, bitmap index vs. sequential scan; total query sel. = x^25]

Max. speed-up of BMI relative to seq. scan: ~ factor 5

Page 26: Bitmap Indices for Speeding Up End User Physics Analysis

Assumptions for Compressed Bitmap Indices

1,000,000 base objects with 25 attributes (dimensions)

Attribute values are exponentially distributed and independent

Bitmap Index (BMI): 100 equi-width bins per dimension => Size of BMI ~3 times the size of the base objects
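The "~3 times" figure for the verbatim (uncompressed) index can be checked with back-of-the-envelope arithmetic. The sketch assumes 4-byte attribute values and ignores any storage overhead of the database; neither assumption is stated in the talk.

```python
def base_data_bytes(objects, attributes, bytes_per_value=4):
    """Size of the base objects, assuming fixed-width values (assumption)."""
    return objects * attributes * bytes_per_value


def verbatim_index_bytes(objects, attributes, bins):
    """One bit per object and bin for every indexed attribute."""
    return objects * attributes * bins / 8


data = base_data_bytes(1_000_000, 25)
index = verbatim_index_bytes(1_000_000, 25, bins=100)

print(data / 1e6, "MB base data")
print(index / 1e6, "MB verbatim index")
print(index / data, "ratio")
```

The number of objects and attributes cancels out, so under these assumptions the ratio depends only on the number of bins and the width of a value: 100 bits of index per 32 bits of data.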

Page 27: Bitmap Indices for Speeding Up End User Physics Analysis

2-Sided Byte Aligned Bitmap Compression (BBC2)

Exponential data distribution => Good compression ratio

Range Encoded Bitmap Index

Page 28: Bitmap Indices for Speeding Up End User Physics Analysis

Verbatim vs Compressed (BBC2) Bitmap Indices

Advantage of compressed bitmap index

Page 30: Bitmap Indices for Speeding Up End User Physics Analysis

Specific HEP Data

Physics data: 1,401,020 Tags with 37 attributes (in Objectivity)

Data Size: 262 MB

Index Size: 790 MB (37 dimensions with 100 bins each)

Page 31: Bitmap Indices for Speeding Up End User Physics Analysis

Distribution Functions of Specific HEP Data

[Figure: data distribution of 4 different physics attributes]

Range Encoded BMIs with 100 bins

Page 32: Bitmap Indices for Speeding Up End User Physics Analysis

BMI Results for Specific HEP Data

- For the particular queries we studied, we got a performance improvement of a factor of two for 10-dimensional queries (as compared to the sequential scan) based on bitmap indices with 100 bins (~3 times the size of base objects)
- Tests based on real data with synthetic queries
- However, as we have seen, all the results are relative and highly dependent on:
  a) Data distribution
  b) Access patterns
  c) Binning strategy – which should reflect a) and b)
- For higher dimensional queries the performance improvement can be even more significant!

Page 33: Bitmap Indices for Speeding Up End User Physics Analysis

Specific Sloan Digital Sky Server (SDSS) Data

Sloan Digital Sky Server: 6,182,527 real astronomy objects (on top of Objectivity)

Extraction of these objects and porting to sliced tags with bitmap indices

In total: 65 bitmap indices (one index for each attribute)

Data size (base objects): ~2 GB

Index size: ~5.2 GB

Page 34: Bitmap Indices for Speeding Up End User Physics Analysis

SDSS Sample Queries

From 357 query logs of 41 users, 49 queries are based on this data set (sxGalaxy).

3 typical multi-dimensional ones:

Q1: SELECT g,r,i FROM sxGalaxy
    WHERE ((RA() between 180 and 185) && (DEC() between 1. and 1.2) && (r between 10 and 18) && (i between 10 and 18) && (g between 10 and 18))

Q2: SELECT g,r,i FROM sxGalaxy
    WHERE ((g-r between 1.05 and 1.13) && (r-i between 0.42 and 0.51) && (r between 15.68 and 19.68))

Q3: SELECT u,g,r FROM sxGalaxy
    WHERE ((u-g between 0.0 and 0.75) && (g-r between 0.0 and 0.5) && (u between 18 and 23) && (g between 18 and 23) && (r between 18 and 23) && ((u-g)/(g-r) between 0.8 and 1.2))

Page 35: Bitmap Indices for Speeding Up End User Physics Analysis

BMI Results for Specific SDSS Data

Speedup factor of queries against bitmap indices over queries against Sloan Sky Server:
- Q1: speedup factor ~10
- Q2: speedup factor ~20
- Q3: speedup factor ~15

Reason for better performance of bitmap indices:
- Better clustering of base objects - attribute-wise rather than object-wise
- Low selectivity queries require fewer page I/Os than Sloan Queries

Page 36: Bitmap Indices for Speeding Up End User Physics Analysis

Conclusions

- Depending on the data distribution, the query access pattern and the binning strategy, bitmap indices can significantly improve the response time of high-dimensional queries
- Detailed results can be found in Ph.D. thesis: http://kurts.home.cern.ch/kurts/research/diss.ps
- Future work:
  - Collaboration with Arie Shoshani and John Wu from LBNL @ Berkeley to further improve query response time & bitmap compression
  - Improve Cost Model for Query Optimiser to increase accuracy of predictions of I/O costs for queries against real data with various binning strategies