FastBit for Allele Data Dave Matthews USDA-ARS, Cornell University Ithaca, NY 10 April 2012.

Page 1:

FastBit for Allele Data

Dave Matthews
USDA-ARS, Cornell University

Ithaca, NY

10 April 2012

Page 2:

A Lightning-Fast Index Drives Massive Data Analysis

http://www.scidacreview.org/0904/html/fastbit.html

FastBit significantly improves the speed of a searching operation on both high- and low-cardinality values with a number of techniques, including a vertical data organization, an innovative bitmap compression technique, and several new bitmap encoding methods... The ability to index high-cardinality data is unique to FastBit and is not supported by other bitmap indexing methods.

Page 3:

Allele Data Variables

Allele = f(Marker, Line, Experiment)

              Allele   Marker   Line   Experiment
Size:         10^9     10^4     10^4   10^1
Cardinality:  2        =        =      =

Page 4:

Bitmap Indexing

Page 5:

The FastBit Technologies

1. vertical data organization = 'vertical partitioning'. Only a few of the (hundreds of) variables in each partition.

2. bitmap compression: Word-Aligned Hybrid Compression

3. two-level bitmap encoding

Page 6:

Word-aligned Hybrid Compression

• run-length encoding

• 31-bit groups
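The two bullets above can be illustrated with a small sketch. It assumes the commonly described WAH word layout: 32-bit words, where a literal word has its top bit 0 followed by 31 raw bits, and a fill word has its top bit 1, then the fill bit, then a 30-bit count of identical 31-bit groups. The function names are hypothetical, and padding the last group with zeros is a simplification of how a real implementation handles a partial final group.

```python
GROUP_BITS = 31                  # payload bits carried by each 32-bit word
MAX_RUN = (1 << 30) - 1          # largest group count a fill word can hold
ALL_ONES = (1 << GROUP_BITS) - 1


def wah_encode(bits):
    """Compress a list of 0/1 values into a list of 32-bit WAH-style words."""
    # Simplification: pad with zeros to a whole number of 31-bit groups.
    padded = list(bits) + [0] * (-len(bits) % GROUP_BITS)
    words = []
    for start in range(0, len(padded), GROUP_BITS):
        group = padded[start:start + GROUP_BITS]
        value = int("".join(map(str, group)), 2)
        if value == 0 or value == ALL_ONES:
            # Uniform group: run-length encode it as a fill word.
            fill_bit = 1 if value else 0
            header = (1 << 31) | (fill_bit << 30)
            same_fill = bool(words) and (words[-1] >> 30) == (header >> 30)
            if same_fill and (words[-1] & MAX_RUN) < MAX_RUN:
                words[-1] += 1            # extend the current run by one group
            else:
                words.append(header | 1)  # start a new run of length 1
        else:
            words.append(value)           # literal word: top bit 0 + 31 raw bits
    return words


def wah_decode(words, n_bits):
    """Expand WAH-style words back into the first n_bits 0/1 values."""
    bits = []
    for w in words:
        if w >> 31:                       # fill word
            fill_bit = (w >> 30) & 1
            bits.extend([fill_bit] * ((w & MAX_RUN) * GROUP_BITS))
        else:                             # literal word, most significant bit first
            bits.extend((w >> i) & 1 for i in range(GROUP_BITS - 1, -1, -1))
    return bits[:n_bits]


# A sparse bitmap: one set bit, a long run of zeros, then a few mixed bits.
bits = [1] + [0] * (31 * 100) + [1, 0, 1]
words = wah_encode(bits)
```

Here 3104 input bits collapse into three words: a literal, one fill word covering 99 all-zero groups, and a final literal.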

Page 7:

Two-level Bitmap Encoding

• Approximate solution, then refine.

• Bin the values into groups, e.g. A to G, H to P, Q to Z.

• Encode the bin identifiers as bitmap.

• Encodings: equality, range, interval.
  – Interval has half the number of bitmap indexes.

• Multicomponent encoding: Bin the bins to reduce number of bitmap indexes.

• Multi-level encoding: hierarchy of bins, coarse to fine. Use interval encoding for coarse, equality for fine.
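The "approximate solution, then refine" idea can be sketched as follows. This is an illustration only, not FastBit's API: the function names, sample values and bin edges are made up, Python integers stand in for bitmaps, and only equality encoding of the bin identifiers is shown (range, interval and multi-level encodings are not).

```python
from bisect import bisect_right


def build_binned_index(values, edges):
    """Equality-encode bin identifiers: one bitmap (a Python int) per bin.

    Bin i holds values v with edges[i] <= v < edges[i + 1].
    Every value is assumed to lie inside [edges[0], edges[-1]).
    """
    bitmaps = [0] * (len(edges) - 1)
    for row, v in enumerate(values):
        bin_id = bisect_right(edges, v) - 1
        bitmaps[bin_id] |= 1 << row       # set this row's bit in its bin's bitmap
    return bitmaps


def range_query(values, edges, bitmaps, lo, hi):
    """Return the rows with lo <= value < hi."""
    sure, maybe = 0, 0
    for bin_id, bitmap in enumerate(bitmaps):
        bin_lo, bin_hi = edges[bin_id], edges[bin_id + 1]
        if lo <= bin_lo and bin_hi <= hi:
            sure |= bitmap                # bin lies entirely inside the range
        elif bin_lo < hi and lo < bin_hi:
            maybe |= bitmap               # edge bin: only some rows may qualify
    # Refine: check the raw values, but only for rows in the edge bins.
    hits = sure
    for row in range(len(values)):
        if (maybe >> row) & 1 and lo <= values[row] < hi:
            hits |= 1 << row
    return [row for row in range(len(values)) if (hits >> row) & 1]


values = [3, 17, 42, 8, 25, 39, 11]
edges = [0, 10, 20, 30, 40, 50]           # five bins of width 10
bitmaps = build_binned_index(values, edges)
result = range_query(values, edges, bitmaps, 8, 30)   # rows [1, 3, 4, 6]
```

Only the two rows in the edge bin [0, 10) are compared against the raw data; the bins [10, 20) and [20, 30) are answered from the bitmaps alone.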

Page 8:

Indexing Bin Identifiers

Page 9:

Querying on more than one variable

FastBit performs extremely well on multi-variable queries because the intersection between the search results on each variable is a simple AND operation over the resulting bitmaps.
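A minimal sketch of that AND step, using Python integers as uncompressed bitmaps. The column contents and helper names are hypothetical, and this is not FastBit's own query interface.

```python
def build_index(column):
    """Equality-encoded bitmap index: one bitmap (a Python int) per distinct value."""
    index = {}
    for row, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << row)
    return index


def rows_of(bitmap):
    """List the row numbers whose bit is set in the bitmap."""
    rows, row = [], 0
    while bitmap:
        if bitmap & 1:
            rows.append(row)
        bitmap >>= 1
        row += 1
    return rows


# Hypothetical allele table stored column by column.
marker = ["m1", "m1", "m2", "m2", "m1", "m2"]
line = ["L1", "L2", "L1", "L2", "L3", "L3"]
allele = ["A", "B", "A", "A", "A", "B"]

marker_index = build_index(marker)
allele_index = build_index(allele)

# marker == "m1" AND allele == "A": one bitwise AND over the two result bitmaps.
hits = marker_index.get("m1", 0) & allele_index.get("A", 0)
matching_rows = rows_of(hits)             # rows 0 and 4
```

Each additional condition adds one more bitmap lookup and one more AND, regardless of how many rows the table has.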

Page 10:

Performance

Page 11:

Instructions

http://crd-legacy.lbl.gov/~kewu/fastbit/doc/quickstart.html