Support Data Sampling Using Bitmap Indices over Scientific Dataset
![Page 1: Support Data Sampling Using Bitmap Indices over Scientific Dataset](https://reader035.fdocuments.in/reader035/viewer/2022062315/56816257550346895dd2a6c5/html5/thumbnails/1.jpg)
Support Data Sampling Using Bitmap Indices over Scientific Dataset
Yu Su*, Gagan Agrawal*, Jon Woodring†
*The Ohio State University
†Los Alamos National Lab
![Page 2: Support Data Sampling Using Bitmap Indices over Scientific Dataset](https://reader035.fdocuments.in/reader035/viewer/2022062315/56816257550346895dd2a6c5/html5/thumbnails/2.jpg)
Outline
• Motivation and Introduction
• Background
• System Overview
• Index Sampling and Optimizations
• Experiment Results
• Conclusion
![Page 3: Support Data Sampling Using Bitmap Indices over Scientific Dataset](https://reader035.fdocuments.in/reader035/viewer/2022062315/56816257550346895dd2a6c5/html5/thumbnails/3.jpg)
Motivation
• Science is becoming increasingly data driven;
• Strong requirement for efficient data analysis;
• Challenges:
  – Fast data generation speed
  – Slow disk IO and network speed
  – Some numbers from the Roadrunner EC3 simulation
    • 4000^3 particles, 36 bytes per particle => 2.3 TB per time step
    • 10 GB/s
    • A 230-times difference, and bigger in the future
• Extremely hard to analyze or visualize the entire data
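The figures on this slide can be checked with a few lines of arithmetic. The sketch below assumes decimal units (1 TB = 10^12 bytes, 1 GB = 10^9 bytes) and treats the 10 GB/s figure as a sustained transfer rate; both are assumptions, since the slide does not say. Under that reading, the 230 figure matches the number of seconds needed to move one time step.

```python
# Back-of-the-envelope check of the numbers quoted on the slide.
particles = 4000 ** 3              # particles per time step
bytes_per_particle = 36
step_bytes = particles * bytes_per_particle

bandwidth = 10 * 10 ** 9           # 10 GB/s in decimal gigabytes (assumption)
seconds_per_step = step_bytes / bandwidth

print(f"{step_bytes / 1e12:.2f} TB per time step")
print(f"{seconds_per_step:.0f} s to move one time step at 10 GB/s")
```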
![Page 4: Support Data Sampling Using Bitmap Indices over Scientific Dataset](https://reader035.fdocuments.in/reader035/viewer/2022062315/56816257550346895dd2a6c5/html5/thumbnails/4.jpg)
Existing Data Management Methods
Simple Request
Advanced Request
Challenges?
No subsetting request?
Data subset still big?
Server-side Subsetting
Client-side Subsetting
![Page 5: Support Data Sampling Using Bitmap Indices over Scientific Dataset](https://reader035.fdocuments.in/reader035/viewer/2022062315/56816257550346895dd2a6c5/html5/thumbnails/5.jpg)
Server-side Data Sampling
• Statistical Sampling Techniques:
  – Sampling is concerned with the selection of a subset of individuals from within a statistical population to estimate characteristics of the whole population.
• Examples:
  – Simple Random Sampling
  – Stratified Random Sampling
• Information Loss is Unavoidable
• Error Metrics:
  – Mean, Variance
  – Histogram
  – QQPlot
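As an illustration of the two sampling techniques and the simplest error metrics named above, here is a minimal sketch using only the Python standard library. The function names, the choice of contiguous strata, and the synthetic Gaussian data are all assumptions made for the example; the slides do not prescribe them.

```python
import random
import statistics

def simple_random_sample(data, fraction, rng):
    """Pick a fraction of the elements uniformly at random."""
    k = max(1, int(len(data) * fraction))
    return rng.sample(data, k)

def stratified_random_sample(data, fraction, num_strata, rng):
    """Split data into contiguous strata and sample the same fraction from each."""
    size = -(-len(data) // num_strata)  # ceiling division
    sample = []
    for start in range(0, len(data), size):
        stratum = data[start:start + size]
        k = max(1, int(len(stratum) * fraction))
        sample.extend(rng.sample(stratum, k))
    return sample

def mean_variance_error(data, sample):
    """Absolute differences in mean and variance between data and sample."""
    return (abs(statistics.fmean(data) - statistics.fmean(sample)),
            abs(statistics.pvariance(data) - statistics.pvariance(sample)))

rng = random.Random(0)
data = [rng.gauss(10.0, 2.0) for _ in range(10_000)]   # synthetic data
srs = simple_random_sample(data, 0.01, rng)
strat = stratified_random_sample(data, 0.01, 10, rng)
print(len(srs), mean_variance_error(data, srs))
print(len(strat), mean_variance_error(data, strat))
```

Computing these metrics requires a pass over both the original data and the sample.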
![Page 6: Support Data Sampling Using Bitmap Indices over Scientific Dataset](https://reader035.fdocuments.in/reader035/viewer/2022062315/56816257550346895dd2a6c5/html5/thumbnails/6.jpg)
Data Sampling Challenges
• Challenges in Scientific Data Management:
  – Data Accuracy: existing methods fail to consider data features.
    • Data Value Distribution
    • Data Spatial Locality
  – Error calculation is time-consuming.
  – Can't support sampling over flexible data subsets
  – Data has to be reorganized
• Bitmap indexing has been widely used
  – Supports efficient subsetting over values
  – FastBit, FastQuery, our ICPP work
![Page 7: Support Data Sampling Using Bitmap Indices over Scientific Dataset](https://reader035.fdocuments.in/reader035/viewer/2022062315/56816257550346895dd2a6c5/html5/thumbnails/7.jpg)
Our Solution
• A server-side subsetting and sampling framework.
  – Standard SQL interface
  – Data Subsetting: Dimensions, Values
    • TEMP(longitude, latitude, depth);
  – Flexible sampling mechanism
• Support Data Sampling over Bitmap Indices
  – No data reorganization is needed
  – Generate accurate error metrics results
  – Support Error Prediction before sampling the data
  – Support data sampling over flexible data subsets
![Page 8: Support Data Sampling Using Bitmap Indices over Scientific Dataset](https://reader035.fdocuments.in/reader035/viewer/2022062315/56816257550346895dd2a6c5/html5/thumbnails/8.jpg)
Background: Bitmap Indexing
• Widely used in Scientific Data Management
• Suitable for floating-point values by binning small ranges
• Run-Length Compression (WAH, BBC)
  – Compress each bitvector based on continuous runs of 0s or 1s
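To make the idea concrete, the sketch below builds a binned bitmap index over a handful of floating-point values and run-length encodes each bitvector. It is only an illustration: the bin edges and data are made up, and the encoder is plain run-length encoding rather than the word-aligned WAH or byte-aligned BBC schemes named on the slide.

```python
from itertools import groupby

def build_bitmap_index(values, bin_edges):
    """Return one bitvector per bin; bin i covers [bin_edges[i], bin_edges[i+1])."""
    num_bins = len(bin_edges) - 1
    bitvectors = [[0] * len(values) for _ in range(num_bins)]
    for pos, v in enumerate(values):
        for b in range(num_bins):
            if bin_edges[b] <= v < bin_edges[b + 1]:
                bitvectors[b][pos] = 1   # position pos holds a value in bin b
                break
    return bitvectors

def run_length_encode(bits):
    """Collapse consecutive equal bits into (bit, run_length) pairs."""
    return [(bit, len(list(run))) for bit, run in groupby(bits)]

values = [1.2, 1.3, 1.25, 3.7, 3.9, 1.1, 2.5, 2.4]   # hypothetical data
edges = [1.0, 2.0, 3.0, 4.0]                          # hypothetical bin edges
index = build_bitmap_index(values, edges)
for b, bits in enumerate(index):
    print(f"[{edges[b]}, {edges[b + 1]}):", bits, run_length_encode(bits))
```

Each position has exactly one bit set across all bitvectors, so the set of bitvectors encodes the value distribution while the bit order preserves the original data layout.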
![Page 9: Support Data Sampling Using Bitmap Indices over Scientific Dataset](https://reader035.fdocuments.in/reader035/viewer/2022062315/56816257550346895dd2a6c5/html5/thumbnails/9.jpg)
System Architecture
![Page 10: Support Data Sampling Using Bitmap Indices over Scientific Dataset](https://reader035.fdocuments.in/reader035/viewer/2022062315/56816257550346895dd2a6c5/html5/thumbnails/10.jpg)
Data Sampling Using Bitmap Indices
• Features:
  – Different bitvectors reflect the value distribution;
  – Each bitvector keeps the data locality;
    • Row major, Column major
    • Hilbert Curve, Z-order Curve
• Method:
  – Perform stratified sampling within each bitvector;
  – Multi-level indexing generates multi-level samples;
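The locality point depends on how multi-dimensional cells are linearized into bit positions. As one example of the orderings listed above, here is a small sketch of a 2-D Z-order (Morton) encoding; the function name and the 4x4 grid are illustrative assumptions, not part of the presented system.

```python
def morton_encode_2d(x, y, bits=16):
    """Interleave the bits of x and y (x in even positions, y in odd positions)."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code

# Linearize a 4x4 grid: cells listed in row-major order, then re-sorted by Z-order.
cells = [(x, y) for y in range(4) for x in range(4)]
z_order = sorted(cells, key=lambda c: morton_encode_2d(c[0], c[1]))
print(z_order[:8])
```

Under Z-order, each aligned 2x2 block of cells occupies four consecutive positions, so a contiguous stride of a bitvector tends to cover a compact spatial region.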
![Page 11: Support Data Sampling Using Bitmap Indices over Scientific Dataset](https://reader035.fdocuments.in/reader035/viewer/2022062315/56816257550346895dd2a6c5/html5/thumbnails/11.jpg)
Stratified Sampling over Bitvectors
S1: Index Generation
S2: Divide Bitvector into Equal Strides
S3: Randomly select a certain % of 1's out of each stride
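Steps S2 and S3 can be sketched as follows for a single, uncompressed bitvector. The function name, the rounding rule, and the choice to take at least one sample from every non-empty stride are assumptions made for this illustration; the slides do not specify them.

```python
import random

def sample_bitvector(bits, stride, fraction, rng):
    """Return positions sampled from the 1-bits of `bits`, stride by stride."""
    sampled = []
    for start in range(0, len(bits), stride):            # S2: equal strides
        end = min(start + stride, len(bits))
        ones = [p for p in range(start, end) if bits[p]]
        if not ones:
            continue                                      # nothing to sample here
        k = max(1, round(len(ones) * fraction))           # assumption: at least one
        sampled.extend(sorted(rng.sample(ones, k)))       # S3: random pick of 1s
    return sampled

bits = [1, 1, 0, 1,  0, 0, 0, 0,  1, 0, 1, 1,  1, 1, 1, 1]   # hypothetical bitvector
rng = random.Random(42)
picked = sample_bitvector(bits, stride=4, fraction=0.5, rng=rng)
print(picked)
```

Because every stride contributes in proportion to its own count of 1s, the sample follows the spatial spread of the bin, and repeating this for every bitvector follows the value distribution.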
![Page 12: Support Data Sampling Using Bitmap Indices over Scientific Dataset](https://reader035.fdocuments.in/reader035/viewer/2022062315/56816257550346895dd2a6c5/html5/thumbnails/12.jpg)
Error Prediction
• Calculate errors based on bins instead of samples
  – Indices classify the data into bins;
  – Each bin corresponds to one value or value range;
  – Find a representative value for each bin: Vi;
  – Equal probability is forced for each bin;
  – Compute the number of samples within each bin: Ci;
  – Predict error metrics based on Vi and Ci;
• Representative Value:
  – Small Bin: mean or median value
  – Big Bin: lower-bound, upper-bound, mean value
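A minimal sketch of bin-based prediction for the mean and variance follows. It assumes the predicted statistics are the count-weighted mean and variance of the representative values Vi with weights Ci; this is a natural reading of the bullets above, not a formula taken from the slides. The bin values and counts are hypothetical.

```python
def predict_mean_variance(rep_values, counts):
    """Count-weighted mean and variance of per-bin representative values."""
    n = sum(counts)
    mean = sum(v * c for v, c in zip(rep_values, counts)) / n
    var = sum(c * (v - mean) ** 2 for v, c in zip(rep_values, counts)) / n
    return mean, var

rep_values = [1.5, 2.5, 3.5]        # Vi: one representative value per bin
full_counts = [405, 398, 197]       # elements per bin in the full data
fraction = 0.01                     # same sampling percentage for every bin
sample_counts = [round(c * fraction) for c in full_counts]   # Ci per bin

full_mean, full_var = predict_mean_variance(rep_values, full_counts)
samp_mean, samp_var = predict_mean_variance(rep_values, sample_counts)
print("predicted mean error:", abs(full_mean - samp_mean))
print("predicted variance error:", abs(full_var - samp_var))
```

The cost depends only on the number of bins, not on the number of data elements or samples, which is why prediction can run before any data is sampled.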
![Page 13: Support Data Sampling Using Bitmap Indices over Scientific Dataset](https://reader035.fdocuments.in/reader035/viewer/2022062315/56816257550346895dd2a6c5/html5/thumbnails/13.jpg)
Error Prediction Metadata
Mean, Variance, Histogram, QQPlot
Mean, Variance over Strides
![Page 14: Support Data Sampling Using Bitmap Indices over Scientific Dataset](https://reader035.fdocuments.in/reader035/viewer/2022062315/56816257550346895dd2a6c5/html5/thumbnails/14.jpg)
Error Prediction Formula (1)
• Mean, Variance:
• Histogram:
![Page 15: Support Data Sampling Using Bitmap Indices over Scientific Dataset](https://reader035.fdocuments.in/reader035/viewer/2022062315/56816257550346895dd2a6c5/html5/thumbnails/15.jpg)
Error Prediction Formula (2)
• QQPlot
![Page 16: Support Data Sampling Using Bitmap Indices over Scientific Dataset](https://reader035.fdocuments.in/reader035/viewer/2022062315/56816257550346895dd2a6c5/html5/thumbnails/16.jpg)
Data Subsetting + Sampling
S1: Find value subset: Val = 1.2
S2: Find spatial ID subset: ID = (11, 21)
S3: Perform sampling on subset
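The three steps can be sketched on an uncompressed index as below. The dictionary-based index, the inclusive reading of the ID range (11, 21), and the bit patterns are assumptions for illustration only.

```python
import random

def subset_and_sample(index, value, id_range, fraction, rng):
    """Sample positions whose value bin matches and whose ID lies in id_range."""
    bitvector = index[value]                                   # S1: value subset
    lo, hi = id_range
    candidates = [p for p in range(lo, hi + 1) if bitvector[p]]  # S2: spatial IDs
    if not candidates:
        return []
    k = max(1, round(len(candidates) * fraction))
    return sorted(rng.sample(candidates, k))                   # S3: sample subset

# Hypothetical index over 32 positions with two value bins.
index = {
    1.2: [1 if p % 2 == 0 else 0 for p in range(32)],
    3.5: [1 if p % 2 == 1 else 0 for p in range(32)],
}
rng = random.Random(7)
picked = subset_and_sample(index, 1.2, (11, 21), 0.4, rng)
print(picked)
```

Only the bitvectors selected by the value predicate need to be loaded, and only the bits inside the ID range need to be scanned.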
![Page 17: Support Data Sampling Using Bitmap Indices over Scientific Dataset](https://reader035.fdocuments.in/reader035/viewer/2022062315/56816257550346895dd2a6c5/html5/thumbnails/17.jpg)
Multi-attributes Subsetting and Sampling Support
S1: Generate value intervals for each attribute
S2: Combine single-value intervals into mbins
S3: Generate bitmap indices based on mbins
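One way to read these steps is that an mbin is a combination of one value interval per attribute, and each mbin gets its own bitvector. The sketch below follows that reading; the Cartesian-product construction, the helper names, and the two-attribute data are assumptions rather than details from the slides.

```python
from itertools import product

def interval_of(value, edges):
    """Index of the interval [edges[i], edges[i+1]) that contains value."""
    for i in range(len(edges) - 1):
        if edges[i] <= value < edges[i + 1]:
            return i
    raise ValueError(f"{value} is outside the binned range")

def build_mbin_index(records, edges_per_attr):
    """Map each combination of per-attribute intervals (an mbin) to a bitvector."""
    # S2: every combination of single-attribute intervals is one mbin
    mbins = product(*(range(len(e) - 1) for e in edges_per_attr))
    index = {m: [0] * len(records) for m in mbins}
    # S3: set the bit of the mbin each record falls into
    for pos, rec in enumerate(records):
        key = tuple(interval_of(v, e) for v, e in zip(rec, edges_per_attr))
        index[key][pos] = 1
    return index

# S1: hypothetical value intervals for two attributes
edges_per_attr = [[0.0, 1.0, 2.0], [0.0, 5.0, 10.0]]
records = [(0.5, 1.0), (1.5, 7.0), (0.2, 6.0), (1.9, 9.9)]
index = build_mbin_index(records, edges_per_attr)
for mbin, bits in index.items():
    print(mbin, bits)
```

The number of mbins grows as the product of the per-attribute interval counts, so coarse intervals are needed when many attributes are combined.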
![Page 18: Support Data Sampling Using Bitmap Indices over Scientific Dataset](https://reader035.fdocuments.in/reader035/viewer/2022062315/56816257550346895dd2a6c5/html5/thumbnails/18.jpg)
Experiment Setup
• Environment:
  – Darwin Cluster: 120 nodes, 48 cores, 64 GB memory
• Dataset:
  – Ocean Data: Regular Multi-dimensional Dataset
  – Cosmos Data: Discrete Points with 7 attributes
• Sampling Method:
  – Simple Random Method
  – Simple Stratified Random Method
  – KDTree Stratified Random Method
  – Big Bin Index Random Method
  – Small Bin Index Random Method
![Page 19: Support Data Sampling Using Bitmap Indices over Scientific Dataset](https://reader035.fdocuments.in/reader035/viewer/2022062315/56816257550346895dd2a6c5/html5/thumbnails/19.jpg)
Experiment Goals
• Two Applications after Sampling:
  – Data Visualization: ParaView
  – Data Mining: K-means in MATE
• Goals:
  – Efficiency and accuracy with and without sampling
  – Accuracy of different sampling methods
  – Efficiency of different sampling methods
  – Compare predicted error with actual error
  – Speedup for sampling over a data subset
![Page 20: Support Data Sampling Using Bitmap Indices over Scientific Dataset](https://reader035.fdocuments.in/reader035/viewer/2022062315/56816257550346895dd2a6c5/html5/thumbnails/20.jpg)
Efficiency and Accuracy of Sampling over Ocean Data
• Data size: 11.2 GB TEMP
• Network Transfer Speed: 20 MB/s
• Speedup compared to original dataset:
  – 25%: 1.87
  – 12.5%: 3.72
  – 1%: 10.97
  – 0.1%: 31.62
• Error Metrics: Variances over Strides
• Value diffs between original and samples
• Information Loss Percent:
  – 25%: 0.39%
  – 12.5%: 0.56%
  – 1%: 0.91%
  – 0.1%: 1.18%
![Page 21: Support Data Sampling Using Bitmap Indices over Scientific Dataset](https://reader035.fdocuments.in/reader035/viewer/2022062315/56816257550346895dd2a6c5/html5/thumbnails/21.jpg)
Efficiency and Accuracy of Sampling over Cosmos Data
• Data size: 16 GB (VX, VY, VZ)
• Network Transfer Speed: 20 MB/s
• Speedup compared to original dataset:
  – 25%: 2.11
  – 12.5%: 4.30
  – 1%: 21.02
  – 0.1%: 60.14
• K-means: 20 clusters, 3 dims, 50 iterations
• MATE: 16 threads
• Error Metrics: Means of cluster centers
• Much better than other methods
![Page 22: Support Data Sampling Using Bitmap Indices over Scientific Dataset](https://reader035.fdocuments.in/reader035/viewer/2022062315/56816257550346895dd2a6c5/html5/thumbnails/22.jpg)
Absolute Mean Value Differences over Strides – 0.1%
![Page 23: Support Data Sampling Using Bitmap Indices over Scientific Dataset](https://reader035.fdocuments.in/reader035/viewer/2022062315/56816257550346895dd2a6c5/html5/thumbnails/23.jpg)
Absolute Histogram Value Differences – 0.1%
![Page 24: Support Data Sampling Using Bitmap Indices over Scientific Dataset](https://reader035.fdocuments.in/reader035/viewer/2022062315/56816257550346895dd2a6c5/html5/thumbnails/24.jpg)
Absolute QQPlot Value Differences – 0.1%
![Page 25: Support Data Sampling Using Bitmap Indices over Scientific Dataset](https://reader035.fdocuments.in/reader035/viewer/2022062315/56816257550346895dd2a6c5/html5/thumbnails/25.jpg)
Data Sampling Time
• Data size: 1.4 GB
• Our method: extra striding cost
• Compare: the small bin random method costs 1.19 – 3.98 times more time than the KDTree random method
![Page 26: Support Data Sampling Using Bitmap Indices over Scientific Dataset](https://reader035.fdocuments.in/reader035/viewer/2022062315/56816257550346895dd2a6c5/html5/thumbnails/26.jpg)
Error Calculation Time
• Error Prediction: O(m)
• Error Calculation
  – QQPlot: O(s log s)
  – Others: O(s)
• Compare: Error Prediction achieved a >28 times speedup compared with error calculation
![Page 27: Support Data Sampling Using Bitmap Indices over Scientific Dataset](https://reader035.fdocuments.in/reader035/viewer/2022062315/56816257550346895dd2a6c5/html5/thumbnails/27.jpg)
Total Time based on sampling times
• Depends on sampling times
• Comparison: Small bin methods achieved a 0.37 – 5.29 times speedup compared with KDTree random method
![Page 28: Support Data Sampling Using Bitmap Indices over Scientific Dataset](https://reader035.fdocuments.in/reader035/viewer/2022062315/56816257550346895dd2a6c5/html5/thumbnails/28.jpg)
Predicted Error vs. Actual Error
![Page 29: Support Data Sampling Using Bitmap Indices over Scientific Dataset](https://reader035.fdocuments.in/reader035/viewer/2022062315/56816257550346895dd2a6c5/html5/thumbnails/29.jpg)
Predicted Error vs. Actual Error
![Page 30: Support Data Sampling Using Bitmap Indices over Scientific Dataset](https://reader035.fdocuments.in/reader035/viewer/2022062315/56816257550346895dd2a6c5/html5/thumbnails/30.jpg)
Subsetting Optimization
Subset over Spatial IDs
Subset over values
• Smaller Index Loading Time
• Smaller Sampling Time
• Speedup: 2.28 - 21.54 for small bin, 2.25 - 13.56 for big bin

• Smaller Sampling Time
• Speedup: 1.37 - 2.48 for small bin, 1.67 - 3.02 for big bin
![Page 31: Support Data Sampling Using Bitmap Indices over Scientific Dataset](https://reader035.fdocuments.in/reader035/viewer/2022062315/56816257550346895dd2a6c5/html5/thumbnails/31.jpg)
Conclusion
• The 'Big Data' issue brings challenges;
• Data sampling is necessary for data analysis;
• Perform server-side sampling over bitmap indices;
• Error prediction and sampling based on subsets;
• Achieve good accuracy and efficiency.
![Page 32: Support Data Sampling Using Bitmap Indices over Scientific Dataset](https://reader035.fdocuments.in/reader035/viewer/2022062315/56816257550346895dd2a6c5/html5/thumbnails/32.jpg)
Thanks