Indexing and Parallel Query Processing Support for Visualizing Climate Datasets


Page 1: Indexing and Parallel Query Processing Support for Visualizing Climate Datasets

ICPP 2012

Indexing and Parallel Query Processing Support for Visualizing Climate Datasets

Yu Su*, Gagan Agrawal*, Jonathan Woodring†

*The Ohio State University
†Los Alamos National Laboratory

Page 2: Indexing and Parallel Query Processing Support for Visualizing Climate Datasets

Outline
• Motivation and Introduction
• Background
• System Overview and Optimization
• Experiment
• Conclusion

Page 3: Indexing and Parallel Query Processing Support for Visualizing Climate Datasets

Motivation

• Science is becoming increasingly data driven
• Strong desire for efficient data visualization
• Challenges:
– Fast data generation speed
– Slow disk I/O and network speed
– Worse performance during visualization
– Different kinds of subsetting requests
• Difficult and unnecessary to visualize all the data

Page 4: Indexing and Parallel Query Processing Support for Visualizing Climate Datasets

Data Subsetting in ParaView
• A widely used data analysis and visualization application
• Problems: Load + Filter mode
– Load the entire data set
– Data filtering at the visualization level
  • Threshold Filter: based on values
  • Extract Subset Filter: based on dimension info
– Grid transformation needed during filtering
  • Regular Structured Grid -> Unstructured Grid

Page 5: Indexing and Parallel Query Processing Support for Visualizing Climate Datasets

A Faster Solution
• Subset at the I/O level
– User specifies the subset in one query for both dimension and value ranges
– Reduced I/O time and memory footprint
• SQL queries in ParaView
– Query over dimensions: API support
– Query over values: indexing
• Bitmap indices and parallel bitmap indices
– Efficient subsetting over values

Page 6: Indexing and Parallel Query Processing Support for Visualizing Climate Datasets

Background: Bitmap Indexing
• FastBit: widely used in scientific data management
• Suitable for floating-point values by binning small ranges
• Run-length compression (WAH, BBC)
– Compress bitvectors based on consecutive runs of 0s or 1s
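
The binning and run-length ideas on this slide can be illustrated with a minimal Python sketch. It is not FastBit's implementation: the function names `build_binned_bitmap` and `run_length_encode`, the bin edges, and the sample values are all assumptions made for illustration, and the encoder is a plain run-length coder rather than the word-aligned (WAH) or byte-aligned (BBC) schemes named above.

```python
from bisect import bisect_right


def build_binned_bitmap(values, bin_edges):
    """Build one bitvector (list of 0/1) per bin.

    Bin i covers the half-open range [bin_edges[i], bin_edges[i + 1]).
    Values outside [bin_edges[0], bin_edges[-1]) set no bit at all.
    """
    n_bins = len(bin_edges) - 1
    bitmaps = [[0] * len(values) for _ in range(n_bins)]
    for pos, v in enumerate(values):
        b = bisect_right(bin_edges, v) - 1  # index of the bin containing v
        if 0 <= b < n_bins:
            bitmaps[b][pos] = 1
    return bitmaps


def run_length_encode(bits):
    """Collapse consecutive equal bits into (bit, run_length) pairs."""
    runs = []
    for bit in bits:
        if runs and runs[-1][0] == bit:
            runs[-1][1] += 1
        else:
            runs.append([bit, 1])
    return [tuple(r) for r in runs]


if __name__ == "__main__":
    # Hypothetical sample data: six float values, three bins of width 1.0.
    temps = [1.2, 1.7, 2.1, 2.4, 0.3, 0.8]
    edges = [0.0, 1.0, 2.0, 3.0]
    for i, bitmap in enumerate(build_binned_bitmap(temps, edges)):
        print(f"bin {i}: {bitmap} -> {run_length_encode(bitmap)}")
```

Neighboring array elements often fall into the same bin, so each bitvector tends to contain long runs, which is why run-length schemes compress these indices well.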

Page 7: Indexing and Parallel Query Processing Support for Visualizing Climate Datasets

Bitmap Index and Dim Subset
• Run-length compression (WAH, BBC)
– Good: compression rate, fast bitwise operations
– Bad: ability to locate the dim subset is lost
• Two traditional methods:
– With bitmap indices: post-filter on dim info
– Without bitmap indices: post-filter on values
• Two-phase optimization:
– Index generation: distributed indices over sub-blocks
– Index retrieval: transform dim subsetting info into bitvectors, and support fast bitwise operations
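
The index retrieval idea (express the dimension subset as a bitvector, then combine it with the value bitvector using a bitwise AND instead of post-filtering) can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' code: Python integers stand in for uncompressed bitvectors, the array is assumed to be stored in row-major order, and `dim_subset_mask`, `value_mask`, and `selected_positions` are hypothetical helper names. In a real system the value bitvector would come from the bitmap index rather than from scanning the data.

```python
def dim_subset_mask(shape, ranges):
    """Bitvector (as an int) with bit i set when the row-major flattened
    position i lies inside every half-open (lo, hi) range in `ranges`."""
    total = 1
    for size in shape:
        total *= size
    mask = 0
    for pos in range(total):
        rem = pos
        inside = True
        # Peel off indices from the fastest-varying (last) dimension first.
        for size, (lo, hi) in zip(reversed(shape), reversed(ranges)):
            idx = rem % size
            rem //= size
            if not (lo <= idx < hi):
                inside = False
                break
        if inside:
            mask |= 1 << pos
    return mask


def value_mask(values, lo, hi):
    """Stand-in for an index lookup: bit i is set when lo <= values[i] < hi."""
    mask = 0
    for pos, v in enumerate(values):
        if lo <= v < hi:
            mask |= 1 << pos
    return mask


def selected_positions(mask):
    """List the flattened positions whose bit is set."""
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]


if __name__ == "__main__":
    shape = (2, 3)                     # hypothetical 2 x 3 array
    values = [5, 1, 7, 2, 8, 3]        # flattened in row-major order
    dims = dim_subset_mask(shape, [(0, 2), (1, 3)])  # both rows, columns 1..2
    vals = value_mask(values, 3, 9)                  # 3 <= value < 9
    print(selected_positions(dims & vals))           # one AND, no post-filter
```

The point of the design is that the dimension condition and the value condition end up in the same representation, so the final subset is a single bitwise operation over bitvectors.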

Page 8: Indexing and Parallel Query Processing Support for Visualizing Climate Datasets

System Overview

Parse the SQL expression

Parse the metadata file

Generate Query Request

Index generation if the index has not been generated yet; index retrieval after that.

Page 9: Indexing and Parallel Query Processing Support for Visualizing Climate Datasets

Optimization 1: Distributed Index Generation

Study the relationship between queries and partitions.

Partition the data based on query preference.

Page 10: Indexing and Parallel Query Processing Support for Visualizing Climate Datasets

Index Partition Strategy
• α rate: participation rate of data elements
– α = number of elements involved in indexing / total data size
– Worst: all elements have to be involved
– Ideal: the elements involved are exactly the dim subset
• Partition strategies:
– Strategy 1: α is proportional to the dim subsetting percentage and inversely proportional to the number of partitions.
– Strategy 2: In general cases where subsetting over each dimension has a similar probability, the partition should have equal preference over each dim.
– Strategy 3: If queries only include a subset of dims, the partition should also be based on these dims.
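
One way to read the α rate is sketched below: every index partition that overlaps the dimension subset must be consulted in full, so α is the number of elements in the touched partitions divided by the total element count. This reading, the function name `alpha_rate`, and the assumption of equal-sized blocks that divide each dimension evenly are illustrative choices, not definitions taken from the slides.

```python
from itertools import product
from math import prod


def alpha_rate(shape, parts_per_dim, query):
    """Fraction of elements whose index partition overlaps the query.

    shape:         extent of each dimension
    parts_per_dim: number of equal-sized partitions along each dimension
                   (assumed to divide the extent evenly)
    query:         half-open (lo, hi) index range per dimension
    """
    block = [s // p for s, p in zip(shape, parts_per_dim)]
    touched = 0
    for block_idx in product(*(range(p) for p in parts_per_dim)):
        overlaps = all(
            b * size < hi and lo < (b + 1) * size
            for b, size, (lo, hi) in zip(block_idx, block, query)
        )
        if overlaps:
            touched += prod(block)  # the whole partition participates
    return touched / prod(shape)


if __name__ == "__main__":
    shape = (8, 8)
    query = [(0, 2), (0, 8)]  # subset over the first dimension only
    for parts in [(1, 1), (1, 4), (2, 2), (4, 1)]:
        print(parts, alpha_rate(shape, parts, query))
```

With the same number of partitions (four), splitting along the dimension the query actually restricts gives a much lower α than splitting along the other one, which is the intuition behind Strategy 3.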

Page 11: Indexing and Parallel Query Processing Support for Visualizing Climate Datasets

Optimization 2: Index Retrieval

Post-filter?

Page 12: Indexing and Parallel Query Processing Support for Visualizing Climate Datasets

Parallel Index Architecture

L1: data file

L2: variable

L3: data block

Page 13: Indexing and Parallel Query Processing Support for Visualizing Climate Datasets

Experiment Setup
• Goals:
– SQL subsetting vs. Load + Filter in ParaView
– Scalability of the parallel indexing method
– Indexing and partition strategy vs. FastQuery
• Dataset:
– Parallel Ocean Program
– Data size: 33.6 GB
– Data format: NetCDF (array based)
• Environment:
– IBM Xeon cluster, 8 cores, 2.53 GHz
– 12 GB memory

Page 14: Indexing and Parallel Query Processing Support for Visualizing Climate Datasets

Efficiency Comparison with Filtering in ParaView

• Data size: 5.6 GB
• Input: 400 queries
• Depends on subset percentage
• General index method is better than filtering when the data subset is < 60%
• Two-phase optimization achieved a 0.71 to 11.17 speedup compared with the filtering method

Index m1: Bitmap indexing, no optimization
Index m2: Use bitwise operation instead of post-filtering
Index m3: Use both bitwise operation and index partition
Filter: load all data + filter

Page 15: Indexing and Parallel Query Processing Support for Visualizing Climate Datasets

Memory Comparison with Filtering in ParaView

• Data size: 5.6 GB
• Input: 400 queries
• Depends on subset percentage
• General index method has a much smaller memory cost than the filtering method
• Two-phase optimization has only a small extra memory cost

Index m1: Bitmap indexing, no optimization
Index m2: Use bitwise operation instead of post-filtering
Index m3: Use both bitwise operation and index partition
Filter: load all data + filter

Page 16: Indexing and Parallel Query Processing Support for Visualizing Climate Datasets

Scalability with Different Proc#

• Data size: 8.4 GB
• Proc#: 6, 24, 48, 96
• Input: 100 queries
• X axis: subset percentage
• Y axis: time
• Each process takes care of one sub-block
• Good scalability as the number of processes increases

Page 17: Indexing and Parallel Query Processing Support for Visualizing Climate Datasets

Alpha Rate with Different Proc#

• Data size: 8.4 GB
• Proc#: 6, 24, 48, 96
• Input: 100 queries
• X axis: subset percentage
• Y axis: alpha rate
• More processes means more index partitions
• Good participation rate when selecting a smaller percentage of the data

Page 18: Indexing and Parallel Query Processing Support for Visualizing Climate Datasets

Alpha Rate and IO Access Times Comparison with FastQuery

• FastQuery:
– Builds a relational table view over a scientific dataset
– Difference: doesn’t consider multi-dimensional data features
• Data size: 8.4 GB, 48 processes
• Query type: value + 1st dim, value + 2nd dim, value + 3rd dim, overall
• Input: 100 queries for each query type

Page 19: Indexing and Parallel Query Processing Support for Visualizing Climate Datasets

Efficiency Comparison with FastQuery

• Data size: 8.4 GB
• Proc#: 48
• Input: 100 queries for each query type
• Achieved a 1.41 to 2.12 speedup compared with FastQuery

Page 21: Indexing and Parallel Query Processing Support for Visualizing Climate Datasets

Conclusion

• Big data issue in data analysis and visualization
• Find the exact data subset at the I/O level with an SQL interface and bitmap indexing
• A good speedup compared with the filtering method
• Data partition strategy and parallel indexing
• A good speedup compared with FastQuery

Page 22: Indexing and Parallel Query Processing Support for Visualizing Climate Datasets

Thanks