Dan Han and Eleni StrouliaUniversity of [email protected]://ssrg.cs.ualberta.ca
104/12/23 Cloud 2013
04/12/23 2Cloud 2013
The General Research ProblemThe Geospatial Problem Instance
The Data Set HBase data-organization alternatives Performance analysis
Some Lessons Learned
04/12/23 3Cloud 2013
04/12/23 4Cloud 2013
04/12/23 Cloud 2013 5
04/12/23 6Cloud 2013
Appropriate data models for time-series (MESOCA 2012) Geospatial (CLOUD 2013)applications
In progress: spatio-temporal applications
04/12/23 7Cloud 2013
04/12/23 9Cloud 2013
04/12/23 10Cloud 2013
[1] built a multi-dimensional index layer on top of a one-dimensional key-value store HBase to perform spatial queries.
[2] presented a novel key formulation schema, based on R+-tree for spatial index in HBase.
Focus on row-key designno discussion about column and version design
04/12/23 11
[1] Shoji Nishimura, Sudipto Das, Divyakant Agrawal, Amr El Abbadi: MD-HBase: A Scalable Multi-dimensional Data Infrastructure for Location Aware Services. Mobile Data Management (1) 2011: 7-16[2] Ya-Ting Hsu, Yi-Chin Pan, Ling-Yin Wei, Wen-Chih Peng, Wang-Chien Lee: Key Formulation Schemes for Spatial Index in Cloud Data Managements. MDM 2012: 21-26
Cloud 2013
Two Synthetic Datasets Uniform and ZipF distribution Based on Bixi dataset, each object includes
▪ station ID, ▪ latitude, longitude, station name, terminal name, ▪ number of docks▪ number of bikes
100 Million objects (70GB) in a 100km*100km simulated space
04/12/23 12Cloud 2013
Regular Grid Indexing Row key: Grid rowID Column: Grid columnID Version: counter of Objects Value: one object in JSON format
04/12/23 13
Coun
ter
Column ID
Row
ID
00 01 02 03
00
01
02
03
Cloud 2013
Tie-based quad-tree Indexing Z-value Linearization Rowkey: Z-value Column: Object ID Value: one object in JSON Format
04/12/23 14
Z-Value
Object IDZ-value
Cloud 2013
Quad-Tree data model More rows with deeper
tree Z-ordering linearization
(violates data locality) In-time construction vs.
pre-construction implies a tradeoff between query performance and memory allocation
Regular Grid data model Very easy to locate a
cell by row id and column id
Cannot handle large space and fine-grained grid because in-memory indexes are subject to memory constraints
04/12/23 15
How much unrelated data is examined in a query matters a lot!
Cloud 2013
04/12/23 16
Obj
ect Att
ribu
te
Columnid-ObjectId
QTId
-Row
Id
A A A
A A A
A A A
B B B
B B B
B B B
C C C
C C C
C C C
D D D
D D D
D D D
00
0111
10
01 02 03 01 02 03
Space
Cloud 2013
04/12/23 17Cloud 2013
The row key is the QT Z-value + the RG row index.
The row key is the QT Z-value + the RG row index.
The column name is the RG column and the object-ID
The column name is the RG column and the object-ID
The attributes of the data point are stored in the third dimension.
The attributes of the data point are stored in the third dimension.
1. Compute minimum bounding square based on the query input location and the range
2. Compute the quad-tree tiles that overlap with the bounding square Z-codes
3. Compute all the regular-grid cells indexes in these quad-tree tiles the secondary index of rows and columns
4. Issue one sub-query for each selected tile of the quad-tree; process with user-level coprocessors on the HBase regions
5. Collect the results of the sub-queries at the client-side
04/12/23 18Cloud 2013
04/12/23 20Cloud 2013
04/12/23 21
00
02
04
06
00
02
04
06
Cloud 2013
04/12/23 22
00
02
04
06
00
02
04
06
09-00
09-04
Cloud 2013
1. Estimate the search range (density-based range estimation)
2. Compute indices of rows and columns (steps 2 and 3 of Range Query)
3. Issue a scan query to retrieve the relevant data points
4. If fewer than K data points are returned, re-estimate the search range and repeat steps 2-3
5. Sort the return set in increasing distance from the input location
04/12/23 23Cloud 2013
Experiment Environment A four-node cluster on virtual machines with
Ubuntu on OpenStack Hadoop 1.0.2 (replication factor is 2), HBase 0.94 HBase Configuration
▪ 5K Caching Size▪ Block cache is true▪ ROWCOL bloom filter
Query processing Implementation Native java API User-Level Coprocessor Implementation04/12/23 24Cloud 2013
The granularity of grid affects query-processing performance
Explore the “best” cell configuration of each model Quad-tree=>(t= 1) RG=>(t=0.1) HGrid=>(T=10,t=0.1)
04/12/23 25Cloud 2013
HG:≈10:0.1 fewer sub-queries more false positives
HG:≈1:0.1 more sub-queries fewer false positives
HG:≈10:0.01 more rows
HG:≈10:0.1 fewer rows
04/12/23 26
Given a location and a radius, Return the data points, located within
a distance less or equal to the radius from the input location
Cloud 2013
Given the coordinates of a location,Return the K points nearest to the
location
04/12/23 27Cloud 2013
04/12/23 28Cloud 2013
04/12/23 29Cloud 2013
Data Organization Short row key and column name Better to have one column family and few columns Not large amount of data in one row Row key design should ease pruning unrelated data 3rd dimension can store data as well Bloom Filter should be configured to prune rows and
columns Compression can reduce the amount of data
transmission
04/12/23 30Cloud 2013
Query Processing Scanned rows for one query should not exceed the
scan cache size, otherwise, split the query into sub-queries.
“Scan” is better than “Get” for retrieving discontinuous keys, even though the unrelated data
“Scan” for small queries, while Coprocessor for large queries
Better to split one large query into multiple sub-queries than use one query with row filter mechanism
04/12/23 31Cloud 2013
Benefits from the good locality of the RG index; suffers from the poor locality of the z-ordering QT linearization Performance could be improved with other linearization
techniques Can be flexibly configured and extended
The QT index can be replaced by the hash code of each sub-space
The granularity in the second stage can be varied from sub-space to sub-space based on the various densities
Is more suitable for homogeneously covered and discontinuous spaces
04/12/23 32Cloud 2013
A Data Model for spatio-temporal dataset
Towards a General Systematic Guidance for Column Families and other NoSQL databases
To apply the data model into cloud-based applications and big data analytics system
04/12/23 33Cloud 2013
Top Related