Shenglei Zhang ﹡ , Jiancheng Shi, Youjun Dou Xiaojun Yin, Liying Li, Chenzhou Liu
Hongwei Zhao, Xiaojun Ye MOLAP on Cloud Interactive, Cluster Data Warehouse Tsinghua University...
-
Upload
amos-jordan -
Category
Documents
-
view
215 -
download
0
Transcript of Hongwei Zhao, Xiaojun Ye MOLAP on Cloud Interactive, Cluster Data Warehouse Tsinghua University...
Hongwei Zhao, Xiaojun Ye
MOLAP on Cloud
Interactive, Cluster Data Warehouse
Tsinghua University [email protected], [email protected]
MotivationExtend the cube model to support OLAP operations on Big Data:»OLAP operations»Interactive queries
OutlineCube modelling
Building and querying
Experimenting
Data Transform for CubeTPC-DS tables Star views Cube data
User queries
A Simplified Cube Model
Cube Instance
Cuboid InstanceDimension
Instance
DimensionInstance
CubeMetadata
DimensionInstance
Cuboid Instance
KeyMemb
erKeyMemb
erKeyDimensi
on Member
KeyMeasure NodeKeyMeasure NodeKeyMeasure Cell
ABC
AB
A
AC
B
BC
C
*
Base Cuboids
Result
Example: TPC-DS Query7select i_item_id, avg(ss_quantity) agg1, avg(ss_list_price) agg2, avg(ss_coupon_amt) agg3, avg(ss_sales_price) agg4 from store_sales, customer_demographics, date_dim, item, promotionwhere ss_sold_date_sk = d_date_sk and ss_item_sk = i_item_sk and ss_cdemo_sk = cd_demo_sk and ss_promo_sk = p_promo_sk and cd_gender = '[GEN]' and cd_marital_status = '[MS]' and cd_education_status = '[ES]' and (p_channel_email = 'N' or p_channel_event = 'N') and d_year = [YEAR] group by i_item_id order by i_item_id
Relation Schema
Store Sales
Date Dim
Item Promotion
Customer Demographic
s
Converting to BitKeyDimensio
n ADimension B
Dimension C
Measure
A1 B1 C1 M1
A2 B1 C2 M2
A3 B2 C2 M3
Dimension Member
BitKey Dimension Mask
A1 000001 000001
B1 000010 000010
C1 000100 000100A2 001000 001001
B1 000010 000010
C2 010000 010100
BitKeys
Value
000111 M1011010 M2Result2
Result1
Intermediate
Result1
Fact1
Fact2
Intermediate
Result1
Cube StorageTable
Region
ColumnFamily
Row
Column
Version
Value
Cell
One table for dimension instances storage:
Row Key Dimension Name
Column Family
Default
Column Member BitKey
Value Member ValueMultiple tables for cuboids instances
Table Name Cuboid Name
Row Key Cell BitKey
Column Family
Default
Column Measure Name
Value Measure Value
MDX for query 7select { i_item_id } on rows,
{ avg(ss_quantity), avg(ss_list_price),avg(ss_coupon_amt),
avg(ss_sales_price) } on columns
from store_sales_cubewhere (cd_gender .[Male], cd_marital_status .[Single], cd_education_status .[College],
d_year.[2000])
Cube Implementation
Base cuboid building with 4 stages:Dimension constructingHive queryAggregationSaving
Query execution with 4 stages:Loading dimensionOther cuboid constructingMappingReducing
OLAP System
Engin
eC
olu
mnar
Data
base
Master Node
Region NodeRegion
Node
Dispatcher Node
Worker Node
Region Node
Worker NodeWorker Node
cachedat
a
Cube data
Cluster FrameworkDispatcher Node
Worker Nodes
• Distribute dynamically cubes data onto worker nodes
• Parallelize OLAP operations into a concurrent model
Actor of AkkaStateBehaviorMailbox
Lifecycle
Fault tolerance
Execute Query
Query Dispatch
er
Cuboid Manager
Dimension
Manager
Mapper Reducer
1 2
34
require
Cuboid ready
Dimension load
data ready
Extract Query
Hit Cell
Hit Cell
Actors for Query
• Load dimension members
• Build other cuboids• Mapping• Reducing
Compiling & MappingQuery 7 Condition: GEN=M and MS=S and ES=College and YEAR=2000
GEN Mask: 000000011 Male 000000010MS Mask: 000011100 Single :000001100ES Mask: 001100000College: 001000000YEAR Mask: 110000000 2000:010000000
Mask: 111111111FilterKey: 011001110
Query Dispatch
er
Mapper1
Mapper2
Mapper3
For each cell in mapper{ If (key & mask
== Filter Key) Send to Reducer}
Region 1
Region 2
Region 3
Worker
Worker
Worker
Master
messages
results
Cache 1
Cache 2
Cache 3
Query Execution
• Master sends task messages to workers
• Each worker caches each region data
• Sequential tasks reuse the cache data
First query on 1G consume 48 secs, the following queries with various parameters consume 2.4 secs
Experiments On TPC-DS
1g 10g 100g0
50000000
100000000
150000000
200000000
250000000
300000000
fact recordscells
1G 10G 100G
records number
2,653,108
26,532,571
265,325,821
cube cell number
1,836,162
10,190,922
41,892,286
4 nodes:• 2*Intel Xeon CPU E5-2630• 4*600G 15000r/s SAS • 256G RAM• 10Gb Network
Dimensions:1. "i_item_id", 2. "cd_gender", 3. "cd_marital_status", 4. "cd_education_status", 5. "p_channel_email", 6. "p_channel_event", 7. "d_year“Measures: 8. ss_quantity_avg,9. ss_list_price_avg, 10. ss_coupon_amt_avg, 11. ss_sales_price_avg
Build Cube for Query 7
1G
10G
100G
0 1000 2000 3000 4000 5000 6000
queryingaggregatingSaving
running time (seconds)
TP
C-D
S d
ata
siz
e
• Partition by the largest Dimension(i_item_id)
• In-Memory aggregation• Saving stage can be
ignore(cache)
1 2 3 4 50
50
100
150
200
250
300
350
400
4 workers8 workers16 workers
iteration number
run
nin
g t
ime
(seco
nd
s)
Execute Query 7First execution on the cube includes • Dimension loading• other cuboids construction • Caching• Mapping• Reducing
Sequential execution includes:• Mapping• Reducing
Hive Query for Fact Data select p_channel_email, p_channel_event, cd_gender, cd_marital_status, cd_education_status, i_item_id,d_year, ss_quantity, ss_list_price, ss_coupon_amt, ss_sales_price from store_sales
join date_dim on (store_sales.ss_sold_date_sk
= date_dim.d_date_sk) join item on (store_sales.ss_item_sk =
item.i_item_sk) join customer_demographics on
(store_sales.ss_cdemo_sk = custom-er_demographics.cd_demo_sk)
join promotion on (store_sales.ss_promo_sk = promotion.p_promo_sk)
Compare with Hive
1G 10G 100G0
200
400
600
800
1000
1200
1400
hiveprototype
1G 10G 100G0
200
400
600
800
1000
1200
1400
hiveprototype
First query time compare:2-3X
Sequential execution time:30-50X
Future work• Cube Model:
• Demand-driven & Data-driven
• Cube Data: • Model-driven & Requirement-driven
• More experiments on TPC-DS queries• Report, ad hoc, iterative, data mining,
• MDX/XMLA compliance
Thanks.
Storage for Example
Row Key
Column Family: default
Dimension A
Mask 000001 001000 001001
001001 A1 A2 A3
Dimension B
Mask 000010 100000
100010 B1 B2
Row Key Column Family: default
000111 Mea_count Mea_sum
1 M1
011010 Mea_count Mea_sum
1 M2
Table: Dimension
Table: Cuboid_ABC