MOLAP on Cloud
![Page 1: MOLAP on Cloud](https://reader035.fdocuments.in/reader035/viewer/2022062813/568164a6550346895dd6981f/html5/thumbnails/1.jpg)
Hongwei Zhao, Xiaojun Ye
MOLAP on Cloud
Interactive, Cluster Data Warehouse
Tsinghua University
![Page 2: MOLAP on Cloud](https://reader035.fdocuments.in/reader035/viewer/2022062813/568164a6550346895dd6981f/html5/thumbnails/2.jpg)
Motivation
Extend the cube model to support OLAP operations on Big Data:
• OLAP operations
• Interactive queries
![Page 3: MOLAP on Cloud](https://reader035.fdocuments.in/reader035/viewer/2022062813/568164a6550346895dd6981f/html5/thumbnails/3.jpg)
Outline
• Cube modelling
• Building and querying
• Experimenting
![Page 4: MOLAP on Cloud](https://reader035.fdocuments.in/reader035/viewer/2022062813/568164a6550346895dd6981f/html5/thumbnails/4.jpg)
Data Transform for Cube
[Figure: data transformation stages. Labels: TPC-DS tables, Star views, Cube data, User queries.]
![Page 5: MOLAP on Cloud](https://reader035.fdocuments.in/reader035/viewer/2022062813/568164a6550346895dd6981f/html5/thumbnails/5.jpg)
A Simplified Cube Model
[Figure: cube model. Labels: Cube Metadata (Dimension, Member); Cube Instance; Dimension Instance (Key, Member entries); Cuboid Instance (Key, Measure entries for nodes and cells). Cuboid lattice over dimensions A, B, C: ABC, AB, AC, BC, A, B, C, *. Annotations: Base Cuboids, Result.]
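The cuboid lattice on this slide (ABC, AB, AC, BC, A, B, C, *) can be enumerated mechanically as the subsets of the dimension set. The Python sketch below is only an illustration of that idea, not the authors' implementation; the function names `cuboid_lattice` and `cuboid_name` are hypothetical.

```python
from itertools import combinations


def cuboid_lattice(dimensions):
    """Return the cuboids grouped by level, from all dimensions down to none.

    Each cuboid is a tuple of dimension names; the empty tuple is the
    fully aggregated cuboid, written "*" on the slide.
    """
    levels = []
    for size in range(len(dimensions), -1, -1):
        levels.append(list(combinations(dimensions, size)))
    return levels


def cuboid_name(cuboid):
    """Render a cuboid the way the slide labels it, e.g. ('A', 'B') -> 'AB'."""
    return "".join(cuboid) if cuboid else "*"


lattice = cuboid_lattice(["A", "B", "C"])
names = [[cuboid_name(c) for c in level] for level in lattice]
for level in names:
    print(" ".join(level))
```

With n dimensions the lattice has 2^n cuboids, which is why the later slides build only the base cuboid up front and construct other cuboids at query time.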
![Page 6: MOLAP on Cloud](https://reader035.fdocuments.in/reader035/viewer/2022062813/568164a6550346895dd6981f/html5/thumbnails/6.jpg)
Example: TPC-DS Query 7
select i_item_id,
       avg(ss_quantity) agg1,
       avg(ss_list_price) agg2,
       avg(ss_coupon_amt) agg3,
       avg(ss_sales_price) agg4
from store_sales, customer_demographics, date_dim, item, promotion
where ss_sold_date_sk = d_date_sk
  and ss_item_sk = i_item_sk
  and ss_cdemo_sk = cd_demo_sk
  and ss_promo_sk = p_promo_sk
  and cd_gender = '[GEN]'
  and cd_marital_status = '[MS]'
  and cd_education_status = '[ES]'
  and (p_channel_email = 'N' or p_channel_event = 'N')
  and d_year = [YEAR]
group by i_item_id
order by i_item_id
![Page 7: MOLAP on Cloud](https://reader035.fdocuments.in/reader035/viewer/2022062813/568164a6550346895dd6981f/html5/thumbnails/7.jpg)
Relation Schema
[Figure: relation schema with the tables Store Sales, Date Dim, Item, Promotion, and Customer Demographics.]
![Page 8: MOLAP on Cloud](https://reader035.fdocuments.in/reader035/viewer/2022062813/568164a6550346895dd6981f/html5/thumbnails/8.jpg)
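Converting to BitKey

| Dimension A | Dimension B | Dimension C | Measure |
| --- | --- | --- | --- |
| A1 | B1 | C1 | M1 |
| A2 | B1 | C2 | M2 |
| A3 | B2 | C2 | M3 |

| Dimension Member | BitKey | Dimension Mask |
| --- | --- | --- |
| A1 | 000001 | 000001 |
| B1 | 000010 | 000010 |
| C1 | 000100 | 000100 |
| A2 | 001000 | 001001 |
| B1 | 000010 | 000010 |
| C2 | 010000 | 010100 |

| BitKeys | Value |
| --- | --- |
| 000111 | M1 |
| 011010 | M2 |

[Figure labels: Fact1, Fact2, Intermediate, Result1, Result2.]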
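The bit keys in this example are consistent with the following reading: each dimension owns a set of bit positions that grows on demand, a member's 1-based ordinal is written in binary across those positions, and a cell key is the OR of its member keys. The Python sketch below follows that reading only; the slides do not show the actual algorithm, and `BitKeyEncoder` and its methods are hypothetical names.

```python
class BitKeyEncoder:
    """Assumed encoding: per-dimension bit fields allocated on demand."""

    def __init__(self):
        self.next_bit = 0    # next unused global bit position
        self.positions = {}  # dimension -> owned bit positions, low to high
        self.ordinals = {}   # dimension -> {member: 1-based ordinal}

    def member_key(self, dim, member):
        ords = self.ordinals.setdefault(dim, {})
        pos = self.positions.setdefault(dim, [])
        if member not in ords:
            ords[member] = len(ords) + 1
        ordinal = ords[member]
        # Grow the dimension's bit field until the ordinal fits.
        while ordinal.bit_length() > len(pos):
            pos.append(self.next_bit)
            self.next_bit += 1
        # Spread the ordinal's binary digits over the owned positions.
        key = 0
        for i, p in enumerate(pos):
            if (ordinal >> i) & 1:
                key |= 1 << p
        return key

    def mask(self, dim):
        """OR of every bit position owned by the dimension."""
        return sum(1 << p for p in self.positions[dim])

    def cell_key(self, row):
        """OR of the member keys of one fact row."""
        key = 0
        for dim, member in row.items():
            key |= self.member_key(dim, member)
        return key


# The first two fact rows from the slide.
facts = [
    ({"A": "A1", "B": "B1", "C": "C1"}, "M1"),
    ({"A": "A2", "B": "B1", "C": "C2"}, "M2"),
]
enc = BitKeyEncoder()
cells = {format(enc.cell_key(row), "06b"): measure for row, measure in facts}
print(cells)
```

Under these assumptions the two cell keys and the masks for dimensions A and C match the values on the slide; whether the original system allocates bits in exactly this way is not confirmed by the source.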
![Page 9: MOLAP on Cloud](https://reader035.fdocuments.in/reader035/viewer/2022062813/568164a6550346895dd6981f/html5/thumbnails/9.jpg)
Cube Storage
Storage elements: Table, Region, ColumnFamily, Row, Column, Version, Value (Cell)

One table for dimension instance storage:
• Row Key: Dimension Name
• Column Family: Default
• Column: Member BitKey
• Value: Member Value

Multiple tables for cuboid instances:
• Table Name: Cuboid Name
• Row Key: Cell BitKey
• Column Family: Default
• Column: Measure Name
• Value: Measure Value
![Page 10: MOLAP on Cloud](https://reader035.fdocuments.in/reader035/viewer/2022062813/568164a6550346895dd6981f/html5/thumbnails/10.jpg)
MDX for query 7
select { i_item_id } on rows,
       { avg(ss_quantity), avg(ss_list_price), avg(ss_coupon_amt), avg(ss_sales_price) } on columns
from store_sales_cube
where (cd_gender.[Male], cd_marital_status.[Single], cd_education_status.[College], d_year.[2000])
![Page 11: MOLAP on Cloud](https://reader035.fdocuments.in/reader035/viewer/2022062813/568164a6550346895dd6981f/html5/thumbnails/11.jpg)
Cube Implementation

Base cuboid building in 4 stages:
• Dimension construction
• Hive query
• Aggregation
• Saving

Query execution in 4 stages:
• Dimension loading
• Construction of other cuboids
• Mapping
• Reducing
![Page 12: MOLAP on Cloud](https://reader035.fdocuments.in/reader035/viewer/2022062813/568164a6550346895dd6981f/html5/thumbnails/12.jpg)
OLAP System
[Figure: system architecture. Engine: one Dispatcher Node and several Worker Nodes holding cached cube data. Columnar Database: one Master Node and several Region Nodes.]

Cluster Framework: Dispatcher Node and Worker Nodes
• Dynamically distribute cube data onto the worker nodes
• Parallelize OLAP operations into a concurrent model
![Page 13: MOLAP on Cloud](https://reader035.fdocuments.in/reader035/viewer/2022062813/568164a6550346895dd6981f/html5/thumbnails/13.jpg)
Akka Actor
• State
• Behavior
• Mailbox
• Lifecycle
• Fault tolerance
![Page 14: MOLAP on Cloud](https://reader035.fdocuments.in/reader035/viewer/2022062813/568164a6550346895dd6981f/html5/thumbnails/14.jpg)
Execute Query
[Figure: actor message flow in four numbered steps among the Query Dispatcher, Cuboid Manager, Dimension Manager, Mapper, and Reducer. Message labels: require, dimension load, cuboid ready, data ready, extract query, hit cell.]

Actors for Query
• Load dimension members
• Build other cuboids
• Mapping
• Reducing
![Page 15: MOLAP on Cloud](https://reader035.fdocuments.in/reader035/viewer/2022062813/568164a6550346895dd6981f/html5/thumbnails/15.jpg)
Compiling & Mapping
Query 7 condition: GEN=M and MS=S and ES=College and YEAR=2000

| Dimension | Dimension Mask | Member | Member BitKey |
| --- | --- | --- | --- |
| GEN | 000000011 | Male | 000000010 |
| MS | 000011100 | Single | 000001100 |
| ES | 001100000 | College | 001000000 |
| YEAR | 110000000 | 2000 | 010000000 |

Mask: 111111111
FilterKey: 011001110

[Figure: Query Dispatcher connected to Mapper1, Mapper2, and Mapper3.]

For each cell in mapper {
  if ((key & mask) == FilterKey)
    send to Reducer
}
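The compile step on this slide amounts to OR-ing the dimension masks into one mask and the selected member keys into one filter key; each mapper then keeps the cells whose masked key equals the filter key. The Python sketch below reproduces that with the values from the slide. The cell contents, the `(count, sum)` measure layout, and the function names are assumptions made for illustration.

```python
# (dimension mask, selected member bit key) pairs taken from the slide.
CONDITIONS = [
    (0b000000011, 0b000000010),  # GEN  = Male
    (0b000011100, 0b000001100),  # MS   = Single
    (0b001100000, 0b001000000),  # ES   = College
    (0b110000000, 0b010000000),  # YEAR = 2000
]


def compile_filter(conditions):
    """Combine per-dimension conditions into one mask and one filter key."""
    mask = 0
    filter_key = 0
    for dim_mask, member_key in conditions:
        mask |= dim_mask
        filter_key |= member_key
    return mask, filter_key


def map_cells(cells, mask, filter_key):
    """Mapper: yield the measures of cells that match on the masked bits."""
    for key, measure in cells.items():
        if (key & mask) == filter_key:
            yield measure


def reduce_avg(measures):
    """Reducer: average over (count, sum) pairs."""
    count = total = 0
    for c, s in measures:
        count += c
        total += s
    return total / count if count else None


mask, filter_key = compile_filter(CONDITIONS)

# Hypothetical cells: bit key -> (count, sum) of one measure.
cells = {
    0b0011001110: (2, 10.0),  # matches the filter
    0b1011001110: (1, 5.0),   # matches; bit 9 lies outside the mask
    0b0011001101: (4, 8.0),   # GEN bits differ, rejected
    0b0010001110: (3, 9.0),   # ES bits differ, rejected
}
hits = list(map_cells(cells, mask, filter_key))
average = reduce_avg(hits)
print(format(mask, "09b"), format(filter_key, "09b"), average)
```

The explicit parentheses in `(key & mask) == filter_key` are a matter of clarity here; in languages such as C or Java, where `==` binds tighter than `&`, they are required for correctness.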
![Page 16: MOLAP on Cloud](https://reader035.fdocuments.in/reader035/viewer/2022062813/568164a6550346895dd6981f/html5/thumbnails/16.jpg)
Query Execution
[Figure: the Master exchanges messages and results with three Workers; the figure also shows Region 1 to 3 and Cache 1 to 3, one per Worker.]
• The master sends task messages to the workers
• Each worker caches the data of its region
• Subsequent tasks reuse the cached data
The first query on the 1G data set takes 48 seconds; subsequent queries with different parameters take 2.4 seconds.
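The drop from 48 to 2.4 seconds (a factor of 20) comes from each worker keeping its region data in memory after the first task. The Python sketch below shows only that caching pattern in a sequential, single-process form; the real system uses Akka actors and message passing, and the `Worker` and `Master` classes here are hypothetical.

```python
class Worker:
    """Loads its region once, then serves every later task from the cache."""

    def __init__(self, region_loader):
        self._load = region_loader  # callable that reads the region data
        self._cache = None
        self.loads = 0              # number of times the region was read

    def run(self, task):
        if self._cache is None:
            self._cache = self._load()
            self.loads += 1
        return task(self._cache)


class Master:
    """Sends the same task to every worker and combines the results."""

    def __init__(self, workers):
        self.workers = workers

    def execute(self, task, combine):
        return combine(worker.run(task) for worker in self.workers)


# Three toy regions standing in for cuboid data.
regions = [[1, 2], [3], [4, 5, 6]]
workers = [Worker(lambda r=r: r) for r in regions]
master = Master(workers)

first = master.execute(sum, sum)   # first query: every region is loaded
second = master.execute(max, max)  # later query: served from the caches
print(first, second, [w.loads for w in workers])
```

Only the first call pays the loading cost; later calls with different tasks touch the cached data only, which mirrors the behavior described on the slide.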
![Page 17: MOLAP on Cloud](https://reader035.fdocuments.in/reader035/viewer/2022062813/568164a6550346895dd6981f/html5/thumbnails/17.jpg)
Experiments on TPC-DS
[Figure: bar chart of fact records and cube cells for the 1G, 10G, and 100G data sets.]

| Data set | Fact records | Cube cells |
| --- | --- | --- |
| 1G | 2,653,108 | 1,836,162 |
| 10G | 26,532,571 | 10,190,922 |
| 100G | 265,325,821 | 41,892,286 |

4 nodes:
• 2 × Intel Xeon CPU E5-2630
• 4 × 600G 15,000 rpm SAS
• 256G RAM
• 10Gb network

Dimensions:
1. i_item_id
2. cd_gender
3. cd_marital_status
4. cd_education_status
5. p_channel_email
6. p_channel_event
7. d_year

Measures:
8. ss_quantity_avg
9. ss_list_price_avg
10. ss_coupon_amt_avg
11. ss_sales_price_avg
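The record and cell counts above show how much the base cuboid compresses the fact data as the data set grows. The short Python calculation below derives the cells-per-record ratio from the reported numbers; it adds no new measurements, and the variable names are arbitrary.

```python
# (fact records, cube cells) as reported for each TPC-DS data set size.
counts = {
    "1G": (2_653_108, 1_836_162),
    "10G": (26_532_571, 10_190_922),
    "100G": (265_325_821, 41_892_286),
}

# Fraction of cube cells per fact record; lower means more aggregation.
ratios = {size: cells / records for size, (records, cells) in counts.items()}

for size, ratio in ratios.items():
    print(f"{size}: {ratio:.1%} cells per fact record")
```

The ratio falls as the data grows, since more fact records share the same combination of dimension members and collapse into one cell.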
![Page 18: MOLAP on Cloud](https://reader035.fdocuments.in/reader035/viewer/2022062813/568164a6550346895dd6981f/html5/thumbnails/18.jpg)
Build Cube for Query 7
[Figure: horizontal bar chart of running time (seconds, 0 to 6000) by TPC-DS data size (1G, 10G, 100G), split into the querying, aggregating, and saving stages.]
• Partition by the largest dimension (i_item_id)
• In-memory aggregation
• The saving stage can be ignored (cache)
![Page 19: MOLAP on Cloud](https://reader035.fdocuments.in/reader035/viewer/2022062813/568164a6550346895dd6981f/html5/thumbnails/19.jpg)
Execute Query 7
[Figure: running time (seconds, 0 to 400) by iteration number (1 to 5) for 4, 8, and 16 workers.]

The first execution on the cube includes:
• Dimension loading
• Construction of other cuboids
• Caching
• Mapping
• Reducing

Subsequent executions include:
• Mapping
• Reducing
![Page 20: MOLAP on Cloud](https://reader035.fdocuments.in/reader035/viewer/2022062813/568164a6550346895dd6981f/html5/thumbnails/20.jpg)
Hive Query for Fact Data
select p_channel_email, p_channel_event, cd_gender, cd_marital_status, cd_education_status,
       i_item_id, d_year, ss_quantity, ss_list_price, ss_coupon_amt, ss_sales_price
from store_sales
  join date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
  join item on (store_sales.ss_item_sk = item.i_item_sk)
  join customer_demographics on (store_sales.ss_cdemo_sk = customer_demographics.cd_demo_sk)
  join promotion on (store_sales.ss_promo_sk = promotion.p_promo_sk)
![Page 21: MOLAP on Cloud](https://reader035.fdocuments.in/reader035/viewer/2022062813/568164a6550346895dd6981f/html5/thumbnails/21.jpg)
Compare with Hive
[Figure: two bar charts comparing Hive and the prototype on the 1G, 10G, and 100G data sets; vertical axis from 0 to 1400.]
• First query time compared with Hive: 2-3X
• Subsequent execution time compared with Hive: 30-50X
![Page 22: MOLAP on Cloud](https://reader035.fdocuments.in/reader035/viewer/2022062813/568164a6550346895dd6981f/html5/thumbnails/22.jpg)
Future work
• Cube Model:
  • Demand-driven & Data-driven
• Cube Data:
  • Model-driven & Requirement-driven
• More experiments on TPC-DS queries
  • Report, ad hoc, iterative, data mining
• MDX/XMLA compliance
![Page 23: MOLAP on Cloud](https://reader035.fdocuments.in/reader035/viewer/2022062813/568164a6550346895dd6981f/html5/thumbnails/23.jpg)
Thanks.
![Page 24: MOLAP on Cloud](https://reader035.fdocuments.in/reader035/viewer/2022062813/568164a6550346895dd6981f/html5/thumbnails/24.jpg)
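Storage for Example

Table: Dimension

| Row Key | Column (family: default) | Value |
| --- | --- | --- |
| Dimension A | Mask | 001001 |
| Dimension A | 000001 | A1 |
| Dimension A | 001000 | A2 |
| Dimension A | 001001 | A3 |
| Dimension B | Mask | 100010 |
| Dimension B | 000010 | B1 |
| Dimension B | 100000 | B2 |

Table: Cuboid_ABC

| Row Key | Column (family: default) | Value |
| --- | --- | --- |
| 000111 | Mea_count | 1 |
| 000111 | Mea_sum | M1 |
| 011010 | Mea_count | 1 |
| 011010 | Mea_sum | M2 |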