Multidimensional DB design, revolving TPC-H benchmark into OLAP bench


Page 1: Multidimensional DB design, revolving TPC-H benchmark into OLAP bench

19th International Conference on Data Management

COMAD’2013 @ Ahmedabad, India, 20th December 2013

Multidimensional Database Design via Schema Transformation

Turning TPC-H into the TPC-H*d Multidimensional Benchmark

Alfredo Cuzzocrea∗ and Rim Moussa‡

∗ ICAR-CNR & University of Calabria, Italy

‡ LaTICE, University of Tunis, Tunisia

Page 2: Multidimensional DB design, revolving TPC-H benchmark into OLAP bench

Context

Data Warehouse Systems

Multidimensional Databases

OLAP Technologies

Visual Analytics: BI dashboards, OLAP cubes, pivot tables, …

Page 3: Multidimensional DB design, revolving TPC-H benchmark into OLAP bench

Motivations & Goals

• Motivations
  – BI analysts require visual OLAP technologies
  – According to market watchers, such as Pringle & Company and Gartner, the market for BI platforms will remain one of the fastest-growing software markets in most regions
• Goals: propose, implement & test
  – A framework for MDB design based on business requirements
  – An appropriate benchmark for OLAP servers


Page 5: Multidimensional DB design, revolving TPC-H benchmark into OLAP bench

Outline
• Rules for MDB Design
• Turning the TPC-H Benchmark into TPC-H*d: a Multidimensional Benchmark
  – Initial MDB Schema
  – Performance Results
• Derived-data-based optimizations
  – Workload Taxonomy
  – Performance Results
• Related Work
• Conclusion

Page 6: Multidimensional DB design, revolving TPC-H benchmark into OLAP bench

Problem
• Given:
  – A relational warehouse schema
  – A workload composed of business queries W = {Q1, Q2, …, Qn}, where Qi is a parameterized query
• Multidimensional DB schema
  – How to define OLAP cubes?
    • Will there be a single cube or multiple cubes? Are there any rules for defining virtual cubes?
  – Which optimizations are suitable for performance tuning?
    • Derived data calculus & refresh?
    • Data partitioning and parallel cube building?

Page 7: Multidimensional DB design, revolving TPC-H benchmark into OLAP bench

Idea
Map each business query to an OLAP cube
  » Obtain an initial MDB schema composed of OLAP cubes
Run optimizations
  » Derived data: derived attributes, aggregate tables
  » Virtual cubes

Page 8: Multidimensional DB design, revolving TPC-H benchmark into OLAP bench

SQL Stmt Template

SELECT t1.col_i, t2.col_j, …, tn.col_k,
       aggregate_function(column) AS measure_1, …,
       aggregate_function(expression) AS measure_m
FROM table_1 t1, table_2 t2, …, table_n tn
WHERE ti.col_x operator $query_parameter$
  AND ti.col_y = tj.col_z
  AND …
GROUP BY t1.col_i, t2.col_j, …, tn.col_k;

operator: =, <, <=, >=, !=

aggregate_function: min, max, sum, avg, count, count-distinct, …
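To make the template concrete, here is a minimal sketch that instantiates it with Python's built-in sqlite3 module. The two toy tables, their rows and the parameter value 0.1 are assumptions made for this example; only the column names follow TPC-H.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders   (o_orderkey INTEGER, o_orderpriority TEXT);
    CREATE TABLE lineitem (l_orderkey INTEGER, l_extendedprice REAL, l_discount REAL);
    INSERT INTO orders   VALUES (1, '1-URGENT'), (2, '5-LOW');
    INSERT INTO lineitem VALUES (1, 100.0, 0.1), (1, 50.0, 0.0), (2, 200.0, 0.5);
""")

# One grouping column, two measures, one parameterized filter, one join predicate.
template_instance = """
    SELECT o.o_orderpriority,
           SUM(l.l_extendedprice)                      AS sum_price,
           SUM(l.l_extendedprice * (1 - l.l_discount)) AS revenue
    FROM orders o, lineitem l
    WHERE l.l_discount <= ?              -- $query_parameter$
      AND l.l_orderkey = o.o_orderkey    -- join predicate
    GROUP BY o.o_orderpriority
    ORDER BY o.o_orderpriority;
"""
rows = conn.execute(template_instance, (0.1,)).fetchall()
```

The grouping column is a candidate dimension attribute and the two aggregates are candidate measures, which is how the rules on the following slides read such a statement.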

Page 9: Multidimensional DB design, revolving TPC-H benchmark into OLAP bench

Rules for OLAP Cube Design -- Measures' Definition
• Measures feature aggregate functions, such as min, max, count, count-distinct, sum, avg, …
• Simple measure
  – Defined over a single attribute
  – Example: SUM(l_extendedprice)
• Measure expression
  – Defined over multiple attributes
  – Example: SUM(l_extendedprice * (1 - l_discount))
• Computed member
  – Defined over measures or measure expressions
  – Example: M1 = SUM(l_extendedprice), M2 = COUNT(l_orderkey), CM = M1 / M2
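The three kinds of measures can be illustrated with a few lines of plain Python. This is only a sketch over three invented lineitem rows; the variable names m1, m2, revenue and cm are chosen for the example.

```python
rows = [
    {"l_orderkey": 1, "l_extendedprice": 100.0, "l_discount": 0.10},
    {"l_orderkey": 1, "l_extendedprice": 50.0, "l_discount": 0.00},
    {"l_orderkey": 2, "l_extendedprice": 200.0, "l_discount": 0.50},
]

# Simple measure: aggregate over a single attribute, SUM(l_extendedprice).
m1 = sum(r["l_extendedprice"] for r in rows)

# Measure expression: aggregate over an expression of several attributes.
revenue = sum(r["l_extendedprice"] * (1 - r["l_discount"]) for r in rows)

# Second simple measure: COUNT(l_orderkey) counts rows with a non-null key.
m2 = sum(1 for r in rows if r["l_orderkey"] is not None)

# Computed member: defined over measures, not over base attributes.
cm = m1 / m2
```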

Page 10: Multidimensional DB design, revolving TPC-H benchmark into OLAP bench

Rules for OLAP Cube Design -- Fact Table Definition (1)
• All attributes involved in measures and measure expressions belong to the fact table!
  – Example: Q10 of the TPC-H benchmark

Page 11: Multidimensional DB design, revolving TPC-H benchmark into OLAP bench

Rules for OLAP Cube Design -- Fact Table Definition over multiple tables (2)
• Case: measurable attributes belong to different tables. The fact table is the join of the tables to which the measurable attributes belong!
  – Example 1: Q9 of the TPC-H benchmark, where l_extendedprice, l_discount and l_quantity belong to lineitem, and ps_supplycost belongs to partsupp.
  – The fact table is the join of the lineitem and partsupp tables. Select the attributes needed for joins with dimension tables (namely l_partkey, l_orderkey, l_suppkey) and the measurable attributes (namely l_extendedprice, l_discount, l_quantity, ps_supplycost).
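A minimal sketch of this rule, using Python's sqlite3 module: the fact table for the Q9 cube is expressed as a view over the join of lineitem and partsupp. The view name fact_c9 and the sample rows are assumptions made for the example; the join predicate and the profit expression are those of TPC-H Q9.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE lineitem (l_orderkey INTEGER, l_partkey INTEGER, l_suppkey INTEGER,
                           l_quantity REAL, l_extendedprice REAL, l_discount REAL);
    CREATE TABLE partsupp (ps_partkey INTEGER, ps_suppkey INTEGER, ps_supplycost REAL);
    INSERT INTO lineitem VALUES (1, 10, 7, 2, 100.0, 0.0), (2, 10, 8, 1, 60.0, 0.5);
    INSERT INTO partsupp VALUES (10, 7, 20.0), (10, 8, 25.0);

    -- Fact table: join keys for the dimensions plus the measurable attributes.
    CREATE VIEW fact_c9 AS
    SELECT l_partkey, l_orderkey, l_suppkey,
           l_extendedprice, l_discount, l_quantity, ps_supplycost
    FROM lineitem JOIN partsupp
      ON ps_partkey = l_partkey AND ps_suppkey = l_suppkey;
""")

# Q9's profit measure expression, evaluated over the joined fact table.
(profit,) = conn.execute(
    "SELECT SUM(l_extendedprice * (1 - l_discount) - ps_supplycost * l_quantity) "
    "FROM fact_c9"
).fetchone()
```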

Page 12: Multidimensional DB design, revolving TPC-H benchmark into OLAP bench

Rules for OLAP Cube Design -- Fact Table Definition over multiple tables (3)

Q9 SQL statement

Page 13: Multidimensional DB design, revolving TPC-H benchmark into OLAP bench

Rules for OLAP Cube Design -- Fact Table Definition over multiple tables (4)

Page 14: Multidimensional DB design, revolving TPC-H benchmark into OLAP bench

Rules for OLAP Cube Design -- Fact Table Definition & filters' processing (6)
• Filters' processing: the fact table is filtered using non-multidimensional predicates
  – Extract all filters involving the fact table from the WHERE clause, such as
    • (attr_i operator attr_j), where both attr_i and attr_j belong to the fact table,
    • (attr_k operator $fixed value$), such that attr_k belongs to the fact table,
    • [not] exists (select … from … where attr_k …), such that attr_k belongs to the fact table,
    • attr_k [not] in (list of fixed values), such that attr_k belongs to the fact table.
  – Example 1: Q10 of the TPC-H benchmark
  – Example 2: Q16 of the TPC-H benchmark
  – Example 3: Q21 of the TPC-H benchmark
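The filter kinds listed above can be sketched as views over a toy fact table with Python's sqlite3 module. The view names, the sample rows and the helper keys() are hypothetical; the predicate l_returnflag = 'R' is the one used by Q10.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE lineitem (l_orderkey INTEGER, l_returnflag TEXT,
                           l_commitdate TEXT, l_receiptdate TEXT, l_shipmode TEXT);
    INSERT INTO lineitem VALUES
        (1, 'R', '1995-01-10', '1995-01-15', 'MAIL'),
        (2, 'N', '1995-01-10', '1995-01-05', 'SHIP'),
        (3, 'R', '1995-02-01', '1995-01-20', 'AIR');

    -- (attr_k operator $fixed value$): keep returned lines only, as in Q10.
    CREATE VIEW fact_returned AS
    SELECT * FROM lineitem WHERE l_returnflag = 'R';

    -- (attr_i operator attr_j): both attributes belong to the fact table.
    CREATE VIEW fact_late AS
    SELECT * FROM lineitem WHERE l_commitdate < l_receiptdate;

    -- attr_k [not] in (list of fixed values).
    CREATE VIEW fact_by_mode AS
    SELECT * FROM lineitem WHERE l_shipmode IN ('MAIL', 'SHIP');
""")

def keys(view):
    """Return the order keys that survive the filter of the given view."""
    query = f"SELECT l_orderkey FROM {view} ORDER BY l_orderkey"
    return [k for (k,) in conn.execute(query)]

returned, late, by_mode = keys("fact_returned"), keys("fact_late"), keys("fact_by_mode")
```

Each view plays the role of a pre-filtered fact table: the predicate is applied once, outside the cube, rather than being modeled as a dimension.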


Page 16: Multidimensional DB design, revolving TPC-H benchmark into OLAP bench

Rules for OLAP Cube Design -- Fact Table Definition & filters' processing (7)

Q10 SQL statement

Page 17: Multidimensional DB design, revolving TPC-H benchmark into OLAP bench

Rules for OLAP Cube Design -- Fact Table Definition & filters' processing (8)

Q16 SQL statement

Page 18: Multidimensional DB design, revolving TPC-H benchmark into OLAP bench

Rules for OLAP Cube Design -- Fact Table Definition & filters' processing (9)

Q21 SQL statement

Page 19: Multidimensional DB design, revolving TPC-H benchmark into OLAP bench

Rules for OLAP Cube Design -- Dimensions' Definition (1)

First, consider all attributes in the SELECT, WHERE and GROUP BY clauses:
  – discard measurable attributes, which appear in measures,
  – discard attributes which appear in the WHERE clause and are used for joining tables or filtering the fact table,
  – compose the time dimension along well-known time hierarchies,
    • Year, quarter, month
  – compose the geography dimension along well-known location hierarchies,
    • Region, nation, city, district

Page 20: Multidimensional DB design, revolving TPC-H benchmark into OLAP bench

Rules for OLAP Cube Design -- Dimensions' Definition (2)
  – Example: Q10 of the TPC-H benchmark; all highlighted attributes are considered for building the dimensions!
    • The time dimension over o_orderdate requires order_year and order_quarter levels
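As a small illustration of how the two time levels can be derived from o_orderdate, here is a sketch in Python. The function name time_levels and the sample dates are assumptions made for the example.

```python
from datetime import date

def time_levels(order_date: date) -> tuple[int, int]:
    """Return the (order_year, order_quarter) levels for an o_orderdate value."""
    quarter = (order_date.month - 1) // 3 + 1
    return order_date.year, quarter

levels = [time_levels(d) for d in (date(1993, 10, 1), date(1994, 3, 31), date(1998, 8, 2))]
```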

Page 21: Multidimensional DB design, revolving TPC-H benchmark into OLAP bench

Rules for OLAP Cube Design -- Dimensions' Definition (3)

Second, find hierarchical relations, i.e., one-to-many relationships, and reorganize attributes along them to form the dimensions’ hierarchies.
  – Example: Q10 of the TPC-H benchmark
    • Each customer can be related to at most one nation, but a nation may be related to many customers:
      customer_dim: n_name (customer_nation) > c_custkey, c_name, c_acctbal, c_address, c_phone, c_comment
    • order_dim: order_year > order_quarter

Page 22: Multidimensional DB design, revolving TPC-H benchmark into OLAP bench

Rules for OLAP Cube Design -- Dimensions' Definition (4)

Third, distinguish levels from properties. Properties are functionally dependent on levels.
  – Example: Q10 of the TPC-H benchmark
    • For customer_dim, c_custkey is the level, and the c_name, c_acctbal, c_address, c_phone and c_comment attributes are all properties of the c_custkey level.
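Both the one-to-many test of the second step and the level/property test of the third step come down to checking functional dependencies. The sketch below does this over a handful of invented customer rows; the helper determines() is hypothetical and checks the dependency on the sample data only, not on the schema.

```python
def determines(rows, level, attribute):
    """True if each value of `level` maps to exactly one value of `attribute`."""
    seen = {}
    for row in rows:
        if seen.setdefault(row[level], row[attribute]) != row[attribute]:
            return False
    return True

customers = [
    {"c_custkey": 1, "c_name": "Customer#1", "n_name": "FRANCE"},
    {"c_custkey": 2, "c_name": "Customer#2", "n_name": "FRANCE"},
    {"c_custkey": 3, "c_name": "Customer#3", "n_name": "ITALY"},
]

# c_custkey -> c_name holds, so c_name can be a property of the c_custkey level.
name_is_property = determines(customers, "c_custkey", "c_name")

# c_custkey -> n_name holds but n_name -> c_custkey does not: one nation, many
# customers, hence n_name sits above c_custkey in the hierarchy.
nation_above_customer = (determines(customers, "c_custkey", "n_name")
                         and not determines(customers, "n_name", "c_custkey"))
```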

Page 23: Multidimensional DB design, revolving TPC-H benchmark into OLAP bench

Rules for OLAP Cube Design -- Dimensions' Definition & Filters' processing (5)
• Filters' processing: not all tuples in a dimension table should be considered, so filters defined over dimension tables that play no role in the multidimensional design must be extracted from the WHERE clause.
  – Example 1: Q12 of the TPC-H benchmark. The OLAP cube C12 counts the number of urgent and high-priority orders (high line count) and the number of orders that are neither urgent nor high priority (low line count), by line_ship_mode and line_receipt_year over orders facts, considering only lines such that commit_date < receipt_date and ship_date < commit_date.
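The C12 description can be sketched as a query with Python's sqlite3 module. The rows are invented, and the query is a simplified reading of Q12 (it omits Q12's ship-mode and receipt-date parameters); the priority labels '1-URGENT' and '2-HIGH' are the TPC-H values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders   (o_orderkey INTEGER, o_orderpriority TEXT);
    CREATE TABLE lineitem (l_orderkey INTEGER, l_shipmode TEXT, l_shipdate TEXT,
                           l_commitdate TEXT, l_receiptdate TEXT);
    INSERT INTO orders VALUES (1, '1-URGENT'), (2, '3-MEDIUM'), (3, '2-HIGH');
    INSERT INTO lineitem VALUES
        (1, 'MAIL', '1994-01-02', '1994-01-10', '1994-01-20'),
        (2, 'MAIL', '1994-02-01', '1994-02-05', '1994-02-09'),
        (3, 'MAIL', '1994-03-15', '1994-03-10', '1994-03-20');
""")

rows = conn.execute("""
    SELECT l_shipmode,
           CAST(strftime('%Y', l_receiptdate) AS INTEGER) AS line_receipt_year,
           SUM(CASE WHEN o_orderpriority IN ('1-URGENT', '2-HIGH')
                    THEN 1 ELSE 0 END) AS high_line_count,
           SUM(CASE WHEN o_orderpriority NOT IN ('1-URGENT', '2-HIGH')
                    THEN 1 ELSE 0 END) AS low_line_count
    FROM orders JOIN lineitem ON o_orderkey = l_orderkey
    WHERE l_commitdate < l_receiptdate   -- filters, not dimensions
      AND l_shipdate < l_commitdate
    GROUP BY l_shipmode, line_receipt_year
""").fetchall()
```

The two date comparisons restrict the lines that feed the cube but contribute no dimension, while l_shipmode and the receipt year become the cube's axes.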

Page 24: Multidimensional DB design, revolving TPC-H benchmark into OLAP bench

Rules for OLAP Cube Design -- Dimensions' Definition & Filters' processing (6)

Q12 SQL statement

Page 25: Multidimensional DB design, revolving TPC-H benchmark into OLAP bench

TPC-H*d -- Summary

● Multi-dimensional design of TPC-H benchmark

– Minimal changes to TPC-H relational DB schema

– Each SQL statement is mapped into an OLAP cube

● TPC-H*d Workload

– 23 MDX statements for OLAP cubes' run

– 23 MDX statements for OLAP queries' run

Page 26: Multidimensional DB design, revolving TPC-H benchmark into OLAP bench

TPC-H*d -- Screenshots of C10 and Q10 Pivot Tables

Page 27: Multidimensional DB design, revolving TPC-H benchmark into OLAP bench

Performance Results -- Software and Hardware Technologies

French Grid Platform G5K
• Sophia site
• Suno nodes: 32 GB of memory, 2 CPUs per node, 4 cores per CPU; each CPU is an Intel Xeon E5520 at 2.27 GHz

Relational DBMS: MySQL 5.1

OLAP client: JPivot

Servlet container

ROLAP server: Mondrian 3.5.0

Page 28: Multidimensional DB design, revolving TPC-H benchmark into OLAP bench

Performance Results -- TPC-H*d for SF=10

● Of the 22 business queries, 14 perform like Q1, 4 perform like Q10, 2 perform like Q11 and 2 perform like Q9.

● The system under test was unable to build the big cubes related to business queries Q3, Q9, Q10, Q13, Q18 and Q20, because of either memory leaks or system constraints (max crossjoin size: 2,147,483,647).

Business query   Query workload   Cube-Query workload
                                  cube         query
Q1               2,147.33         2,777.49     0.29
Q10              7,100.24         n/a          -
Q11              2,558.21         3,020.27     1,604.1
Q9               n/a              n/a          n/a

Page 29: Multidimensional DB design, revolving TPC-H benchmark into OLAP bench

Optimizations based on Derived Data -- Aggregate Tables (1)
• An aggregate table (a.k.a. materialized view) summarizes a large number of detail rows into information that has a coarser granularity, and so fewer rows.
  – Allows faster query processing
  – Requires refresh: incremental refresh or a total rebuild
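A minimal sketch of an aggregate table and its total rebuild, using Python's sqlite3 module. The table name agg_lineitem, the function total_rebuild and the sample rows are assumptions made for the example; SQLite has no native materialized views, so a plain table stands in for one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE lineitem (l_returnflag TEXT, l_linestatus TEXT, l_quantity REAL);
    INSERT INTO lineitem VALUES ('A', 'F', 10), ('A', 'F', 5), ('N', 'O', 7), ('R', 'F', 1);

    -- Aggregate table: one row per (l_returnflag, l_linestatus) instead of one per line.
    CREATE TABLE agg_lineitem AS
    SELECT l_returnflag, l_linestatus,
           SUM(l_quantity) AS sum_qty, COUNT(*) AS count_lines
    FROM lineitem
    GROUP BY l_returnflag, l_linestatus;
""")

def total_rebuild(conn):
    """Total rebuild after a warehouse refresh: recompute the whole aggregate."""
    conn.executescript("""
        DELETE FROM agg_lineitem;
        INSERT INTO agg_lineitem
        SELECT l_returnflag, l_linestatus, SUM(l_quantity), COUNT(*)
        FROM lineitem GROUP BY l_returnflag, l_linestatus;
    """)

detail_rows = conn.execute("SELECT COUNT(*) FROM lineitem").fetchone()[0]
agg_rows = conn.execute("SELECT COUNT(*) FROM agg_lineitem").fetchone()[0]
```

An incremental refresh would instead update only the groups touched by the new detail rows, trading simpler recomputation for less work per refresh.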

Page 30: Multidimensional DB design, revolving TPC-H benchmark into OLAP bench

Optimizations based on Derived Data -- Derived Attributes (1)
• Alter the warehouse schema and precompute attributes,
• This should yield gains in performance, CPU and I/O (some joins are no longer processed),
• Choose attributes which do not become stale after a data refresh, or whose refresh cost is negligible,
• Example of Q10
  – Add o_sumlostrevenue for each order,
  – This avoids the join of the LineItem and Orders relations, saving CPU and I/O.
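The Q10 example can be sketched with Python's sqlite3 module. The definition used here for o_sumlostrevenue (discounted price summed over the order's returned lines) is an assumption inferred from Q10's revenue expression and its l_returnflag = 'R' filter; the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders   (o_orderkey INTEGER PRIMARY KEY, o_custkey INTEGER);
    CREATE TABLE lineitem (l_orderkey INTEGER, l_extendedprice REAL,
                           l_discount REAL, l_returnflag TEXT);
    INSERT INTO orders VALUES (1, 100), (2, 100), (3, 200);
    INSERT INTO lineitem VALUES (1, 100.0, 0.5, 'R'), (1, 40.0, 0.0, 'N'),
                                (2, 80.0, 0.25, 'R'), (3, 10.0, 0.0, 'N');

    -- Alter the warehouse schema, then precompute the attribute once per order.
    ALTER TABLE orders ADD COLUMN o_sumlostrevenue REAL DEFAULT 0;
    UPDATE orders SET o_sumlostrevenue = (
        SELECT COALESCE(SUM(l_extendedprice * (1 - l_discount)), 0)
        FROM lineitem
        WHERE l_orderkey = o_orderkey AND l_returnflag = 'R');
""")

# The Q10-style measure now reads orders only; lineitem is no longer joined.
lost_by_customer = conn.execute("""
    SELECT o_custkey, SUM(o_sumlostrevenue) FROM orders
    GROUP BY o_custkey ORDER BY o_custkey
""").fetchall()
```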

Page 31: Multidimensional DB design, revolving TPC-H benchmark into OLAP bench

Optimizations based on Derived Data -- Derived Attributes vs. OLAP Cube design (1)

Original C10: CUSTOMER, ORDERS, LINEITEM, TIME, NATION

C10, with o_lost_revenue derived attribute: CUSTOMER, ORDERS, TIME, NATION

Page 32: Multidimensional DB design, revolving TPC-H benchmark into OLAP bench

Workload Taxonomy
• 2 variables
  – Cube dimensionality: the size of the cross product of the dimensions’ sizes
  – Cube density: the ratio of the size of the materialized view to the size of the cross product of the dimensions’ sizes
• Example 1: Q4 of the TPC-H benchmark
  – Cube dimensionality: order_years (7) × nbr_quarters (4) × order_priorities (5) × 1 measure (count orders) = 140
    • Cube size is SF independent
  – Cube density: 0.96 (more than 96% of the cells are not empty)
    • dense cube
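The two variables can be written down as a short Python sketch. The function names and the non-empty cell count of 135 are hypothetical, chosen only to illustrate a dense cube; the dimension sizes are those of the Q4 example.

```python
from math import prod

def cube_dimensionality(dimension_sizes, n_measures=1):
    """Size of the cross product of the dimensions, times the number of measures."""
    return prod(dimension_sizes) * n_measures

def cube_density(non_empty_cells, dimensionality):
    """Ratio of materialized (non-empty) cells to the cube dimensionality."""
    return non_empty_cells / dimensionality

# Q4: order_years (7) x quarters (4) x order_priorities (5) x 1 measure.
q4_cells = cube_dimensionality([7, 4, 5])

# Hypothetical count of non-empty cells, used only to illustrate a dense cube.
q4_density = cube_density(135, q4_cells)
```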

Page 33: Multidimensional DB design, revolving TPC-H benchmark into OLAP bench

Workload Taxonomy (2)
• Example 2: Q10 of the TPC-H benchmark
  – order_year (7) × nbr_quarters (4) × line_return_flag (3) × customer (SF × 150,000, grouped under 25 nations) × 1 measure = 12,600,000 × SF
    • SF dependent
  – with the restriction return_flag = R (returned parts): order_year (7) × nbr_quarters (4) × line_return_flag (1) × customer (SF × 150,000, grouped under 25 nations) × 1 measure = 4,200,000 × SF
  – Cube density: 0.12
    • sparse cube
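The arithmetic of this example can be checked with a few lines of Python. The function name q10_dimensionality is an assumption made for the sketch; the dimension sizes are those given above.

```python
def q10_dimensionality(scale_factor, return_flags=3):
    """Cells in the Q10 cube: years x quarters x return flags x customers x 1 measure."""
    order_years, quarters, measures = 7, 4, 1
    customers = int(150_000 * scale_factor)  # customer count grows linearly with SF
    return order_years * quarters * return_flags * customers * measures

full_sf1 = q10_dimensionality(1)
restricted_sf1 = q10_dimensionality(1, return_flags=1)  # return_flag = R only
full_sf10 = q10_dimensionality(10)
```

Unlike the Q4 cube, the cell count scales with SF because the customer dimension does.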

Page 34: Multidimensional DB design, revolving TPC-H benchmark into OLAP bench

Workload Taxonomy -- Recommendations

Features, recommendation and TPC-H business questions (OLAP cubes):

• Medium/high dimensionality; dense cube; result is independent of the TPC-H Scale Factor
  – Recommendation: build aggregate tables
  – Q1, Q3, Q4, Q5, Q6, Q7, Q8, Q12, Q13, Q14, Q16, Q19, Q22 (13 business questions)
• High dimensionality; sparse cube, few results, lots of empty cells
  – Recommendation: build aggregate tables
  – Q15, Q18 (2 business questions)
• High dimensionality; result depends on the Scale Factor
  – Recommendation: add derived attributes
  – Q2, Q9, Q10, Q11, Q17, Q20, Q21 (7 business questions)

Page 35: Multidimensional DB design, revolving TPC-H benchmark into OLAP bench

Performance Results -- TPC-H*d for SF=10 with derived data (ms)

With derived data:

Business query   Query workload   Cube-Query workload
                                  cube         query
Q1               1.10             1.39         0.21
Q21              2.28             27.38        0.78
Q10              329.01           n/a          -
Q11              2738.46          2723.85      1585.21
Q9               n/a              n/a          n/a

Without derived data:

Business query   Query workload   Cube-Query workload
                                  cube         query
Q1               2,147.33         2,777.49     0.29
Q21              578.09           855.46       0.15
Q10              7,100.24         n/a          -
Q11              2,558.21         3,020.27     1,604.1
Q9               n/a              n/a          n/a

● Response times improved in both workloads for the business queries for which aggregate tables were built.

● The impact of derived attributes is mixed. Performance results show good improvements for Q10 and Q21, and a small impact on Q11 (the saved operations are not complex).
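To compare the two sets of measurements, a speedup factor can be computed per query. The sketch below copies the query-workload response times reported on this slide; the dictionary layout and variable names are assumptions made for the example.

```python
# Query-workload response times from this slide: (without, with) derived data.
query_workload = {
    "Q1": (2147.33, 1.10),
    "Q21": (578.09, 2.28),
    "Q10": (7100.24, 329.01),
    "Q11": (2558.21, 2738.46),
}

# Speedup > 1 means the derived data helped; < 1 means the query got slower.
speedup = {q: before / after for q, (before, after) in query_workload.items()}
improved = sorted(q for q, s in speedup.items() if s > 1)
```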

Page 36: Multidimensional DB design, revolving TPC-H benchmark into OLAP bench

Related Work
• Methods for MDB design
  – Not generalized, not tested, no empirical tests conducted: Niemi et al. 2011, Romero et al. 2006, Nair et al. 2007
• Variants of the TPC-H benchmark
  – Star Schema Benchmark (SSB) by O'Neil et al. 2009
    • Turns the schema of the TPC-H benchmark into a star schema (one fact table and many dimensions)
    • Covers almost half of the TPC-H benchmark
    • SQL workload, no OLAP cubes
  – MS Analysis Services Test by La Brie et al. 2002
    • Turns Q4 into an OLAP cube (MDX statement)
    • Performance measurements

Page 37: Multidimensional DB design, revolving TPC-H benchmark into OLAP bench

Conclusion
• We provided a framework for MDB design
  – Tested on the TPC-H benchmark
  – TPC-H is the most prominent DSS benchmark
• Thorough experiments
  – Two workload types:
    • query workload
    • cube-then-query workload
  – Open-source ROLAP server
  – Different warehouse volumes: SF = 1, 10

Page 38: Multidimensional DB design, revolving TPC-H benchmark into OLAP bench

Future Work
• Test the framework with the TPC-DS workload
  – 7 data marts
  – About a hundred business queries
• Investigate other optimization methods for OLAP over Big Data
  – Approximate query processing through data synopsis calculus

Page 39: Multidimensional DB design, revolving TPC-H benchmark into OLAP bench

Thank you for your Attention

Q & A

Page 40: Multidimensional DB design, revolving TPC-H benchmark into OLAP bench

TPC-H*d -- DB Schema

Page 41: Multidimensional DB design, revolving TPC-H benchmark into OLAP bench

Optimizations based on Derived Data -- Q10 - Derived Attribute

Page 42: Multidimensional DB design, revolving TPC-H benchmark into OLAP bench

Performance Results -- Performance of Derived Data Calculus (ms)

Derived data                                                                                    Single DB backend
ps_isminimum (PartSupp, Supplier, Nation, Region are replicated)                                862.4
ps_excess_YYYY (PartSupp, Time are replicated and LineItem is fragmented into 4 fragments)      18,195.48
l_profit (LineItem is fragmented into 4 fragments)                                              4,377.51
agg_c1                                                                                          343.91
agg_c15                                                                                         10,904.00