Variations of the Star Schema Benchmark to Test the Effects of Data Skew on Query Performance

Variations of the Star Schema Benchmark to Test the Effects of Data Skew on Query Performance TILMANN RABL, MEIKEL POESS, HANS-ARNO JACOBSEN, PATRICK AND ELIZABETH O’NEIL MIDDLEWARE SYSTEMS RESEARCH GROUP MSRG.ORG ICPE 2013, PRAGUE, 24/04/2013

description

This presentation was given at ICPE 2013, Prague, 24/04/2013.

Abstract: The Star Schema Benchmark (SSB) has been widely used to evaluate the performance of database management systems when executing star schema queries. SSB, based on the well-known industry standard benchmark TPC-H, shares some of its drawbacks, most notably its uniform data distributions. Today’s systems rely heavily on sophisticated cost-based query optimizers to generate the most efficient query execution plans. A benchmark that evaluates an optimizer’s capability to generate optimal execution plans under all circumstances must provide the rich data set details on which optimizers rely (uniform and non-uniform distributions, data sparsity, etc.). This is also true for other database system parts, such as indices and operators, and ultimately holds for an end-to-end benchmark as well. SSB’s data generator, based on TPC-H’s dbgen, is not easy to adapt to different data distributions because its metadata and actual data generation implementations are not separated. In this paper, we motivate the need for a new revision of SSB that includes non-uniform data distributions. We list the specific modifications to SSB required to implement non-uniform data sets, and we demonstrate how to implement these modifications in the Parallel Data Generation Framework to generate both the data and query sets.

Transcript of Variations of the Star Schema Benchmark to Test the Effects of Data Skew on Query Performance

Page 1: Variations of the Star Schema Benchmark to Test the Effects of Data Skew on Query Performance

Variations of the Star Schema Benchmark to Test the Effects of Data Skew on Query Performance

TILMANN RABL, MEIKEL POESS, HANS-ARNO JACOBSEN, PATRICK AND ELIZABETH O’NEIL

MIDDLEWARE SYSTEMS RESEARCH GROUP

MSRG.ORG

ICPE 2013, PRAGUE, 24/04/2013

Page 2: Variations of the Star Schema Benchmark to Test the Effects of Data Skew on Query Performance

RABL, POESS, JACOBSEN, O'NEIL, O'NEIL - SSB SKEW VARIATIONS 2

Real Life Data is Distributed Uniformly… Well, Not Really

◦ Customers' zip codes typically clustered around metropolitan areas
◦ Seasonal items (lawn mowers, snow shovels, …) sold mostly during specific periods
◦ US retail sales:
  ◦ peak during Holiday Season
  ◦ December sales are 2x of January sales

Source: US Census Data

Page 3: Variations of the Star Schema Benchmark to Test the Effects of Data Skew on Query Performance

Student Seminar Signup Distribution

Page 4: Variations of the Star Schema Benchmark to Test the Effects of Data Skew on Query Performance

How Can Skew Affect Database Systems?

Data placement
◦ Partitioning
◦ Indexing

Data structures
◦ Tree balance
◦ Bucket fill ratio
◦ Histograms

Optimizer finding the optimal query plan
◦ Index vs. non-index driven plans
◦ Hash join vs. merge join
◦ Hash group by vs. sort group by

Page 5: Variations of the Star Schema Benchmark to Test the Effects of Data Skew on Query Performance

Agenda

Data Skew in Current Benchmarks

Star Schema Benchmark (SSB)

Parallel Data Generation Framework (PDGF)

Introducing Skew in SSB

Page 6: Variations of the Star Schema Benchmark to Test the Effects of Data Skew on Query Performance

Data Skew in Benchmarks

TPC-D (1994-1999): only uniform data
◦ SIGMOD 1997 - “Successor of TPC-D should include data skew”
◦ No effect until …

TPC-DS (released 2012)
◦ Contains comparability zones
◦ Not fully utilized

TPC-D/H variations
◦ Chaudhuri and Narasayya: Zipfian distribution on all columns
◦ Crolotte and Ghazal: comparability zones

Still lots of open potential

Page 7: Variations of the Star Schema Benchmark to Test the Effects of Data Skew on Query Performance

Star Schema Benchmark I

Star schema version of TPC-H
◦ Merged Order and Lineitem
◦ Date dimension
◦ Dropped Partsupp
◦ Selectivity hierarchies
  ◦ C_City → C_Nation → C_Region
  ◦ …

Page 8: Variations of the Star Schema Benchmark to Test the Effects of Data Skew on Query Performance

Star Schema Benchmark II

Completely new set of queries

4 flights of 3-4 queries
◦ Designed for functional coverage and selectivity coverage
◦ Drill down in dimension hierarchies
◦ Predefined selectivity

Q1.1

select sum(lo_extendedprice*lo_discount) as revenue
from lineorder, date
where lo_orderdate = d_datekey
  and d_year = 1993
  and lo_discount between 1 and 3
  and lo_quantity < 25;

Drilldown

Q1.2

select sum(lo_extendedprice*lo_discount) as revenue
from lineorder, date
where lo_orderdate = d_datekey
  and d_yearmonthnum = 199301
  and lo_discount between 1 and 3
  and lo_quantity between 26 and 35;

Page 9: Variations of the Star Schema Benchmark to Test the Effects of Data Skew on Query Performance

Parallel Data Generation Framework

Generic data generation framework

Relational model◦ Schema specified in configuration file◦ Post-processing stage for alternative representations

Repeatable computation◦ Based on XORSHIFT random number generators◦ Hierarchical seeding strategy
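The idea of repeatable computation with hierarchical seeding can be illustrated with a minimal Python sketch. It is not PDGF's actual code: `derive_seed`, `field_value`, and the mixing constant are hypothetical names and choices, and only the xorshift step itself (Marsaglia's 64-bit variant with shifts 13, 7, 17) is a standard algorithm. The sketch assumes a seed hierarchy of project, table, column, and row.

```python
MASK64 = (1 << 64) - 1
GOLDEN = 0x9E3779B97F4A7C15  # arbitrary odd mixing constant (an assumption)


def xorshift64(state: int) -> int:
    """One step of Marsaglia's 64-bit xorshift generator (shifts 13, 7, 17)."""
    state ^= (state << 13) & MASK64
    state ^= state >> 7
    state ^= (state << 17) & MASK64
    return state & MASK64


def derive_seed(parent_seed: int, index: int) -> int:
    """Derive a child seed from a parent seed and a child index (hypothetical scheme)."""
    mixed = (parent_seed ^ (((index + 1) * GOLDEN) & MASK64)) & MASK64
    if mixed == 0:  # xorshift state must never be zero
        mixed = GOLDEN
    return xorshift64(mixed)


def field_value(project_seed: int, table_id: int, column_id: int,
                row_id: int, modulus: int) -> int:
    """Compute the value of one cell directly from its coordinates."""
    table_seed = derive_seed(project_seed, table_id)
    column_seed = derive_seed(table_seed, column_id)
    row_seed = derive_seed(column_seed, row_id)
    return row_seed % modulus


# The same coordinates always yield the same value, in any order, on any worker.
value = field_value(12345, table_id=0, column_id=3, row_id=1_000_000, modulus=50)
```

Because each cell depends only on its coordinates and the project seed, a worker can generate row 1,000,000 without producing the rows before it, which is what makes this style of generation easy to parallelize and to repeat.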

Frank, Poess, and Rabl: Efficient Update Data Generation for DBMS Benchmarks. ICPE '12.
Rabl and Poess: Parallel Data Generation for Performance Analysis of Large, Complex RDBMS. DBTest '11.
Poess, Rabl, Frank, and Danisch: A PDGF Implementation for TPC-H. TPCTC '11.
Rabl, Frank, Sergieh, and Kosch: A Data Generator for Cloud-Scale Benchmarking. TPCTC '10.

Page 10: Variations of the Star Schema Benchmark to Test the Effects of Data Skew on Query Performance

Configuring PDGF

Schema configuration

Relational model
◦ Tables, fields

Properties
◦ Table size, characters, …

Generators
◦ Simple generators
◦ Metagenerators

Update definition
◦ Insert, update, delete
◦ Generated as change data capture

<table name="SUPPLIER">
  <size>${S}</size>
  <field name="S_SUPPKEY" size="" type="NUMERIC" primary="true" unique="true">
    <gen_IdGenerator />
  </field>
  <field name="S_NAME" size="25" type="VARCHAR">
    <gen_PrePostfixGenerator>
      <gen_PaddingGenerator>
        <gen_OtherFieldValueGenerator>
          <reference field="S_SUPPKEY" />
        </gen_OtherFieldValueGenerator>
        <character>0</character>
        <padToLeft>true</padToLeft>
        <size>9</size>
      </gen_PaddingGenerator>
      <prefix>Supplier </prefix>
    </gen_PrePostfixGenerator>
  </field>
  [..]

[Diagram labels: XML, PDGF, DB]

Page 11: Variations of the Star Schema Benchmark to Test the Effects of Data Skew on Query Performance

Opportunities to Inject Data Skew in SSB

Foreign key relations
◦ E.g., L_PARTKEY

One fact table measures
◦ E.g., L_Quantity

Single dimension hierarchy
◦ E.g., P_Brand → P_Category → P_Mfgr

Multiple dimension hierarchies
◦ E.g., City → Nation in Supplier and Customer

Experimental methodology
◦ One experiment series for each of the above
◦ Comparison to original SSB
◦ Comparison of index-forced, non-index, and automatic optimizer mode
◦ SSB scale factor 100 (100 GB), x86 server

Page 12: Variations of the Star Schema Benchmark to Test the Effects of Data Skew on Query Performance

Skew in Foreign Key Relations

Very realistic

Easy to implement in PDGF
◦ Just add a distribution to the reference

But!
◦ Dimension attributes uniformly distributed
◦ Dimension keys uncorrelated to dimension attributes
◦ Very limited effect on selectivity

Focus on attributes in selectivity predicates

<distribution name="Exponential" lambda="0.26235" />
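To give a feel for what an exponential distribution with lambda = 0.26235 does to a foreign key reference, here is a small Python sketch. How PDGF maps the distribution onto key values is not described on the slide, so the inverse-CDF mapping below, the function name `skewed_reference`, and the key range of 50 are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def skewed_reference(u: float, n_keys: int, lam: float = 0.26235) -> int:
    """Map a uniform number u in [0, 1) to a key in 1..n_keys.

    Uses the inverse CDF of an exponential distribution truncated to
    [0, n_keys), so low key values are referenced far more often.
    """
    norm = 1.0 - math.exp(-lam * n_keys)      # probability mass inside the range
    t = -math.log(1.0 - u * norm) / lam       # exponential variate in [0, n_keys)
    return min(n_keys, 1 + int(t))            # guard against rounding at the edge


rng = random.Random(42)                       # fixed seed for repeatability
keys = [skewed_reference(rng.random(), n_keys=50) for _ in range(100_000)]
counts = Counter(keys)
share_of_hottest_key = counts[1] / len(keys)  # fraction of rows referencing key 1
```

Under these assumptions roughly 1 - e^(-lambda) of all fact rows point at the first key. As the slide notes, this changes join fan-out but has little effect on predicate selectivity as long as the dimension attributes themselves remain uniform.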

Page 13: Variations of the Star Schema Benchmark to Test the Effects of Data Skew on Query Performance

Skew in Fact Table Measure – Lo_Quantity

Lo_Quantity distribution
◦ Values range between 0 and 50
◦ Originally uniform distribution with:
  ◦ P(X=x)=0.02
  ◦ Coefficient of variation of 0.00000557
◦ Proposed skewed distribution with:
  ◦ P(X=x) = 0.3 × 1.3^(-x)

Query 1.1
◦ lo_quantity < x, x ∈ [2, 51]

Results
◦ Switches too early to non-index plan
◦ Switches too late to non-index plan
◦ Optimizer agnostic to distribution
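The effect on the Query 1.1 predicate can be worked through numerically. The Python sketch below assumes that the skewed distribution is P(X=x) = 0.3 × 1.3^(-x) and that Lo_Quantity takes the 50 integer values 1 to 50 (which is what a uniform P(X=x) = 0.02 implies). The function names are hypothetical.

```python
def uniform_pmf(x: int) -> float:
    """Original SSB: each of the 50 quantity values is equally likely."""
    return 0.02 if 1 <= x <= 50 else 0.0


def skewed_pmf(x: int) -> float:
    """Assumed skewed distribution: P(X = x) = 0.3 * 1.3 ** (-x)."""
    return 0.3 * 1.3 ** (-x) if 1 <= x <= 50 else 0.0


def selectivity(pmf, bound: int) -> float:
    """Fraction of rows satisfying lo_quantity < bound."""
    return sum(pmf(x) for x in range(1, bound))


total_mass = sum(skewed_pmf(x) for x in range(1, 51))  # should be close to 1
uniform_sel = selectivity(uniform_pmf, 25)             # predicate from Q1.1
skewed_sel = selectivity(skewed_pmf, 25)
```

Under these assumptions the predicate `lo_quantity < 25` selects 48% of the rows in the uniform case but more than 99% in the skewed case, so an optimizer that estimates selectivity as if the data were uniform will misjudge when an index-driven plan stops paying off.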

Page 14: Variations of the Star Schema Benchmark to Test the Effects of Data Skew on Query Performance

Skew in Single Dimension Hierarchy - Part

P_Category distribution
◦ Uniform P(X=x)=0.04
◦ Skewed P(X=x)= 0.01 - 48.36
◦ Probabilities explicitly defined

Query 2.1
◦ Restrictions on two dimensions

Results uniform case
◦ Index driven superior
◦ Optimizer chooses non-index driven

Results skewed case
◦ Switches too early to non-index plan

Page 15: Variations of the Star Schema Benchmark to Test the Effects of Data Skew on Query Performance

Skew in Multiple Dimension Hierarchies – S_City & C_City

Skewed S_City & C_City
◦ Probabilities exponentially distributed

Query 3.3
◦ Restrictions on 3 dimensions
◦ Variation on Supplier and Customer city

Results uniform and skewed cases
◦ Automatic plan performs best
◦ Cross over between automatic uniform and skewed too late

[Charts: Join Cardinality; Elapsed Time]

Page 16: Variations of the Star Schema Benchmark to Test the Effects of Data Skew on Query Performance

Conclusion & Future Work

PDGF implementation of SSB

Introduction of skew in SSB

Extensive performance analysis
◦ Several interesting optimizer effects
◦ Performance impact of skew

Future Work

Further analysis on impact of skew

Skew in query generation

Complete suite for testing skew effects

Page 17: Variations of the Star Schema Benchmark to Test the Effects of Data Skew on Query Performance

Thanks

Questions?

Download and try PDGF:

http://www.paralleldatageneration.org

(scripts used in the study available on website above)

Page 18: Variations of the Star Schema Benchmark to Test the Effects of Data Skew on Query Performance

Back-up Slides

Page 19: Variations of the Star Schema Benchmark to Test the Effects of Data Skew on Query Performance

Configuring PDGF Generation

Generation configuration

Defines the output
◦ Scheduling
◦ Data format
◦ Sorting
◦ File name and location

Post processing
◦ Filtering of values
◦ Merging of tables
◦ Splitting of tables
◦ Templates (e.g. XML / queries)

<table name="QUERY_PARAMETERS" exclude="false">
  <output name="CompiledTemplateOutput">
    [..]
    <template><!--
      int y = (fields[0].getPlainValue()).intValue();
      int d = (fields[1].getPlainValue()).intValue();
      int q = (fields[2].getPlainValue()).intValue();
      String n = pdgf.util.Constants.DEFAULT_LINESEPARATOR;
      buffer.append("-- Q1.1" + n);
      buffer.append("select sum(lo_extendedprice *");
      buffer.append(" lo_discount) as revenue" + n);
      buffer.append(" from lineorder, date" + n);
      buffer.append(" where lo_orderdate = d_datekey" + n);
      buffer.append(" and d_year = " + y + n);
      buffer.append(" and lo_discount between " + (d - 1));
      buffer.append(" and " + (d + 1) + n);
      buffer.append(" and lo_quantity < " + q + ";" + n);
    --></template>
  </output>
</table>
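The template logic above can be mirrored outside PDGF as a rough Python sketch. The function `render_q11` is a hypothetical stand-in for the compiled template: it takes the three generated parameters (year, discount midpoint, quantity bound) and produces the Q1.1 text. It illustrates the idea only and is not part of PDGF.

```python
def render_q11(year: int, discount: int, quantity: int) -> str:
    """Render SSB query Q1.1 from three generated parameters.

    The discount range is built around the generated midpoint, as in the
    template: between (discount - 1) and (discount + 1).
    """
    lines = [
        "-- Q1.1",
        "select sum(lo_extendedprice * lo_discount) as revenue",
        " from lineorder, date",
        " where lo_orderdate = d_datekey",
        f" and d_year = {year}",
        f" and lo_discount between {discount - 1} and {discount + 1}",
        f" and lo_quantity < {quantity};",
    ]
    return "\n".join(lines) + "\n"


# Parameters chosen to reproduce the Q1.1 instance shown earlier in the deck.
query_text = render_q11(year=1993, discount=2, quantity=25)
```

Generating the query parameters with the same framework as the data means that skewed parameter distributions can be defined in the same way as skewed column distributions, which the deck lists as future work.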