Variations of the Star Schema Benchmark to Test the Effects of Data Skew on Query Performance

Variations of the Star Schema Benchmark to Test the Effects of Data Skew on Query PerformanceTILMANN RABL, MEIKEL POESS, HANS-ARNO JACOBSEN, PATRICK AND ELIZABETH O’NEIL

MIDDLEWARE SYSTEMSRESEARCH GROUP

MSRG.ORG

ICPE 2013, PRAGUE, 24/04/2013

RABL, POESS, JACOBSEN, O'NEIL, O'NEIL - SSB SKEW VARIATIONS 2

Real Life Data is Distributed Uniformly…

◦ Customers zip codes typically clustered around metropolitan areas◦ Seasonal items (lawn mowers, snow shovels, …) sold mostly during specific

periods◦ US retail sales:

◦ peak during Holiday Season◦ December sales are 2x of

January sales

Source: US Census Data

Well, Not Really


Student Seminar Signup Distribution


How Can Skew Effect Database Systems?

Data placement◦ Partitioning◦ Indexing

Data structures◦ Tree balance◦ Bucket fill ratio◦ Histograms

Optimizer finding the optimal query plan◦ Index vs. non-index driven plans◦ Hash join vs. merge join◦ Hash group by vs. sort group by


Agenda Data Skew in Current Benchmarks

Star Schema Benchmark (SSB)

Parallel Data Generation Framework (PDGF)

Introducing Skew in SSB


Data Skew in Benchmarks

TPC-D (1994-1999): only uniform data◦ SIGMOD 1997 - “Successor of TPC-D

should include data skew”◦ No effect until …

TPC-DS (released 2012)◦ Contains comparability zones ◦ Not fully utilized

TPC-D/H variations◦ Chaudhuri and Narayasa: Zipfian distribution on all columns◦ Crolotte and Ghazal: comparability zones

Still lots of open potential


Star Schema Benchmark I

Star schema version of TPC-H◦ Merged Order and Lineitem◦ Date dimension ◦ Dropped Partsupp◦ Selectivity hierarchies

◦ C_City C_Nation C_Region◦ …


Star Schema Benchmark II

Completely new set of queries

4 flights of 3-4 queries◦ Designed for functional coverage and selectivity coverage◦ Drill down in dimension hierarchies◦ Predefined selectivity

select sum(lo_extendedprice*lo_discount) as revenue from lineorder, date where lo_orderdate = d_datekey and d_year = 1993 and lo_discount between 1 and 3 and lo_quantity < 25;

select sum(lo_extendedprice*lo_discount) as revenue from lineorder, date where lo_orderdate = d_datekey and d_yearmonthnum = 199301 and lo_discount between 1 and 3 and lo_quantity between 26 and 35;

DrilldownQ1.1

Q1.2


Parallel Data Generation Framework

Generic data generation framework

Relational model◦ Schema specified in configuration file◦ Post-processing stage for alternative representations

Repeatable computation◦ Based on XORSHIFT random number generators◦ Hierarchical seeding strategy

Frank, Poess, and Rabl: Efficient Update Data Generation for DBMS Benchmarks. ICPE '12.Rabl and Poess: Parallel Data Generation for Performance Analysis of Large, Complex RDBMS. DBTest '11.Poess, Rabl, Frank, and Danisch: A PDGF Implementation for TPC-H. TPCTC '11.Rabl, Frank, Sergieh, and Kosch: A Data Generator for Cloud-Scale Benchmarking. TPCTC '10.


Configuring PDGF Schema configuration

Relational model◦ Tables, fields

Properties◦ Table size, characters, …

Generators◦ Simple generators◦ Metagenerators

Update definition◦ Insert, update, delete◦ Generated as change data capture

<table name="SUPPLIER"> <size>${S}</size> <field name="S_SUPPKEY" size="" type="NUMERIC“ primary="true" unique="true"> <gen_IdGenerator /> </field> <field name="S_NAME" size="25" type="VARCHAR"> <gen_PrePostfixGenerator> <gen_PaddingGenerator> <gen_OtherFieldValueGenerator> <reference field="S_SUPPKEY" /> </gen_OtherFieldValueGenerator > <character>0</character> <padToLeft>true</padToLeft> <size>9</size> </gen_PaddingGenerator > <prefix>Supplier </prefix> </gen_PrePostfixGenerator> </field>[..]

PDGFXML DB


Opportunities to Inject Data Skew in

Foreign key relations◦ E.g., L_PARTKEY

One fact table measures◦ E.g., L_Quantity

Single dimension hierarchy◦ E.g., P_Brand → P_Category → P_Mfgr

Multiple dimension hierarchies◦ E.g., City → Nation in Supplier and Customer

Experimental methodology◦ One experiment series for each of the above◦ Comparison to original SSB◦ Comparison of index-forced, non-index, and automatic optimizer mode◦ SSB scale factor 100 (100 GB), x86 server


Skew in Foreign Key Relations

Very realistic Easy to implement in PDGF

◦ Just add a distribution to the reference

But! Dimension attributes uniformly distributed Dimension keys uncorrelated to dimension attributes Very limited effect on selectivity Focus on attributes in selectivity predicates

<distribution name="Exponential“ lambda="0.26235" />


Lo_Quantity distribution◦ Values range between 0 and 50◦ Originally uniform distribution with:

◦ P(X=x)=0.02◦ Coefficient of variation of 0.00000557

◦ Proposed skewed distribution with:◦

Query 1.1◦ lo_quantity < x, x ∈ [2, 51]

Results◦ Switches too early to non-index plan◦ Switches too late to non-index plan◦ Optimizer agnostic to distribution

Skew in Fact Table Measure – Lo_Quantity

xxXP3.13.0)(


Skew in Single Dimension Hierarchy - Part

P_Category distribution◦ Uniform P(X=x)=0.04◦ Skewed P(X=x)= 0.01 - 48.36◦ Probabilities explicitly defined

Query 2.1 ◦ Restrictions on two dimensions

Results uniform case◦ Index driven superior◦ Optimizer chooses non-index driven

Results skewed case◦ Switches too early to non-index plan


Skewed S_City & C_City◦ Probabilites exponentially

distributed

Query 3.3◦ Restrictions on 3 dimensions◦ Variation on Supplier and Customer

city

Results uniform and skewed cases◦ Automatic plan performs best◦ Cross over between automatic

uniform and skewed too late

Skew in Multiple Dimension Hierarchies – S_City & C_City

Join Cardinality Elapsed Time


Conclusion & Future Work

PDGF implementation of SSB

Introduction of skew in SSB

Extensive performance analysis◦ Several interesting optimizer effects◦ Performance impact of skew

Future Work Further analysis on impact of skew

Skew in query generation

Complete suite for testing skew effects


Thanks

Questions?

Download and try PDGF:

http://www.paralleldatageneration.org

(scripts used in the study available on website above)

http://www.paralleldatageneration.org/


Back-up Slides

Configuring PDGF Generation

Generation configuration

Defines the output◦ Scheduling◦ Data format◦ Sorting◦ File name and location

Post processing◦ Filtering of values◦ Merging of tables◦ Splitting of tables◦ Templates (e.g. XML / queries)

<table name="QUERY_PARAMETERS" exclude="false" > <output name="CompiledTemplateOutput" > [..] <template ></template > </output ></table >

Variations of the Star Schema Benchmark to Test the Effects of Data Skew on Query Performance

Technology

Transcript of Variations of the Star Schema Benchmark to Test the Effects of Data Skew on Query Performance