Post on 10-May-2015
TPC-H Performance MPP & Column Store
What is TPC-H?
The TPC Benchmark™ H (TPC-H) is a decision support benchmark. It consists of a suite of business-oriented ad-hoc queries and concurrent data modifications. The queries and the data populating the database have been chosen to have broad industry-wide relevance while maintaining a sufficient degree of ease of implementation. The benchmark illustrates decision support systems that:
Examine large volumes of data
Execute queries with a high degree of complexity
Give answers to critical business questions
The performance metric reported by TPC-H is called the TPC-H Composite Query-per-Hour Performance Metric (QphH@Size), and reflects multiple aspects of the capability of the system to process queries. These aspects include the selected database size against which the queries are executed, the query processing power when queries are submitted by a single stream and the query throughput when queries are submitted by multiple concurrent users.
Overview
TPC-H Schema overview
TPC-H Performance measurements
Partner engagement
TPC-H where is it today
TPC-H challenges
Looking ahead
Q&A
TPC-H Schema overview: Relationships between tables
TPC-H Schema overview : MPP data distribution

Example placement of key values across three nodes (collocated keys land on the same node; mismatched keys require over-the-network data movement):

Table     Column     Node 1  Node 2  Node 3
LINEITEM  ORDERKEY   1       2       3
          PARTKEY    6       4       8
          SUPPKEY    3       18      5
ORDERS    ORDERKEY   1       2       3
          CUSTKEY    4       2       9
PARTSUPP  PARTKEY    1       2       3
          SUPPKEY    4       5       6
PART      PARTKEY    1       2       3
CUSTOMER  CUSTKEY    1       2       3
SUPPLIER  SUPPKEY    1..N    1..N    1..N  (replicated)
NATION    NATIONKEY  1..N    1..N    1..N  (replicated)
REGION    REGIONKEY  1..N    1..N    1..N  (replicated)

Distribution choices:

Table     Distribution column
LINEITEM  L_ORDERKEY
ORDERS    O_ORDERKEY
PARTSUPP  PS_PARTKEY
PART      P_PARTKEY
CUSTOMER  C_CUSTKEY
SUPPLIER  REPLICATED
NATION    REPLICATED
REGION    REPLICATED
TPC-H Metrics: Power

Run order: RF1 (inserts into LINEITEM and ORDERS), the 22 read-only queries, RF2 (deletes from LINEITEM and ORDERS)
Metric: query-per-hour rate, TPC-H Power@Size = 3600 * SF / Geomean(22 queries, RF1, RF2)
The geometric mean is taken over all timings in the run, so a performance improvement to any query improves the metric equally
TPC-H Metrics: Throughput

Run order: N concurrent Power query streams with different substitution parameters; N RF1 & RF2 pairs, run either in parallel with the query streams or after them
Metric: ratio of the total number of queries executed to the length of the measurement interval, TPC-H Throughput@Size = (S * 22 * 3600 / Ts) * SF
Absolute runtime matters, so optimizing the longest-running query helps most
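The two metric formulas above can be sketched directly. This is an illustrative calculation only; the timings, scale factor, and stream count below are made-up numbers, not published results.

```python
from math import prod

def power_at_size(sf, timings_sec):
    """TPC-H Power@Size = 3600 * SF / geometric mean of the 24 timings
    (22 queries + RF1 + RF2), all timings in seconds."""
    assert len(timings_sec) == 24
    geomean = prod(timings_sec) ** (1.0 / len(timings_sec))
    return 3600.0 * sf / geomean

def throughput_at_size(sf, num_streams, interval_sec):
    """TPC-H Throughput@Size = (S * 22 * 3600 / Ts) * SF, where S is the
    number of query streams and Ts the measurement interval in seconds."""
    return (num_streams * 22 * 3600.0 / interval_sec) * sf

# Illustrative numbers: SF=100, every timing 30s, 5 streams, 2-hour interval.
print(power_at_size(100, [30.0] * 24))        # 12000.0
print(throughput_at_size(100, 5, 7200.0))     # 5500.0
```

Note how the geometric mean makes the Power metric equally sensitive to every timing, while Throughput depends only on the total interval, which is why the longest-running query dominates it.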
[Diagram: run layout. The Power run executes query stream 00 (queries in the prescribed order, e.g. 14, 2, 9, 20, 6 … 5, 7, 12), with refresh function 1 (inserts into LINEITEM & ORDERS) before it and refresh function 2 (deletes from LINEITEM & ORDERS) after it. The Throughput run executes query streams 01..N in parallel, alongside refresh streams with N pairs of RF1 & RF2.]
Scale Factor   Number of streams
100            5
300            6
1000           7
3000           8
10000          9
30000          10
100000         11
Outline
TPC-H Schema overview
TPC-H Performance measurements
Partner engagement
TPC-H where is it today
TPC-H challenges
Looking ahead
Q&A
TPC-H Performance measurements

Invest in tools to analyze plans; some consider plan analysis an art, but breaking the plan down into key metrics helps a lot.
Capture enough information in the execution plan to unveil performance issues:
Estimated vs. actual number of rows
Amount of data spilled per disk
Rows touched vs. rows qualified during scan
Logical vs. physical reads
CPU & memory consumed per plan operator
Skew in the number of rows processed per thread per operator
Instrument the code to provide cycles per row for key scenarios: scan, aggregate, join
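A minimal sketch of that kind of per-operator instrumentation, with hypothetical names; it uses wall-clock nanoseconds as a stand-in for a hardware cycle counter:

```python
import time

class OpStats:
    """Per-operator counters: elapsed time (ns, as a proxy for CPU cycles)
    and rows processed, so we can report time-per-row for scan/agg/join."""
    def __init__(self, name):
        self.name, self.ns, self.rows = name, 0, 0

    def run(self, fn, rows):
        t0 = time.perf_counter_ns()
        out = fn(rows)
        self.ns += time.perf_counter_ns() - t0
        self.rows += len(rows)
        return out

    def ns_per_row(self):
        return self.ns / self.rows if self.rows else 0.0

# Wrap a toy "scan with filter" operator and report its cost per row.
scan = OpStats("scan")
data = list(range(100_000))
kept = scan.run(lambda rows: [r for r in rows if r % 10 == 0], data)
print(f"{scan.name}: {scan.ns_per_row():.1f} ns/row, {len(kept)} rows kept")
```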
[Diagram: performance loop. Set performance goals → measure performance → look at SMP & MPP plans → check CPU & IO utilization → fix performance issues → repeat.]
TPC-H Performance measurements

Scalability within a single server:
Vary the number of processors
Vary the scale factor: 100G, 300G
Identify queries that don't scale linearly
Capture: CPU & IO utilization per query with at least a 1-second sampling rate; hot functions and waits, if any; CPI, ideally per function; execution plans
Get busy crunching the data

Scalability across multiple servers:
Vary the number of servers in the system
Vary the amount of data per server
Capture: CPU, disk & network IO; distributed plans
Look for queries with excessive cross-node traffic
Identify suboptimal plans where predicates/aggregates are not pushed down
[Diagram: SMP scaling, data scaling, and MPP scaling overlap; their intersection is where the more focused performance effort goes.]
Outline
TPC-H Schema overview
TPC-H Performance measurements
Partner engagement
TPC-H where is it today
TPC-H challenges
Looking ahead
Q&A
Partner engagements

Partner engagement can be considered one of the secret sauces of high-performing software.
Hardware/infrastructure partners tend to have a vested interest in showcasing the performance and scalability of their products.
Engagements allow software companies to leverage hardware expertise and gain access to low-level tools that are not publicly available (under NDA).
Partners occasionally provide hardware for performance benchmarks, prototype evaluation, and release publications.
Partners can be a great assist for:
Providing low-level analysis
Collaborating on publications, benchmarks, proofs of concept, etc.
Providing hardware for performance testing, evaluation, and improvement (large-scale experiments are expensive)
Partner engagements: NVRAM

NVRAM: random-access memory that retains its contents when power is turned off (non-volatile), in contrast to dynamic random-access memory (DRAM).
"Promises":
Latency within the same order of magnitude as DRAM
Cheaper than SSDs
10+ TB of NVRAM in a 2-socket system within the next 4 years
Still in the prototype phase
Could eliminate the need for spinning disks or SSDs altogether
In-memory databases are likely to be early adopters of such technology.
Good reading:
http://research.microsoft.com/en-us/events/trios/trios13-final5.pdf
http://www.hpl.hp.com/techreports/2013/HPL-2013-78R1.pdf
Partner engagements: Diablo Technologies

SSD in a DRAM slot, with DIMM capacities of 200GB & 400GB; the technology is rebranded by IBM and is VMware Ready.
http://www.diablo-technologies.com/
Outline
TPC-H Schema overview
TPC-H Performance measurements
Partner engagement
TPC-H where is it today
TPC-H challenges
Looking ahead
Q&A
TPC-H where is it today

Why do benchmarks? To stimulate technological advancements.
Why TPC-H? It introduces a set of technological challenges whose resolution significantly improves the performance of the product.
As a benchmark, is it still relevant to current DW applications?
Gartner Magic Quadrant reference: "Vectorwise delivered leading 1TB non-clustered TPC Benchmark H (TPC-H) results in 2012"
Big players are Oracle, Vectorwise, Microsoft, Exasol and ParAccel.
The most significant innovation came from:
Kickfire (acquired by Teradata): FPGA-based "Query Processor Module" with an instruction set tuned for database operations
ParAccel (acquired by Actian): shared-nothing architecture with a columnar orientation, adaptive compression, and a memory-centric design
Exasol: column-oriented storage with proprietary in-memory compression methods; the database also does automatic self-optimization (creates indexes and stats, distributes tables, etc.)
So where does it come in handy?
Identify system bottlenecks
Push performance-focused features into the product
The TPC-H schema is heavily used for ETL and virtualization benchmarks
Introduces lots of interesting challenges to the DBMS
What about TPC-DS? It has a more realistic ETL process and a snowflake schema, but no one has published a TPC-DS benchmark yet.
TPC-H where is it today

The number of publications is on the decline:

Year  Publications
1999  9
2000  1
2001  5
2002  12
2003  31
2004  15
2005  42
2006  31
2007  20
2008  13
2009  15
2010  10
2011  20
2012  5
2013  6

[Chart: number of TPC-H publications per year]
First cloud-based benchmark: when will we see one?
Outline
TPC-H Schema overview
TPC-H Performance measurements
Partner engagement
TPC-H where is it today
TPC-H challenges
Looking ahead
Q&A
TPC-H challenges : Aggregation

Almost all TPC-H queries do aggregation.
Unless there is a sorted index (B-tree) on the GROUP BY column, aggregating in a hash table makes more sense than ordered aggregation.
Correctly sizing the hash table dictates performance: if cardinality estimation underestimates the number of distinct values, lots of chaining occurs and the hash table can eventually spill to disk; if it overestimates, resources are not used optimally.
For low distinct counts, building a hash table per thread (local) and then doing a global aggregation improves performance.
For a small GROUP BY on strings, represent the group-by expressions as integers (indexes into an array) instead of using a hash table (reduces cache footprint).
For a GROUP BY on a primary key (C_CUSTKEY) there is no need to include other CUSTOMER columns in the hash table.
The main benefit of PK/FK constraints is aggregate optimizations.
Queries sensitive to aggregation performance: 1, 3, 4, 10, 13, 18, 20, 21
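The local-then-global pattern above can be sketched as follows. The data and function name are illustrative; each partition stands in for the rows one thread would see:

```python
from collections import defaultdict

def local_then_global(partitions):
    """Aggregate each 'thread' partition into its own small hash table,
    then merge the per-thread tables into one global table. With few
    distinct groups the local tables stay tiny and cache-resident."""
    local_tables = []
    for part in partitions:                  # one local hash table per thread
        ht = defaultdict(float)
        for group, value in part:
            ht[group] += value
        local_tables.append(ht)
    global_ht = defaultdict(float)           # merge step: cheap when the
    for ht in local_tables:                  # distinct count is low
        for group, total in ht.items():
            global_ht[group] += total
    return dict(global_ht)

rows = [("A", 1.0), ("B", 2.0), ("A", 3.0), ("B", 4.0)]
print(local_then_global([rows[:2], rows[2:]]))   # {'A': 4.0, 'B': 6.0}
```

This is exactly why Q18 is listed as a case where local aggregation hurts: with ~1.5 billion distinct L_ORDERKEY groups the local tables are nearly as large as the global one, so the merge step adds cost without saving anything.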
TPC-H challenges : Aggregation

Q1: Reduces 6 billion rows to 4. Sensitive to string matching. Benefits from doing local aggregation.
Q10: GROUP BY on most CUSTOMER columns. If a PK on C_CUSTKEY exists, C_CUSTKEY alone could be used for aggregation. Further optimization: push down the aggregate on O_CUSTKEY and the TOP.
Q18: GROUP BY on L_ORDERKEY results in 1.5 billion rows (a 4x reduction). Local aggregation usually hurts performance. The hash table for aggregation alone can take 25GB of RAM.
TPC-H challenges : Joins

Select a schema which leverages locality. Example: ORDERS x LINEITEM on L_ORDERKEY = O_ORDERKEY becomes collocated by hash partitioning both tables on ORDERKEY.
Q5, Q9 and Q18 can spill and perform badly if the correct plan is not picked.
Q9 causes over-the-network communication on MPP systems unless PARTSUPP, PART and SUPPLIER are replicated, which is not feasible for large scale factors.
TPC-H joins are highly selective, hence efficient bloom filters are necessary.
Simplistic guide: find the most selective filter/aggregation, and that is where you start.
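A minimal sketch of the bloom-filter idea for selective joins. The class and parameters are illustrative, not any engine's actual implementation: the build side inserts its join keys, and the probe side is pre-filtered before rows ever reach the (much more expensive) hash join or network shuffle:

```python
import hashlib

class Bloom:
    """Tiny Bloom filter: k hash probes into an m-bit array. No false
    negatives; rare false positives, which the real join later discards."""
    def __init__(self, m_bits=1 << 16, k=3):
        self.m, self.k, self.bits = m_bits, k, bytearray(m_bits // 8)

    def _probes(self, key):
        # Derive k probe positions from one 16-byte hash of the key.
        h = hashlib.blake2b(str(key).encode(), digest_size=16).digest()
        for i in range(self.k):
            yield int.from_bytes(h[i * 4:i * 4 + 4], "little") % self.m

    def add(self, key):
        for p in self._probes(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._probes(key))

build_keys = {10, 20, 30}            # e.g. qualifying O_ORDERKEY values
bf = Bloom()
for key in build_keys:
    bf.add(key)
probe = [10, 11, 20, 99, 30]         # e.g. LINEITEM L_ORDERKEY values
survivors = [key for key in probe if bf.might_contain(key)]
print(survivors)                     # always includes 10, 20, 30
```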
TPC-H challenges : Expression evaluation

Arithmetic operation performance:
Store decimals as integers and save some bits (19123 vs. 191.23)
Rebase some of the columns to use fewer bits
Keep data in the most compact form to best exploit SIMD instructions

Detecting common sub-expressions:
sum(l_extendedprice) as sum_base_price,
sum(l_extendedprice*(1-l_discount)) as sum_disc_price,
sum(l_extendedprice*(1-l_discount)*(1+l_tax)) as sum_charge

Expression filter push-down (Q7, Q19):
Q7: take the superset or UNION of the filters and push it down to the scan
Q19: take the union of the individual predicates

Column projection vs. expression evaluation:
Cardinality estimates should help decide whether to project columns A & B or (A * (1 - B)) before a filter on C
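The decimals-as-integers point can be sketched with the Q1 expressions above. The helper name and scale are illustrative; prices and discounts are stored at scale 100 (cents), so the sum runs entirely in integer arithmetic and is rescaled only at the end:

```python
# Prices stored as scaled integers (cents): 191.23 -> 19123. Sums and
# products stay in integer arithmetic; rescale only when formatting.
SCALE = 100

def to_fixed(s):
    """Parse a decimal string into a scale-100 integer: '191.23' -> 19123."""
    whole, frac = s.split(".")
    return int(whole) * SCALE + int(frac.ljust(2, "0"))

prices    = [to_fixed(p) for p in ("191.23", "10.00", "0.77")]
discounts = [to_fixed(d) for d in ("0.10", "0.00", "0.05")]

# sum(price * (1 - discount)): multiplying two scale-100 values gives a
# scale-10000 value, so divide by SCALE*SCALE once at the very end.
total = sum(p * (SCALE - d) for p, d in zip(prices, discounts))
print(total / (SCALE * SCALE))    # 182.8385
```

Besides avoiding per-row decimal parsing, the compact integer representation is exactly what makes the SIMD processing mentioned above possible: four 32-bit scaled integers fit in one 128-bit register.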
TPC-H challenges : Correlated subqueries

Push predicates down into the subquery when applicable.
When subqueries are flattened, batch processing outperforms row-by-row execution.
Buffer overlapping intermediate results.
Partial query reuse.
Challenging for MPP systems (don't redistribute or shuffle the same data twice).
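A sketch of why flattening pays off, on hypothetical data: the correlated form re-runs the subquery once per outer row, while the flattened form computes every group's aggregate in a single pass and then filters in batch:

```python
from collections import defaultdict

orders = [("c1", 50), ("c1", 150), ("c2", 30), ("c2", 30)]  # (custkey, total)

# Correlated form: for each order, re-evaluate "avg(total) of this
# customer's orders" -- one subquery execution per outer row.
def avg_for(cust):
    vals = [t for c, t in orders if c == cust]
    return sum(vals) / len(vals)

row_by_row = [(c, t) for c, t in orders if t > avg_for(c)]

# Flattened form: one grouped pass computes every customer's average,
# then a single batch filter consumes it.
sums = defaultdict(lambda: [0, 0])
for c, t in orders:
    sums[c][0] += t
    sums[c][1] += 1
avgs = {c: s / n for c, (s, n) in sums.items()}
batched = [(c, t) for c, t in orders if t > avgs[c]]

print(batched)   # same answer as row_by_row, in one pass over the data
```

On an MPP system the flattened form also means the grouped aggregate is shuffled once, instead of the same data being redistributed for every outer row.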
TPC-H challenges : Parallelism and concurrency

Current 2P servers have 48+ cores, ½+ TB of RAM and 10+ GB/sec of disk IO bandwidth; even within a single box the engine needs to provide meaningful scaling.
Further sub-partitioning the data on a single server alleviates single-server scaling problems.
TPC-H queries tend to use lots of workspace memory for joins and aggregations.
Precise and dynamic memory allocation keeps queries from spilling to disk under high concurrency.
TPC-H challenges : Scan performance

Disk read performance is crucial; validate that when the system is not CPU bound, the IO subsystem is used efficiently.
The ability to filter out pages or segments from the scan is crucial.
In-memory scan performance can be increased by decreasing the search scope, and thereby the amount of data that needs to be streamed from main memory to the CPU.
TPC-H challenges : Scan performance

Store dictionaries in sorted order or in a BST to:
Compress the filter or predicate and do numeric comparisons, as opposed to decompressing and matching on strings
Quickly validate whether a value exists in the segment at all
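A sketch of both points with a sorted dictionary (names and data are illustrative): the predicate string is translated to its dictionary code once, after which the scan compares integers only, and an absent value is detected without touching the segment:

```python
from bisect import bisect_left

# Dictionary-encoded column: row values are stored as codes into a
# sorted dictionary of the distinct strings.
dictionary = sorted(["AIR", "MAIL", "RAIL", "SHIP", "TRUCK"])
codes = [0, 4, 1, 3, 0, 2]           # the segment, as dictionary codes

def code_of(value):
    """Binary-search the sorted dictionary; return the code, or None if the
    value is absent -- which also means the whole segment can be skipped."""
    i = bisect_left(dictionary, value)
    return i if i < len(dictionary) and dictionary[i] == value else None

pred = code_of("TRUCK")              # compress the predicate once...
matches = [i for i, c in enumerate(codes) if c == pred]   # ...compare ints
print(matches)                       # row positions where column == 'TRUCK'
```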
TPC-H challenges : Scan performance

What do we do for highly selective filters? Implement paged indexes for the columns of interest: partition a column into pages and store bitmap indexes for each compressed value, where the bits reflect which rows have the respective value. Instead of scanning the entire segment for a matching row, we only read the blocks that have matching values (bits set). http://db.disi.unitn.eu/pages/VLDBProgram/pdf/IMDM/paper2.pdf
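The paged bitmap index can be sketched as follows (a toy page size and hypothetical helper names; real pages hold thousands of rows):

```python
PAGE = 4  # rows per page (tiny, for illustration)

def build_paged_bitmaps(codes):
    """Per page, one bitmap per distinct value: bit i is set when row i of
    the page holds that value. A lookup then touches only the pages whose
    bitmap for the sought value is non-empty."""
    pages = []
    for start in range(0, len(codes), PAGE):
        bm = {}
        for i, c in enumerate(codes[start:start + PAGE]):
            bm[c] = bm.get(c, 0) | (1 << i)
        pages.append(bm)
    return pages

def lookup(pages, code):
    hits = []
    for p, bm in enumerate(pages):
        bits = bm.get(code, 0)       # 0 -> page skipped entirely
        row = 0
        while bits:
            if bits & 1:
                hits.append(p * PAGE + row)
            bits >>= 1
            row += 1
    return hits

codes = [7, 3, 7, 1, 2, 2, 5, 7]     # two pages of four rows each
print(lookup(build_paged_bitmaps(codes), 7))   # [0, 2, 7]
```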
In MPP, a single SQL statement results in multiple SQL statements that get executed locally on each node.
Some TPC-DS queries can result in 20+ SQL statements that need to be executed locally on each leaf node.
Streaming data between steps should result in better performance, but there are cases when this strategy fails.
Placing data on disk after each step allows the query optimizer to reevaluate the plan.
TPC-H challenges : Intermediate steps in MPP

Query:
select count(*) from PART, PARTSUPP, LINEITEM where P_BRAND='NIKE' and PS_COMMENT like '%bla%' and P_PARTKEY=PS_PARTKEY and L_PARTKEY=PS_PARTKEY group by P_BRAND
Schema: PART distributed on P_PARTKEY, PARTSUPP distributed on PS_PARTKEY, LINEITEM distributed on L_ORDERKEY
Plan sketch: create bloom filter BF1 on PART, push the filter onto PARTSUPP and create BF2, replicate the bloom filter on all leaf nodes, apply the filter on LINEITEM, and only shuffle the qualifying rows.
The optimizer should choose between semi-join reduction and replicating PART x PARTSUPP.
Multiple copies of a set of columns, distributed differently, can improve performance for such cases, but at a high cost.
TPC-H challenges : Improving join performance for incompatible joins
Outline
TPC-H Schema overview
TPC-H Performance measurements
Partner engagement
TPC-H where is it today
TPC-H challenges
Looking ahead
Q&A
Looking ahead

SQL to map-reduce jobs? Crunching data in a relational database is always faster than Hadoop; bring data from Hadoop into columnar format and perform analytics with efficient generated code.
Full integration with analytics tools such as SAS, R, Tableau, Excel, etc.
Support PL/SQL syntax (compete with Oracle).
Eliminate the aggregating node to reduce system cost for a small number of nodes; Exasol does this.
Competitive analysis

TPC-H Q1 analysis, Sec/GB/Thread (lower is better), assuming all processors have the same speed:

System                                           Sec/GB/Thread
Exasol 1TB, 240 threads, 20 processors           1.44
Exasol 1TB, 768 threads, 64 processors           1.4592
Exasol 3TB, 960 threads, 80 processors           1.504
MemSQL 83GB, 480 threads, 40 sockets             46.66544784
MS SQL Server 10TB, 160 threads, 8 processors    8.1152
Oracle 11c 10TB, 512 threads, 4 processors       40.68864
References:
http://www.tpc.org/tpch/results/tpch_perf_results.asp
http://www.esg-global.com/lab-reports/memsqle28099s-distributed-in-memory-database/
Appendix
GMQ 2013
http://www.gartner.com/technology/reprints.do?id=1-1DU2VD4&ct=130131&st=sb
GMQ 2014
http://www.gartner.com/technology/reprints.do?id=1-1M9YEHW&ct=131028&st=sb
TPC-H column store

Avoid virtual function calls and branching; use templates.
Scan usually dominates the CPU profile.
Vector/batch processing is a must.
If done correctly, the code is very sensitive to branching and data dependencies; exploit instruction-level parallelism when possible.
Use SIMD instructions; leverage existing libraries that encapsulate the complexity of SSE instructions:

  // define and initialize integer vectors a and b
  Vec4i a(10,11,12,13);
  Vec4i b(20,21,22,23);
  // add the two vectors
  Vec4i c = a + b;

http://www.agner.org/optimize/vectorclass.pdf
TPC-H Plans

Behold the power of the optimizer: if the plan is wrong, you are doomed. A very good read on TPC-H Q8:
http://www.slideshare.net/GraySystemsLab/pass-summit-2010-keynote-david-dewitt
JSON documents

The most efficient way to store JSON documents: great compression and quick retrieval. Ask me how to ….
Q1

Challenges:
Used as a benchmark for computational power
Arithmetic operation performance
Aggregating into the same hash buckets
Common subexpression pattern matching
Sensitive to scan performance
String matching for aggregation (could do the matching on the compressed format)

select
    l_returnflag, l_linestatus,
    sum(l_quantity) as sum_qty,
    sum(l_extendedprice) as sum_base_price,
    sum(l_extendedprice*(1-l_discount)) as sum_disc_price,
    sum(l_extendedprice*(1-l_discount)*(1+l_tax)) as sum_charge,
    avg(l_quantity) as avg_qty,
    avg(l_extendedprice) as avg_price,
    avg(l_discount) as avg_disc,
    count(*) as count_order
from lineitem
where l_shipdate <= date '1998-12-01' - interval '[DELTA]' day (3)
group by l_returnflag, l_linestatus
order by l_returnflag, l_linestatus;
Q2

Challenges:
Correlated subquery
Push-down of predicates into the correlated subquery
Highly selective (segment size plays a big role)
Tricky to generate the optimal plan
Plan performance varies a lot depending on which tables are partitioned and which are replicated

select
    s_acctbal, s_name, n_name, p_partkey, p_mfgr,
    s_address, s_phone, s_comment
from part, supplier, partsupp, nation, region
where p_partkey = ps_partkey
  and s_suppkey = ps_suppkey
  and p_size = [SIZE]
  and p_type like '%[TYPE]'
  and s_nationkey = n_nationkey
  and n_regionkey = r_regionkey
  and r_name = '[REGION]'
  and ps_supplycost = (
      select min(ps_supplycost)
      from partsupp, supplier, nation, region
      where p_partkey = ps_partkey
        and s_suppkey = ps_suppkey
        and s_nationkey = n_nationkey
        and n_regionkey = r_regionkey
        and r_name = '[REGION]')
order by s_acctbal desc, n_name, s_name, p_partkey;
Q3

Challenges:
Collocated join between ORDERS & LINEITEM
Detect the correlation between shipdate and orderdate
Bitmap filters on LINEITEM are necessary
Consider replicating (select c_custkey from customer where c_mktsegment = '[SEGMENT]')

select TOP 10
    l_orderkey,
    sum(l_extendedprice*(1-l_discount)) as revenue,
    o_orderdate, o_shippriority
from customer, orders, lineitem
where c_mktsegment = '[SEGMENT]'
  and c_custkey = o_custkey
  and l_orderkey = o_orderkey
  and o_orderdate < date '[DATE]'
  and l_shipdate > date '[DATE]'
group by l_orderkey, o_orderdate, o_shippriority
order by revenue desc, o_orderdate;